Data Prep & Code

Regression Code

Data Preparation

1. Initial Dataset Overview

• We start by loading the dataset (weatherHistory.csv), which contains various weather attributes like temperature, humidity, wind speed, and a binary Precip Type (rain or snow).

• An initial sample of the data is displayed to understand its structure and identify potential cleaning requirements.
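A minimal sketch of this loading step, assuming pandas is used. The helper name load_weather and the tiny inline sample are illustrative assumptions, not part of the original notebook; with the real data the call would be load_weather("weatherHistory.csv").

```python
import io

import pandas as pd


def load_weather(source):
    """Read the weather CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up three-row stand-in so the sketch runs without the real file.
sample_csv = io.StringIO(
    "Precip Type,Temperature (C),Humidity\n"
    "rain,9.5,0.89\n"
    "snow,-1.2,0.93\n"
    ",4.0,0.80\n"
)
df = load_weather(sample_csv)

print(df.head())        # first rows, to inspect the structure
print(df.isna().sum())  # missing values per column, to spot cleaning needs
```

Counting missing values per column is one quick way to find the cleaning requirements mentioned above.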

2. Data Cleaning and Encoding

• The Precip Type column contains missing values, which could affect model training. We handle this by removing rows with missing values in this column.

• We also encode Precip Type as binary: 1 for “rain” and 0 for “snow”. This encoding is essential for using it as the target variable in classification models.

• After cleaning, a sample of the cleaned data is shown to verify the transformation.
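The cleaning and encoding steps above might look roughly like this in pandas. The small DataFrame is a made-up stand-in for the real data, and the variable name clean is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the loaded data, with one missing Precip Type.
df = pd.DataFrame({
    "Precip Type": ["rain", "snow", np.nan, "rain"],
    "Temperature (C)": [9.5, -1.2, 4.0, 12.3],
})

# Remove rows whose target label is missing.
clean = df.dropna(subset=["Precip Type"]).copy()

# Encode the target: 1 for "rain", 0 for "snow".
clean["Precip Type"] = clean["Precip Type"].map({"rain": 1, "snow": 0})

print(clean.head())  # verify the transformation
```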

3. Feature Selection

• For the models, we select the most relevant numerical features: Temperature (C), Humidity, Wind Speed (km/h), Visibility (km), and Pressure (millibars).

• The target variable, Precip Type, is the binary label we’ll aim to predict.
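As a sketch, the feature and target selection could be written as follows. The column names are taken from the list above; the names FEATURES, TARGET, X, and y, the extra Summary column, and the two-row DataFrame are illustrative assumptions.

```python
import pandas as pd

FEATURES = [
    "Temperature (C)",
    "Humidity",
    "Wind Speed (km/h)",
    "Visibility (km)",
    "Pressure (millibars)",
]
TARGET = "Precip Type"

# Made-up stand-in for the cleaned data, with one column we do not select.
df = pd.DataFrame({
    "Summary": ["Partly Cloudy", "Overcast"],
    "Precip Type": [1, 0],
    "Temperature (C)": [9.5, -1.2],
    "Humidity": [0.89, 0.93],
    "Wind Speed (km/h)": [14.1, 3.9],
    "Visibility (km)": [15.8, 9.9],
    "Pressure (millibars)": [1015.1, 1021.6],
})

X = df[FEATURES]  # numerical input features
y = df[TARGET]    # binary label: 1 = rain, 0 = snow
```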

4. Splitting Data for Training and Testing

• We split the data into training and testing sets, with 70% of the data for training and 30% for testing. This separation allows us to evaluate the models on unseen data to gauge their predictive accuracy.

• A sample of the training and testing data is displayed to provide a clear view of what the models will train and test on.
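A 70/30 split like the one described can be sketched with scikit-learn's train_test_split. The random_state value and the synthetic 20-row data are assumptions made only so the example is reproducible and self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned data (20 rows, two features).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Temperature (C)": rng.uniform(-10, 25, 20),
    "Humidity": rng.uniform(0.3, 1.0, 20),
})
# Toy labelling rule for the example only: rain (1) above freezing, else snow (0).
y = (X["Temperature (C)"] > 0).astype(int).rename("Precip Type")

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.head())
print(X_test.head())
```

Fixing random_state makes the split repeatable between runs; the source does not say whether a seed was used.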

Model Implementation

1. Logistic Regression

• Logistic Regression is chosen as one of the models to predict the binary target (Precip Type).

• The model is trained using the training data and then evaluated on the test data. Logistic Regression is suitable here as it provides probability-based classification, allowing us to see how confident it is in predicting “rain” or “snow.”

• The resulting predictions are compared to the actual test labels, with performance measured through a confusion matrix and accuracy score.
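The training and evaluation steps above could be sketched with scikit-learn as shown here. The synthetic data, the toy labelling rule, max_iter=1000, and random_state=42 are all assumptions for illustration, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five selected features.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Temperature (C)": rng.uniform(-10, 25, n),
    "Humidity": rng.uniform(0.3, 1.0, n),
    "Wind Speed (km/h)": rng.uniform(0, 40, n),
    "Visibility (km)": rng.uniform(0, 16, n),
    "Pressure (millibars)": rng.uniform(990, 1030, n),
})
y = (X["Temperature (C)"] > 0).astype(int)  # toy rule: 1 = rain, 0 = snow

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train on the training set, then predict on the held-out test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Class probabilities; column 1 is the estimated probability of "rain".
proba = model.predict_proba(X_test)

# Rows are actual classes, columns are predicted classes (0 = snow, 1 = rain).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```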

2. Multinomial Naïve Bayes

• Multinomial Naïve Bayes is designed for count-like data and requires non-negative feature values, but features such as Temperature (C) can be negative. We therefore apply Min-Max Scaling to the features, bringing all values into the [0, 1] range.

• The scaled training and testing sets are used to train and evaluate a Multinomial Naïve Bayes model.

• This model also provides predictions on the test set, and its performance is measured with a confusion matrix and accuracy score, allowing for comparison with the logistic regression results.
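A sketch of the scaling and Naïve Bayes steps, assuming scikit-learn's MinMaxScaler and MultinomialNB. Fitting the scaler on the training set only and clipping the scaled test values to [0, 1] are assumptions of this sketch and are not stated in the source, as are the synthetic data and random_state.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the selected features (not the real data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Temperature (C)": rng.uniform(-10, 25, n),
    "Humidity": rng.uniform(0.3, 1.0, n),
    "Wind Speed (km/h)": rng.uniform(0, 40, n),
})
y = (X["Temperature (C)"] > 0).astype(int)  # toy rule: 1 = rain, 0 = snow

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale to [0, 1]; the scaler learns min and max from the training set only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Test values outside the training range would leave [0, 1], so clip them
# (an extra precaution added in this sketch).
X_test_scaled = np.clip(scaler.transform(X_test), 0.0, 1.0)

nb = MultinomialNB()
nb.fit(X_train_scaled, y_train)
y_pred_nb = nb.predict(X_test_scaled)

cm_nb = confusion_matrix(y_test, y_pred_nb, labels=[0, 1])
acc_nb = accuracy_score(y_test, y_pred_nb)
print(cm_nb)
print(acc_nb)
```

Because the confusion matrix and accuracy are computed the same way as for logistic regression, the two models can be compared directly on the same test set.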
