1. Initial Dataset Overview
• We start by loading the dataset (weatherHistory.csv), which contains various weather attributes like temperature, humidity, wind speed, and a binary Precip Type (rain or snow).
• An initial sample of the data is displayed to understand its structure and identify potential cleaning requirements.
2. Data Cleaning and Encoding
• The Precip Type column contains missing values, which could affect model training. We handle this by removing rows with missing values in this column.
• We also encode Precip Type as binary: 1 for “rain” and 0 for “snow”. This encoding is essential for using it as the target variable in classification models.
• After cleaning, a sample of the cleaned data is shown to verify the transformation.
3. Feature Selection
• For the models, we select the most relevant numerical features: Temperature (C), Humidity, Wind Speed (km/h), Visibility (km), and Pressure (millibars).
• The target variable, Precip Type, is the binary label we’ll aim to predict.
4. Splitting Data for Training and Testing
• We split the data into training and testing sets, with 70% of the data for training and 30% for testing. This separation allows us to evaluate the models on unseen data to gauge their predictive accuracy.
• A sample of the training and testing data is displayed to provide a clear view of what the models will train and test on.
1. Logistic Regression
• Logistic Regression is chosen as one of the models to predict the binary target (Precip Type).
• The model is trained using the training data and then evaluated on the test data. Logistic Regression is suitable here as it provides probability-based classification, allowing us to see how confident it is in predicting “rain” or “snow.”
• The resulting predictions are compared to the actual test labels, with performance measured through a confusion matrix and accuracy score.
2. Multinomial Naïve Bayes
• Since Naïve Bayes typically works well with count or scaled data, we apply Min-Max Scaling to the features, bringing all values into a [0, 1] range for compatibility.
• The scaled training and testing sets are used to train and evaluate a Multinomial Naïve Bayes model.
• This model also provides predictions on the test set, and its performance is measured with a confusion matrix and accuracy score, allowing for comparison with the logistic regression results.