Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables, assuming a linear relationship. It predicts the output as a weighted sum of the inputs plus a bias term, and it is fit by minimizing the residual sum of squares between observed and predicted values.
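In symbols, for inputs $x_1, \dots, x_p$ with weights $\beta_1, \dots, \beta_p$ and bias $\beta_0$, the model and its fitting objective are:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad \mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$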
Logistic regression is a supervised machine learning algorithm used for binary classification, predicting the probability that an instance belongs to a specific class. It employs the sigmoid function to map predicted values to probabilities between 0 and 1, creating a linear decision boundary in the feature space. Widely applied in fields like healthcare and finance, it is valued for its interpretability and effectiveness in tasks such as disease diagnosis and credit scoring.
Both linear and logistic regression analyze relationships between variables and make predictions. However, linear regression predicts continuous outcomes, while logistic regression predicts categorical outcomes (usually binary). Linear regression estimates its coefficients with ordinary least squares, while logistic regression uses maximum likelihood estimation. Linear regression models the outcome directly as a linear function of the inputs, while logistic regression passes a linear combination of the inputs through a logistic function to model probabilities, which makes it better suited for classification tasks. Both methods are fundamental in statistics and machine learning: linear regression is often used in forecasting and trend analysis, while logistic regression is commonly applied in fields like medicine and marketing for binary outcome predictions.
Yes, logistic regression uses the sigmoid function. The sigmoid function, also known as the logistic function, maps any real-valued number to a value between 0 and 1, which is interpreted as a probability. This S-shaped curve is crucial in transforming the linear combination of inputs into a probability output. The sigmoid function is defined mathematically as f(x) = 1 / (1 + e^(-x)), where e is the base of natural logarithms. This function's ability to compress input values into the 0-1 range makes it particularly useful for binary classification problems, allowing logistic regression to estimate the probability of an instance belonging to a specific class.
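As a quick illustration (a minimal NumPy sketch, not code from the analysis), the sigmoid compresses any real input into the open interval (0, 1):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1,
# and sigmoid(0) is exactly 0.5.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # [0.00669285 0.5 0.99330715]
```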
Maximum likelihood estimation is used in logistic regression to find the best-fitting coefficients for the model. It works by iteratively adjusting the coefficient values to improve the fit of the log odds, maximizing the likelihood of observing the given data under the model's assumptions. This process finds the set of parameters that make the observed data most probable, effectively minimizing the difference between predicted and actual outcomes. The likelihood function used in logistic regression is the product of the predicted probabilities for each observation, and the goal is to maximize this function or, equivalently, its logarithm (the log-likelihood) to determine the optimal coefficients.
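Concretely, for binary labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, the log-likelihood that is maximized is:

$$\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big], \qquad p_i = \frac{1}{1 + e^{-\beta^\top x_i}}$$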
1. Dataset Overview:
The dataset comprises coal shipment records with key attributes such as ash content, heat content, price, quantity, and sulfur content, along with a categorical target variable coalRankDescription. For this analysis, the target variable has been transformed into a binary classification problem, where:
1 indicates "Bituminous" coal.
0 represents other coal ranks.
This binary classification enables us to apply logistic regression effectively.
2. Binary Target Variable Creation:
The categorical variable coalRankDescription was converted into a binary variable called binary_coal_rank. This transformation simplifies the prediction task for logistic regression.
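A minimal sketch of this transformation, assuming the records are loaded into a pandas DataFrame named df (the file name is a placeholder; the column name coalRankDescription comes from the dataset description above):

```python
import pandas as pd

# Load the coal shipment records (file name is a placeholder)
df = pd.read_csv("coal_shipments.csv")

# binary_coal_rank: 1 for "Bituminous", 0 for all other coal ranks
df["binary_coal_rank"] = (df["coalRankDescription"] == "Bituminous").astype(int)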
3. Train-Test Split:
The dataset was split into training (70%) and testing (30%) sets to evaluate the model's generalization ability. This division ensures the model is trained on one subset of the data and tested on a disjoint subset.
[INITIAL DATASET: preview table omitted]
[PROCESSED DATASET: preview table omitted]
The dataset was split into training (70%) and testing (30%) sets using the train_test_split function from the sklearn.model_selection module.
Features (X) were standardized using StandardScaler to ensure uniform scaling.
The splitting process was randomized with a fixed random_state value to make results reproducible.
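The splitting and scaling steps might look like the following sketch, continuing from the DataFrame above (the feature column names and random_state=42 are assumptions; the report only states that a fixed seed was used):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature columns are assumed names based on the dataset description
feature_cols = ["ash_content", "heat_content", "price", "quantity", "sulfur_content"]
X = df[feature_cols]
y = df["binary_coal_rank"]

# 70/30 disjoint split; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit the scaler on the training set only, then apply it to both subsets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```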
IMPORTANCE OF CREATING A DISJOINT SPLIT
Prevention of Overfitting:
When a model is trained on a dataset, it learns patterns and relationships specific to that data.
If the same data is used for both training and evaluation, the model's performance might appear artificially high because it has already "seen" the test data during training. This is a form of data leakage, and it masks overfitting.
A disjoint split ensures that the model is tested on completely unseen data, giving a realistic estimate of how it will perform on new, unseen data in the real world.
Replicates Real-World Scenarios:
In practical applications, models encounter new data that they weren’t trained on. The test set simulates this scenario, providing a realistic measure of model performance.
Reliable Performance Metrics:
Metrics like accuracy, precision, recall, and F1-score calculated on the test set provide unbiased estimates of how the model will perform in real-world use cases.
1. Logistic Regression
Logistic Regression was selected as one of the models to classify the binary target variable (binary_coal_rank), which indicates whether the coal shipment is of "Bituminous" type or not.
The model was trained using the training dataset and evaluated on the test dataset. Logistic Regression was chosen because it works well with continuous, standardized numerical data and provides interpretable probability-based predictions for each class.
The model's predictions were compared with the actual test labels, and performance was assessed using a confusion matrix and accuracy score. This analysis highlights the model's effectiveness in distinguishing between the two coal rank categories.
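A sketch of this training and evaluation flow, reusing the scaled arrays from the preprocessing step (default hyperparameters are an assumption; the report does not specify them):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Train on the standardized training features
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

# Evaluate on the disjoint test set
y_pred_lr = log_reg.predict(X_test_scaled)
print(confusion_matrix(y_test, y_pred_lr))
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
```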
2. Multinomial Naive Bayes
Multinomial Naive Bayes (MNB) was applied to the same dataset as a comparative model. MNB typically works best with discrete or count data; hence, the continuous features were scaled and adjusted for compatibility.
The scaled training and test datasets were used to train and evaluate the model. Special preprocessing steps ensured non-negative data values, which are required for MNB to function effectively.
The model's predictions on the test set were evaluated using a confusion matrix and accuracy score. This performance comparison provides insights into how MNB handles the task relative to Logistic Regression.
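One way to satisfy the non-negativity requirement is min-max scaling to [0, 1]; the report does not name the exact preprocessing method, so MinMaxScaler here is an assumption:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix

# Rescale features to [0, 1] so all values are non-negative (assumed method)
minmax = MinMaxScaler()
X_train_nonneg = minmax.fit_transform(X_train)
X_test_nonneg = minmax.transform(X_test)

# Train and evaluate Multinomial Naive Bayes on the rescaled data
mnb = MultinomialNB()
mnb.fit(X_train_nonneg, y_train)
y_pred_mnb = mnb.predict(X_test_nonneg)
print(confusion_matrix(y_test, y_pred_mnb))
print("Accuracy:", accuracy_score(y_test, y_pred_mnb))
```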
Logistic Regression Accuracy: 67.81%
Key Observations:
Logistic Regression achieved a moderate accuracy of 67.81%, demonstrating its ability to classify the binary coal rank (Bituminous vs. Non-Bituminous) based on continuous features.
True Positives (1180): Correct predictions for Bituminous coal.
True Negatives (2975): Correct predictions for Non-Bituminous coal.
False Positives (1070): Non-Bituminous classified as Bituminous.
False Negatives (902): Bituminous classified as Non-Bituminous.
Logistic Regression benefits from its probabilistic interpretation, allowing for a more granular understanding of predictions.
Multinomial Naive Bayes Accuracy: 67.66%
Key Observations:
Multinomial Naive Bayes achieved a slightly lower accuracy of 67.66%, showcasing its limitations with continuous data even after scaling.
True Positives (531): Correct predictions for Bituminous coal.
True Negatives (3615): Correct predictions for Non-Bituminous coal.
False Positives (430): Non-Bituminous classified as Bituminous.
False Negatives (1551): Bituminous classified as Non-Bituminous.
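Both accuracies are consistent with the confusion-matrix counts above (6,127 test observations in each case):

accuracy = (TP + TN) / (TP + TN + FP + FN)
Logistic Regression: (1180 + 2975) / 6127 = 4155 / 6127 ≈ 67.81%
Multinomial Naive Bayes: (531 + 3615) / 6127 = 4146 / 6127 ≈ 67.66%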
Multinomial Naive Bayes performed less effectively in capturing the continuous feature relationships compared to Logistic Regression.
3. Model Comparison
Logistic Regression slightly outperformed Multinomial Naive Bayes in terms of accuracy (67.81% vs. 67.66%). The bar chart visualization highlights this small difference in performance.
Logistic Regression exhibited a higher AUC (Area Under the Curve) of 0.76, indicating better overall classification performance compared to Multinomial Naive Bayes (AUC = 0.69).
The ROC curve visually emphasizes Logistic Regression's superiority in distinguishing between the two coal rank categories.
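A sketch of how the ROC comparison can be produced with scikit-learn and matplotlib, reusing the fitted models and test arrays from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive (Bituminous) class
lr_probs = log_reg.predict_proba(X_test_scaled)[:, 1]
mnb_probs = mnb.predict_proba(X_test_nonneg)[:, 1]

# Plot one ROC curve per model, annotated with its AUC
for name, probs in [("Logistic Regression", lr_probs),
                    ("Multinomial Naive Bayes", mnb_probs)]:
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, probs):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves: Logistic Regression vs. Multinomial Naive Bayes")
plt.legend()
plt.show()
```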
4. Insights from the Results
Logistic Regression is better suited for continuous data and performs well when the dataset is standardized.
Multinomial Naive Bayes struggled with the continuous nature of the dataset, even after data preprocessing, highlighting its limitation with non-discrete features.
Both models provide valuable insights into the classification of coal ranks but demonstrate distinct strengths:
Logistic Regression: Stronger for capturing continuous relationships.
Multinomial Naive Bayes: Simpler to implement and interpret but less effective for this dataset.
The results provide valuable insights into the classification of coal shipments based on their carbon intensity levels, using logistic regression and multinomial Naive Bayes models. The following key takeaways relate directly to the topic of analyzing and predicting carbon intensity within coal shipments:
1. Feature Importance in Predicting Coal Rank
Both Logistic Regression and Multinomial Naive Bayes relied on features like ash content, sulfur content, quantity, price, and heat content to classify coal into Bituminous (high rank) or non-Bituminous (low rank). These features are critical indicators of coal quality and its environmental impact.
Logistic Regression's higher accuracy (67.81%) suggests that these continuous features align better with logistic regression's assumptions, indicating strong predictive capability for carbon intensity.
2. Behavior of Carbon-Intensive Coal
Logistic Regression's ability to correctly classify Bituminous coal suggests that Bituminous coal shipments have distinct feature values (e.g., higher sulfur and ash content) compared to non-Bituminous coal.
Misclassifications in Multinomial Naive Bayes reveal overlapping distributions in certain feature values, highlighting the challenge in separating shipments with borderline characteristics. This points to potential variability within the data, possibly due to external factors like region, supplier quality, or processing methods.
3. Model Suitability for Carbon Intensity Analysis
Logistic Regression emerged as the more suitable model for analyzing the carbon intensity of coal shipments, thanks to its ability to handle continuous features after scaling.
Multinomial Naive Bayes, while effective for simpler, discrete data, struggled to handle the continuous nature of the coal shipment dataset, leading to lower accuracy and a higher false-negative count.
4. Practical Implications for Carbon Intensity Prediction
The results demonstrate that Logistic Regression can be effectively used to predict the carbon intensity of future coal shipments. By leveraging this model, stakeholders can identify high-carbon-intensity shipments and make informed decisions to mitigate their environmental impact.
The insights gained from the confusion matrices show that Bituminous (higher carbon intensity) coal is easier to predict accurately, which aligns with the real-world significance of such classifications for emission control strategies.
5. Further Directions
The results underline the need for further analysis, such as exploring non-linear models (e.g., decision trees or ensemble methods), which may capture more intricate relationships between the features and carbon intensity levels.
Feature engineering, including adding geographic or supplier data, could improve classification accuracy and provide deeper insights into the factors influencing coal rank and carbon intensity.
The results reinforced the critical role of quantitative features in analyzing coal shipments and predicting their carbon intensity. Logistic Regression proved to be an effective model, offering actionable insights for energy and environmental stakeholders to identify and manage high-carbon shipments. This analysis paves the way for optimizing future coal supply chains and reducing the environmental impact of high-carbon coal.