1) visualise the data to spot potential issues such as outliers, missing values, or skewness (Exploratory Data Analysis, EDA)
2) address the outliers (see below)
3) address missing values (see below)
4) Data scaling and normalisation, especially if an algorithm sensitive to scaling is used (see below).
5) Encoding of categorical features: if the dataset contains categorical features, encoding them into numerical representations (e.g., one-hot encoding, label encoding, or target encoding) is a critical preprocessing step (see below).
6) remove correlated features and redundant values:
Correlation Coefficient: Remove features with high pairwise correlation (e.g., |ρ| > 0.8); see the sketch after this list.
Feature Importance Analysis: Use models like Random Forests or Lasso regression to assess feature importance.
Statistical tests to assess feature relevance:
ANOVA (Analysis of Variance): Evaluate the impact of categorical variables on a continuous target.
LDA (Linear Discriminant Analysis): Select features that maximize class separation in classification tasks.
Chi-Square Tests (χ² tests): Assess the independence of categorical features with respect to the target variable.
Retain only the most relevant features to simplify the model, reduce overfitting, and enhance interpretability.
7) Dimensionality Reduction (Optional) with PCA, t-SNE, or UMAP
8) Feature engineering
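A minimal sketch of the correlation-based pruning in step 6, assuming a pandas DataFrame X that contains only numeric features; the helper name drop_highly_correlated and the 0.8 threshold are illustrative choices, not part of any particular library.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```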
How to deal with missing values?
1) check the distribution of missing values. If they are not missing at random, removal or imputation by the mean or median can introduce bias; in that case, more advanced methods are needed.
2) removal is an option for large datasets with few missing values, or for rows with several missing values.
3) imputation with the mean (only for roughly normal distributions), the median (robust to skewed distributions), or the mode (best for categorical features). More advanced methods such as k-NN or regression imputation can also be used.
4) flagging the missing values (e.g., with an indicator column) so that machine-learning algorithms can learn from their distribution (see the sketch after this list).
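A minimal sketch of options 2-4 with pandas and scikit-learn; the DataFrame and its columns ("age", "city", "income") are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with gaps in numeric and categorical columns.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, 29],
    "city":   ["Paris", "Lyon", None, "Paris", "Lyon"],
    "income": [30_000, 42_000, np.nan, 51_000, np.nan],
})

# Option 2: drop rows with several missing values (here: more than one).
df = df.dropna(thresh=df.shape[1] - 1)

# Option 4: flag missingness first, so the model can learn from the pattern.
df["age_missing"] = df["age"].isna().astype(int)

# Option 3: median for the numeric column, mode for the categorical one.
df["age"]  = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# More advanced: k-NN imputation across the remaining numeric columns.
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])
```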
How to deal with outliers?
Boxplots, PCA, and other visualization methods can help identify outliers. Assuming a normal distribution, a data point with an absolute Z-score greater than 3 is typically considered an outlier.
There are several options for handling outliers:
Removal:
Outliers can be removed using a heuristic threshold, such as Z-scores greater than 3, the 1.5 × IQR rule, or PCA-based methods.
Capping:
Replace extreme values with a heuristic maximum or minimum value, such as the 5th or 95th percentile, to reduce their impact.
Transformation:
If values cannot be removed or capped because they cannot be confidently identified as outliers, and their presence skews the data, transformations can be applied. Techniques like logarithmic or Box-Cox transformations can bring the distribution closer to normal and stabilize variance (reduce heteroscedasticity).
Robust Machine Learning Algorithms:
Use machine learning algorithms that are inherently robust to outliers, such as decision trees, random forests, or robust regression methods like Huber or RANSAC.
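A minimal sketch of the removal, capping, and transformation options above, on a hypothetical 1-D Series with one injected outlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(loc=10, scale=1, size=200), 25.0))  # 25.0 is an injected outlier

# Removal with Z-scores: drop points with |Z| > 3 (assumes roughly normal data).
z = (s - s.mean()) / s.std()
s_z = s[z.abs() <= 3]

# Removal with the 1.5 x IQR rule.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
s_iqr = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Capping (winsorizing): clip extreme values to the 5th and 95th percentiles.
s_capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

# Transformation: log1p pulls in a long right tail and stabilises the variance.
s_log = np.log1p(s)
```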
How to scale the data?
Scaling is a crucial preprocessing step to ensure that features are comparable in magnitude, especially when using algorithms that rely on distances or gradients. Here are the most common scaling techniques:
Normalization (Min-Max Scaling):
Formula: Normalized Value = (x − min(x)) / (max(x) − min(x))
Purpose: Scales the values to a fixed range, typically [0, 1].
When to Use:
Used when the data has varying scales or when distance-based algorithms are employed.
Required for algorithms like k-NN and neural networks.
Example Use: Scaling pixel values in image data or any data where relative scaling matters.
Standardization (Z-score Scaling):
Formula: Standardized Value = (x − mean(x)) / std(x)
Purpose: Transforms data to have a mean of 0 and a standard deviation of 1, but the values are not bound to a specific range.
When to Use:
Recommended for algorithms that assume data is normally distributed or for models that depend on Euclidean distances, like SVM, logistic regression, and linear regression.
Example Use: Scaling features in datasets where distribution-based models are applied (e.g., in linear models or SVMs).
Robust Scaling (Robust Standardization):
Formula: Robust Scaled Value = (x − median(x)) / IQR(x), where IQR(x) is the interquartile range (the difference between the 75th and 25th percentiles).
Purpose: Uses the median and interquartile range instead of the mean and standard deviation, making it less sensitive to outliers.
When to Use:
Best suited when the data contains outliers or has heavy skewness.
Useful for models like tree-based algorithms (e.g., Random Forest) and when outliers must not distort the scaling process.
Example Use: Robust scaling is effective when working with data that includes extreme outliers or when feature distributions are heavily skewed.
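A minimal sketch of the three scalers above using scikit-learn; the tiny feature matrix is hypothetical, and in practice each scaler should be fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0,    200.0],
              [2.0,    300.0],
              [3.0,    400.0],
              [4.0, 10_000.0]])   # the second feature contains an outlier

X_minmax = MinMaxScaler().fit_transform(X)     # values squeezed into [0, 1]
X_std    = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
X_robust = RobustScaler().fit_transform(X)     # median 0, scaled by the IQR (outlier-resistant)
```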
How to deal with categorical values?
Categorical features need to be converted into numerical values for machine learning algorithms. The encoding technique used depends on the nature of the data and the algorithm.
One-Hot Encoding:
Description:
One-hot encoding replaces each categorical value with a binary vector, where each vector represents a category with a 1 or 0. A separate column is created for each category.
Example:
For a feature like "Fruit" with values ["banana", "apple", "orange"], one-hot encoding would convert this into three binary columns, one for each fruit.
Label Encoding:
Description:
Label encoding assigns a unique integer to each category, typically based on the lexicographical order of the categories. For example, "small" could be encoded as 0, "medium" as 1, and "large" as 2.
Suitable for ordinal categories, where there is a meaningful order (e.g., "low", "medium", "high").
Not recommended for nominal categories without any inherent order, as the model might interpret the integer values as having an ordinal relationship, which might not exist.
Limitations:
For nominal categories, the encoded integers might introduce unintended ordinal relationships.
Target Encoding (Mean Encoding):
Description:
Target encoding replaces categorical values with the mean of the target variable for each category. The mean is calculated for each category in the training dataset and then assigned to the corresponding category in the dataset.
Example:
For a feature "Fruit" and a target variable "Price" with values:
When to Use:
Particularly useful when there are many categories and the target variable has a strong relationship with the categories.
Can be especially helpful for high-cardinality categorical features.
Limitations:
Target encoding can lead to data leakage if not done correctly, especially when using it in the training phase without careful cross-validation or during data preprocessing (i.e., using the target in the encoding process).
Overfitting may also occur if the model becomes too reliant on the encoded values for rare categories.
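A minimal sketch of the three encodings above with pandas and scikit-learn; the Fruit/Size/Price data is hypothetical, and the target encoding is computed by hand here, so in practice it should be fitted inside cross-validation to avoid leakage.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Fruit": ["banana", "apple", "orange", "apple", "banana"],
                   "Size":  ["small", "medium", "large", "small", "large"],
                   "Price": [1.0, 2.0, 3.0, 2.5, 1.5]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["Fruit"], prefix="Fruit")

# Label / ordinal encoding: meaningful only because Size has a natural order.
order = [["small", "medium", "large"]]
df["Size_enc"] = OrdinalEncoder(categories=order).fit_transform(df[["Size"]]).ravel()

# Target (mean) encoding: replace each fruit by the mean Price of that fruit.
df["Fruit_target_enc"] = df["Fruit"].map(df.groupby("Fruit")["Price"].mean())
```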
Choosing a method based on the task
Classification methods:
Logistic Regression: For binary classification (e.g., spam detection, fraud detection, cancer).
Decision Trees: Simple and interpretable, but may overfit (e.g., predicting survival on the Titanic).
Random Forests/Ensemble Methods: Random Forests and Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost) are very effective for tabular data with many features and complex relationships.
Support Vector Machines (SVM): Works well for high-dimensional data.
k-Nearest Neighbors (k-NN): Simple, but computationally expensive with large datasets.
Neural Networks (e.g., MLP, CNNs): When you have a large dataset and complex relationships.
Regression methods
Linear Regression: For simple, linear relationships.
Ridge/Lasso Regression: For regularization to avoid overfitting when dealing with many features.
Decision Trees and Random Forests: Handle non-linear relationships and interaction effects well.
Gradient Boosting (XGBoost, LightGBM, CatBoost): Often provide top performance in regression tasks with complex data.
Neural Networks: Suitable when the relationships between features and target are highly complex.
Clustering:
K-Means: Fast and effective for well-separated, spherical clusters.
Hierarchical Clustering: Useful when you need a hierarchical relationship between clusters
DBSCAN: Good for clusters with irregular shapes.
Gaussian Mixture Models (GMM): Useful for probabilistic clustering.
Dimensionality Reduction:
PCA (Principal Component Analysis): For reducing the dimensionality of data with linear relationships.
t-SNE or UMAP: For non-linear dimensionality reduction and visualization of high-dimensional data.
Simple Models (Interpretability, Lower Complexity):
Logistic Regression: Easy to interpret, but limited in capturing complex relationships.
Decision Trees: Simple and interpretable, but prone to overfitting.
Naive Bayes: Simple, interpretable, and effective for text classification.
Complex Models (Less Interpretable, Higher Complexity):
Random Forests: Powerful ensemble method, but less interpretable.
Gradient Boosting (XGBoost, LightGBM): High predictive power, but harder to interpret.
Neural Networks: Powerful but can be opaque; techniques like LIME or SHAP can help interpret deep learning models.
Classification Metrics:
Accuracy: Measures the overall correctness of the model. It's the ratio of correctly predicted instances (both true positives and true negatives) to the total instances. Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
Accuracy works well when the dataset is balanced and the cost of false positives and false negatives is similar. However, it can be misleading in imbalanced datasets.
AUC-ROC (Area Under the Curve - Receiver Operating Characteristic): Measures the model's ability to distinguish between classes at all classification thresholds. It’s a good choice when you have imbalanced data because it takes into account both false positives and false negatives across different thresholds.
True Positive Rate (TPR): Also known as Recall or Sensitivity, it measures the proportion of actual positives that are correctly identified.
TPR = TP / (TP + FN)
False Positive Rate (FPR): Measures the proportion of actual negatives that are incorrectly identified as positives.
FPR = FP / (FP + TN)
The ROC curve plots the TPR (sensitivity) against the FPR (1-specificity) for various thresholds. The AUC value tells you how well the model can distinguish between classes. An AUC of 1 means perfect classification, while AUC of 0.5 means the model is no better than random guessing.
F1-Score: The F1-score is the harmonic mean of Precision and Recall, and it is a good metric when you want a balance between them, especially in the presence of class imbalance.
F1-score = 2 × (Recall × Precision) / (Recall + Precision)
Recall (TPR) measures how many actual positive instances were identified correctly.
Precision measures how many of the predicted positives were actually correct.
Precision: Measures the proportion of positive predictions that are actually correct. It is important when the cost of false positives is high.
Precision=TP/(TP+FP)
Recall: Measures the proportion of actual positives that are correctly predicted. It is important when the cost of false negatives is high.
Recall=TP/(TP+FN)
F1-Score is typically used in scenarios where both Precision and Recall need to be balanced and is especially useful for imbalanced datasets.
Balanced Data:
Accuracy is suitable when the dataset is balanced and there is no strong preference for one class over the other.
Unbalanced Data (No Preference for Any Class):
AUC-ROC is ideal for evaluating the model's ability to distinguish between classes, even when the dataset is imbalanced.
Unbalanced Data (Preference for a Class):
F1-Score is preferred when you want to balance Precision and Recall, especially when false positives or false negatives have different costs.
In practice, choosing between Accuracy, AUC-ROC, F1-Score, and other metrics depends on the problem you're trying to solve and the costs associated with false positives and false negatives.
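A minimal sketch of these classification metrics with scikit-learn; the synthetic 90/10 imbalanced dataset and the logistic-regression model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # probabilities are needed for AUC-ROC

print("accuracy :", accuracy_score(y_te, y_pred))   # can be misleading on imbalanced data
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1       :", f1_score(y_te, y_pred))
print("AUC-ROC  :", roc_auc_score(y_te, y_prob))    # threshold-independent
```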
Regression Metrics:
Mean Absolute Error (MAE):
Definition: The average of the absolute differences between the predicted values and the actual values.
Formula: MAE = sum(abs(pred − obs)) / number of points
Interpretation: MAE provides a clear interpretation of the average error in the same unit as the target variable. It treats all errors equally, without penalizing larger errors more than smaller ones.
Mean Squared Error (MSE):
Definition: The average of the squared differences between the predicted values and the actual values.
Formula:
MSE = sum((pred − obs)^2) / number of points
Interpretation: MSE penalizes larger errors more than smaller ones due to squaring the differences. It is useful when you want to emphasize larger errors, but can be sensitive to outliers.
Root Mean Squared Error (RMSE):
Definition: The square root of the mean squared error (MSE). It brings the error back to the original units of the target variable.
Formula:
RMSE = sqrt(MSE) = sqrt( sum((pred − obs)^2) / number of points )
R² (Coefficient of Determination):
Definition: The proportion of the variance in the dependent variable that is predictable from the independent variables. It represents how well the regression model fits the data.
Formula: R² = 1 − sum((pred − obs)^2) / sum((obs − mean(obs))^2)
Interpretation:
R^2=1 means perfect fit, where the model explains all the variance.
R^2 = 0 means the model does not explain any variance, and is as good as predicting the mean of the target variable for all data points.
Negative R² occurs when the model performs worse than the baseline (predicting the mean).
Adjusted R²:
Definition: A version of R² that adjusts for the number of predictors in the model. It penalizes the inclusion of irrelevant variables.
Formula:
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)
Where:
n = Number of data points
p = Number of predictors (independent variables)
Interpretation: Adjusted R^2 is useful when comparing models with different numbers of predictors, as it adjusts for overfitting and prevents an inflated R^2 score when adding unnecessary features.
Mean Absolute Percentage Error (MAPE):
Definition: The average of the absolute percentage differences between the predicted values and the actual values.
Formula:
MAPE = (1/n) × sum( abs(obs − pred) / obs ) × 100
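A minimal sketch of the regression metrics above on hypothetical predictions; the number of predictors p used for adjusted R² is an assumed value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

obs  = np.array([3.0, 5.0, 7.0, 9.0])
pred = np.array([2.5, 5.5, 6.0, 9.5])

mae  = mean_absolute_error(obs, pred)
mse  = mean_squared_error(obs, pred)
rmse = np.sqrt(mse)
r2   = r2_score(obs, pred)
mape = np.mean(np.abs((obs - pred) / obs)) * 100   # assumes no zero targets

# Adjusted R^2 for n points and p predictors (p = 2 is assumed here).
n, p = len(obs), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```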
Internal Clustering Metrics (no ground truth required):
These metrics assess the quality of the clusters based on the intrinsic characteristics of the data, such as the distances between data points and their assigned cluster centers.
a) Silhouette Score:
Definition: Measures how similar each point is to its own cluster compared to other clusters. It is based on the average distance between points in the same cluster (cohesion) and the average distance between points in different clusters (separation).
Formula: S(i) = (b(i) − a(i)) / max(a(i), b(i))
Where:
a(i) is the average distance from point i to other points in the same cluster.
b(i) is the average distance from point i to points in the nearest cluster.
Interpretation: The silhouette score ranges from -1 to +1:
+1: Point is well-clustered.
0: Point is on or very close to the decision boundary between two clusters.
-1: Point is poorly clustered.
b) Davies-Bouldin Index:
Definition: Measures the average similarity ratio of each cluster with the cluster that is most similar to it. It evaluates both the compactness and separation of the clusters.
Formula: DB = (1/n) × sum over i of [ max over j≠i of ( (σi + σj) / d(ci, cj) ) ]
Where:
σi is the average distance between points in cluster i (measure of cluster compactness).
d(ci,cj) is the distance between the centroids of clusters i and j (measure of separation).
Interpretation: Lower values of the Davies-Bouldin index indicate better clustering (more compact and well-separated clusters).
c) Dunn Index:
Definition: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. The higher the Dunn index, the better the clustering.
Formula: D = min over i≠j of d(ci, cj) / max over k of δ(ck)
Where:
d(ci,cj) is the distance between the centroids of clusters ci and cj.
δ(ci) is the maximum distance between points within cluster ci.
Interpretation: A higher Dunn index indicates better clustering (greater separation between clusters and more compact clusters, i.e., smaller intra-cluster distances).
d) Inertia (or Within-Cluster Sum of Squares, WCSS):
Definition: The sum of squared distances between each data point and the centroid of its assigned cluster. It measures how tight the clusters are.
Formula:
Inertia = sum over all points of (distance from the point to its cluster centroid)^2
Interpretation: Lower inertia values indicate better clustering, with more compact clusters. However, inertia alone is not sufficient to assess clustering quality because it tends to decrease with more clusters.
e) Calinski-Harabasz Index (Variance Ratio Criterion):
Definition: Measures the ratio of the sum of between-cluster dispersion to within-cluster dispersion. A higher score indicates that clusters are well-separated and compact.
Formula:
CH = (Bk / Wk) × (n − k) / (k − 1)
Where:
Bk is the between-cluster dispersion matrix.
Wk is the within-cluster dispersion matrix.
n is the number of points, and k is the number of clusters.
Interpretation: A higher Calinski-Harabasz score suggests better clustering (well-separated and compact clusters).
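A minimal sketch of the internal metrics above with scikit-learn on hypothetical blob data; the Dunn index has no scikit-learn built-in, so it is omitted here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_

print("silhouette       :", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin   :", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("inertia (WCSS)   :", km.inertia_)                         # lower is better
```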
External Clustering Metrics (require ground-truth labels):
a) Adjusted Rand Index (ARI):
Definition: Measures the similarity between two data clusterings by considering all pairs of data points. It corrects for the chance grouping of points.
Formula: ARI = (RI − E[RI]) / (max(RI) − E[RI])
Where RI is the Rand index, and E[RI] is the expected Rand index under random cluster assignments.
Interpretation: ARI ranges from -1 to +1:
+1 indicates perfect agreement between the predicted clusters and the true labels.
0 indicates random clustering.
Negative values indicate worse than random clustering.
b) Adjusted Mutual Information (AMI):
Definition: Measures the amount of information shared between the predicted clusters and the true labels, adjusting for chance.
Formula:
AMI = (I(U,V) − E[I(U,V)]) / (avg(H(U), H(V)) − E[I(U,V)])
Where I(U,V) is the mutual information between the predicted clusters U and the true labels V, E[·] is its expected value under random labelling, and H(·) denotes entropy.
Interpretation: AMI ranges from 0 to 1, with 1 indicating perfect agreement between the predicted clusters and the true labels, and 0 indicating no mutual information.
c) V-Measure:
Definition: Measures how well the clustering matches the ground truth labels by calculating both the homogeneity (degree to which data points in the same cluster have the same label) and completeness (degree to which data points with the same label are assigned to the same cluster).
Formula:
V = 2 × (Homogeneity × Completeness) / (Homogeneity + Completeness)
Interpretation: V-Measure ranges from 0 to 1, with 1 indicating perfect homogeneity and completeness.
Internal Metrics: Evaluate the quality of clusters based on the data itself (no ground truth required). Examples: Silhouette Score, Davies-Bouldin Index, Dunn Index, Inertia, Calinski-Harabasz Index.
External Metrics: Compare the clustering result with a known ground truth (requires labels). Examples: Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), V-Measure.
These metrics help assess the effectiveness of clustering and guide the selection of the best clustering algorithm and parameters for a given dataset.
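A minimal sketch of the external metrics above with scikit-learn; the ground-truth and predicted labels are hypothetical.

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             v_measure_score)

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # an imperfect clustering

print("ARI      :", adjusted_rand_score(true_labels, pred_labels))
print("AMI      :", adjusted_mutual_info_score(true_labels, pred_labels))
print("V-measure:", v_measure_score(true_labels, pred_labels))
```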
Unbalanced datasets can bias machine learning models towards the majority class, leading to suboptimal performance for the minority class. Here are strategies to address this issue:
1. Data-Level Techniques
These methods involve adjusting the dataset to balance the classes before training.
a) Undersampling:
Definition: Reduce the number of samples in the majority class to match the size of the minority class.
Advantages:
Simplifies the dataset, reducing training time.
Disadvantages:
Risk of losing valuable information from the majority class, potentially leading to underfitting.
Techniques:
Random undersampling: Randomly remove majority class samples.
Cluster-based undersampling: Use clustering to identify and retain representative samples.
b) Oversampling:
Definition: Increase the number of samples in the minority class to balance the dataset.
Advantages:
Retains all majority class information while improving representation of the minority class.
Disadvantages:
Risk of overfitting, as new samples are artificially generated.
Techniques:
Random Oversampling: Duplicate random samples from the minority class.
SMOTE (Synthetic Minority Oversampling Technique):
Generates synthetic samples by interpolating between existing minority class samples.
ADASYN (Adaptive Synthetic Sampling): A variant of SMOTE that focuses on generating synthetic samples for minority class regions that are harder to learn.
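A minimal sketch of random undersampling and SMOTE, assuming the third-party imbalanced-learn package is installed; in practice, resample the training split only, never the test split.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Hypothetical 95/5 imbalanced dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Undersampling: shrink the majority class to the size of the minority class.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Oversampling: SMOTE synthesises new minority-class points by interpolation.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
```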
2. Algorithm-Level Techniques
Modify the learning algorithm to account for class imbalance.
a) Class Weights:
Definition: Assign higher weights to the minority class during training to penalize misclassification more heavily.
Advantages:
Effective without altering the dataset.
Implementation:
Many algorithms, such as Logistic Regression, Random Forest, and SVM, allow weight adjustments via hyperparameters (e.g., class_weight='balanced' in scikit-learn).
b) Specialized Algorithms:
Use algorithms designed to handle imbalance, such as:
BalancedRandomForestClassifier.
XGBoost or LightGBM with class imbalance options (e.g., scale_pos_weight).
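A minimal sketch of class weighting in scikit-learn on a hypothetical imbalanced dataset; scale_pos_weight would play the analogous role in XGBoost/LightGBM.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
forest  = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)
```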
3. Evaluation Metrics for Imbalanced Data
Accuracy is often misleading for imbalanced datasets, so alternative metrics should be used:
a) Precision:
Definition: Measures the proportion of true positive predictions among all positive predictions.
When to use: When false positives are costly.
b) Recall (True Positive Rate):
Definition: Measures the proportion of true positives among all actual positives.
When to use: When false negatives are costly.
c) F1-Score:
Definition: Harmonic mean of precision and recall, providing a balance between the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
When to use: When both false positives and false negatives are important.
d) ROC-AUC:
Definition: Measures the trade-off between true positive rate (TPR) and false positive rate (FPR) across thresholds.
When to use: When you need a general view of model performance.
e) PR-AUC (Precision-Recall AUC):
Definition: Measures the trade-off between precision and recall across thresholds.
When to use: When the minority class is the focus.
4. Advanced Techniques
a) Ensemble Methods:
Use ensembles like bagging and boosting to improve performance on imbalanced datasets.
Example: EasyEnsemble (builds multiple balanced datasets by undersampling and trains multiple models).
b) Cost-Sensitive Learning:
Directly incorporates the cost of misclassification into the learning algorithm.
c) Anomaly Detection:
Treat the minority class as an anomaly and use anomaly detection techniques to identify it.
d) Data Augmentation:
Augment data for the minority class using techniques like rotation, flipping, or noise addition (common in image or text datasets).
Use undersampling or oversampling to balance the dataset.
Leverage class weights or specialized algorithms to handle imbalance directly.
Focus on evaluation metrics like F1-score, ROC-AUC, or PR-AUC to assess model performance meaningfully.
Consider ensemble methods and advanced sampling techniques for more complex problems.
n-Fold Cross-Validation:
n-fold cross-validation is a technique used to evaluate how well a model generalizes to unseen data and detect overfitting.
In this approach:
The dataset is split into n subsets (or "folds").
For each fold, the model is trained using n-1 folds (training set) and tested on the remaining 1 fold (test set).
This process is repeated n times, with each fold being used once as the test set, and the model is evaluated using a chosen performance metric (accuracy, precision, recall, etc.).
The overall performance is averaged across all n iterations.
Signs of Overfitting:
Overfitting occurs when a model performs well on the training data but poorly on the test data, indicating that the model has memorized the training data (high variance) instead of learning general patterns.
In cross-validation:
If the model performs significantly better on the training set than on the test set (i.e., low test set performance), it suggests that the model is overfitting the training data.
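A minimal sketch of using k-fold cross-validation to spot overfitting, assuming an unconstrained decision tree on synthetic data; a large gap between the mean train and test scores is the warning sign described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=None, random_state=0)   # unconstrained depth tends to overfit

scores = cross_validate(tree, X, y, cv=5, return_train_score=True)
print("train accuracy:", np.mean(scores["train_score"]))   # close to 1.0
print("test  accuracy:", np.mean(scores["test_score"]))    # noticeably lower
```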
How to deal with overfitting?
Add More Data:
More data can help the model learn general patterns better and reduce overfitting. With more data, the model has less chance of memorizing specific details from the training set, and it can generalize better.
Use a Simplified Model:
A simplified model with fewer parameters is less likely to overfit. For example, using a model with fewer features, less complex algorithms, or reducing the depth of decision trees can help prevent overfitting.
Simplification can also be done by reducing the model’s capacity, such as using linear models instead of non-linear ones (e.g., switching from a deep neural network to a shallow one).
Use Regularization:
Regularization techniques penalize large model coefficients or weights, reducing the model's complexity and preventing overfitting (see the sketch after this list):
Lasso (L1 regularization): It adds a penalty equivalent to the absolute value of the magnitude of coefficients. This can result in some coefficients becoming zero, effectively performing feature selection.
Ridge (L2 regularization): It adds a penalty equivalent to the square of the magnitude of coefficients, which reduces the magnitude of the coefficients but doesn’t force them to zero.
ElasticNet: A combination of both L1 and L2 regularization, which can be useful when there are correlations between features.
Feature Selection: It works by removing irrelevant or redundant features from the model, simplifying its structure and helping it generalize better to unseen data.
Other Techniques:
Early Stopping: In iterative models like neural networks, training can be stopped early if performance on the validation set starts to degrade, even if the training set performance is still improving.
Cross-Validation: Use k-fold cross-validation regularly to monitor the model’s performance on multiple validation sets and ensure it generalizes well.
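A minimal sketch of Lasso (L1), Ridge (L2), and ElasticNet with scikit-learn, as referenced in the regularization item above; the alpha values are hypothetical and should be tuned, and features are standardized inside a pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# L1: some coefficients are driven exactly to zero (implicit feature selection).
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)

# L2: coefficients shrink but stay non-zero.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# ElasticNet: a mix of L1 and L2 penalties.
enet = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5)).fit(X, y)
```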
Bagging (Bootstrap Aggregating):
Purpose: Reduce variance and avoid overfitting.
Method:
Multiple models (usually of the same type) are trained independently on different subsets of the data.
These subsets are created through bootstrapping (sampling with replacement).
Each model is trained in parallel.
The final prediction is the average (for regression) or majority vote (for classification) of the individual models' predictions.
Key Point: The models do not interact with each other. Examples: Random Forest is a well-known bagging algorithm.
Boosting:
Purpose: Reduce bias by focusing on correcting errors made by previous models.
Method:
Models are trained sequentially, with each new model focusing on the errors of its predecessors.
Data points that were incorrectly predicted by earlier models are given more weight so that the subsequent models pay more attention to them.
The final prediction is a weighted combination of all the models.
Key Point: Models are dependent on each other, as later models build upon the performance of earlier ones. Examples: AdaBoost, Gradient Boosting, and XGBoost.
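A minimal sketch contrasting a bagging ensemble (Random Forest) with a boosting ensemble (Gradient Boosting) on hypothetical data; the number of estimators is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

bagging  = RandomForestClassifier(n_estimators=200, random_state=0)      # trees trained independently on bootstrap samples
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)  # trees trained sequentially on previous errors

print("Random Forest    :", cross_val_score(bagging, X, y, cv=5).mean())
print("Gradient Boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```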