GOAL : Data processing
Ch4 : Learning how to process numerical features.
Ch5 : Learning how to handle categorical features.
Learning experience:
Ch4 covers scikit-learn's functions for processing numerical features, drawing on the preprocessing, covariance, cluster, and impute modules.
Ch5 covers scikit-learn's functions for handling categorical features, including encoding, imputing missing values, and handling imbalanced classes.
Working environment:
OS: Windows 11 Home
CPU: Intel Core i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
4.0 Introduction
Quantitative data is the measurement of something. The natural way to represent these quantities is numerically (e.g., 29 students, $529,392 in sales). In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.
4.1 Rescaling a Feature
This section will rescale the values of a numerical feature to be between two values.
class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False) Transform features by scaling each feature to a given range. [2]
The In 1 snippet performs min-max scaling on a feature. First, we import the necessary libraries, including NumPy and the preprocessing module from sklearn.
Within the code, we create an array containing the feature data. Then, a MinMaxScaler object is instantiated with the specified feature range of 0 to 1.
Next, the fit_transform method is applied to scale the feature, transforming each value in the feature array to fall within the range of 0 to 1.
Finally, the scaled feature array (scaled_feature) is displayed. This scaling process is commonly employed to standardize the range of feature values, thus preventing certain features from exerting undue influence on the model.
The output is shown below, along with the calculation process.
In 2 we change the feature range to (-1, 1), and we can see the output is rescaled to fall between -1 and 1.
Note : scikit-learn’s MinMaxScaler offers two options to rescale a feature. One option is to use fit to calculate the minimum and maximum values of the feature, and then use transform to rescale the feature. The second option is to use fit_transform to do both operations at once. There is no mathematical difference between the two options, but there is sometimes a practical benefit to keeping the operations separate because it allows us to apply the same transformation to different sets of the data.
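For reference, a minimal sketch of the min-max scaling described above (the feature values are illustrative, not the notebook's):

import numpy as np
from sklearn import preprocessing

# A single numerical feature (one column)
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Scaler that maps values into the range [0, 1]
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Fit and transform in one step
scaled_feature = minmax_scale.fit_transform(feature)
print(scaled_feature)

# Equivalent two-step form, useful when the same transformation
# must later be applied to new data
minmax_scale.fit(feature)
print(minmax_scale.transform(feature))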
4.2 Standardizing a Feature
This section will transform a feature to have a mean of 0 and a standard deviation of 1.
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True) Standardize features by removing the mean and scaling to unit variance. [3]
In 3, firstly, an array x is created containing numerical values representing different observations.
Subsequently, a standardization scaler object is instantiated using the StandardScaler class. This scaler is designed to transform the feature values to have a mean of 0 and a standard deviation of 1, effectively conforming them to a standard normal distribution.
Then, the fit_transform method is applied to standardize the feature. This method calculates the mean and standard deviation of the feature and transforms each value accordingly to adhere to the standard normal distribution.
Finally, the standardized feature is displayed.
In 4 we can check the mean and standard deviation after standardization: the mean is 0 and the standard deviation is 1.
If our data has significant outliers, it can negatively impact our standardization by affecting the feature’s mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and quartile range. In scikit-learn, we do this using the RobustScaler method, as shown in In 5.
class sklearn.preprocessing.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False) Scale features using statistics that are robust to outliers. [4]
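A minimal sketch of standardization and robust scaling as described in In 3 to In 5 (the values are illustrative):

import numpy as np
from sklearn import preprocessing

# A feature containing an obvious outlier
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Standardize to mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler()
standardized = scaler.fit_transform(x)
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())

# RobustScaler rescales with the median and interquartile range,
# so it is far less affected by the outlier
robust_scaler = preprocessing.RobustScaler()
print(robust_scaler.fit_transform(x))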
4.3 Normalizing Observations
This section will rescale the feature values of observations to have unit norm (a total length of 1).
class sklearn.preprocessing.Normalizer(norm='l2', *, copy=True) Normalize samples individually to unit norm. [5]
In 28 firstly, a feature matrix features is created, containing multiple features. Each row represents a sample, and each column represents a feature.
Next, a Normalizer object is instantiated, specifying the normalization method (norm) as "l2", which means using L2 norm (Euclidean distance) for normalization. The L2 norm is computed by summing the squares of each element in the vector and then taking the square root.
Then, the transform method is applied to normalize the feature matrix. This process transforms each feature vector of a sample into a vector with a length of 1, effectively placing them on the unit sphere.
Finally, we obtain the normalized feature matrix, where each sample's feature vector is transformed into a vector of the same length, facilitating easier comparisons or computations.
Alternatively, we can specify the Manhattan norm (L1), which sums the absolute values of the vector's elements; the fourth figure shows the formula: ||x||1 = |x1| + |x2| + ... + |xn|.
In 32 the transform method is applied to the feature matrix features, which normalizes each sample's feature vector using the L1 norm.
Finally, the normalized feature matrix features_l1_norm is obtained, where each sample's feature vector is transformed into a vector with the sum of absolute values equal to 1, effectively placing them on the L1 unit ball.
Practically, notice that norm="l1" rescales an observation’s values so they sum to 1, which can sometimes be a desirable quality:
Note : Normalizer rescales the values on individual observations to have unit norm (the sum of their lengths is 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification when every word or n-word group is a feature). Normalizer provides three norm options, with the Euclidean norm (often called L2) being the default argument; the second figure shows the formula: ||x||2 = sqrt(x1^2 + x2^2 + ... + xn^2).
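A minimal sketch of L2 and L1 normalization as described above (the feature matrix is illustrative):

import numpy as np
from sklearn.preprocessing import Normalizer

# Feature matrix: each row is an observation
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

# L2 (Euclidean) norm: each row is rescaled to unit Euclidean length
print(Normalizer(norm="l2").transform(features))

# L1 (Manhattan) norm: each row's absolute values sum to 1
features_l1_norm = Normalizer(norm="l1").transform(features)
print(features_l1_norm)
print("Sum of the first observation's values:",
      features_l1_norm[0, 0] + features_l1_norm[0, 1])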
4.4 Generating Polynomial and Interaction Features
This section will create polynomial and interaction features.
class sklearn.preprocessing.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C') , Generate polynomial and interaction features. [6]
In 34 first, a feature matrix features is created, containing multiple features. Each row represents a sample, and each column represents a feature.
Next, a PolynomialFeatures object is instantiated with the specified degree of the polynomial features to be generated, set here as 2. Additionally, the include_bias parameter is set to False, indicating that no bias (constant) term should be generated.
Then, the fit_transform method is applied to the feature matrix to generate polynomial interaction terms.
Finally, we obtain a new feature matrix containing the original features along with their second-degree interaction terms.
In 35 restricts the features created to interaction features only by setting interaction_only to True.
Note : Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target.
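A minimal sketch of generating polynomial and interaction features (the feature matrix is illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Feature matrix with two features per observation
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Degree-2 polynomial features without the bias column:
# the output columns are x1, x2, x1^2, x1*x2, x2^2
polynomial = PolynomialFeatures(degree=2, include_bias=False)
print(polynomial.fit_transform(features))

# Interaction terms only: x1, x2, x1*x2
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interaction.fit_transform(features))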
4.5 Transforming Features
This section will make a custom transformation to one or more features.
class sklearn.preprocessing.FunctionTransformer(func=None, inverse_func=None, *, validate=False, accept_sparse=False, check_inverse=True, feature_names_out=None, kw_args=None, inv_kw_args=None) , Constructs a transformer from an arbitrary callable.[7]
In 36 first, a feature matrix features is created, containing multiple features. Each row represents a sample, and each column represents a feature.
Next, a simple function add_ten is defined. This function takes an input integer and returns the integer plus 10.
Then, a FunctionTransformer object is created, and the function add_ten is passed to it.
Finally, the transform method is used to transform the feature matrix features, applying the add_ten function to each element in the feature matrix and adding 10 to its original value.
In 37 we can create the same transformation in pandas using apply.
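A minimal sketch of the custom transformation described in In 36 and In 37 (the values are illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Custom transformation: add 10 to every value
def add_ten(x):
    return x + 10

# Apply it with FunctionTransformer
ten_transformer = FunctionTransformer(add_ten)
print(ten_transformer.transform(features))

# The same transformation in pandas using apply
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
print(df.apply(add_ten))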
4.6 Detecting Outliers
This section will identify extreme observations. Detecting outliers is unfortunately more of an art than a science. However, a common method is to assume the data is normally distributed and, based on that assumption, “draw” an ellipse around the data, classifying any observation inside the ellipse as an inlier (labeled as 1) and any observation outside the ellipse as an outlier (labeled as -1):
class sklearn.covariance.EllipticEnvelope(*, store_precision=True, assume_centered=False, support_fraction=None, contamination=0.1, random_state=None), An object for detecting outliers in a Gaussian distributed dataset. [8]
In 38 we generate simulated data using the make_blobs function, creating a dataset with 10 samples and 2 features. These data are generated around a central point, resulting in most samples being densely packed in one area.
Next, we replace the feature values of the first sample with extreme values (10000), simulating an outlier.
Then, we create an EllipticEnvelope object as an outlier detector. Here, we set the contamination parameter to 0.1, indicating the expected proportion of outliers in the data.
Subsequently, we use the fit method to fit the detector to the data, allowing it to learn the pattern of normal data.
Finally, we use the predict method to predict outliers for each sample in the dataset, returning 1 for inliers (normal data points) and -1 for outliers. In these arrays, values of -1 refer to outliers whereas values of 1 refer to inliers.
Note : If we expect our data to have few outliers, we can set contamination to something small. However, if we believe that the data is likely to have outliers, we can set it to a higher value.
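A minimal sketch of the EllipticEnvelope workflow described above (the simulated data and the extreme value are illustrative):

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Simulated data: 10 observations, 2 features, one tight cluster
features, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)

# Replace the first observation with extreme values to simulate an outlier
features[0, 0] = 10000
features[0, 1] = 10000

# Fit the detector and label inliers (1) versus outliers (-1)
outlier_detector = EllipticEnvelope(contamination=0.1)
outlier_detector.fit(features)
print(outlier_detector.predict(features))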
Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using the interquartile range (IQR). In 39 first selects the first feature from the previously generated feature matrix and stores it in a variable named feature.
Next, a function indicies_of_outliers is defined to find the indices of outliers in the feature. This function calculates the first quartile (Q1) and the third quartile (Q3) of the feature, and then computes the upper and lower bounds of outliers using 1.5 times the interquartile range (IQR). Finally, it uses the np.where function to find the indices of values in the feature that fall outside of these bounds.
Finally, we apply this function to the feature named feature and return a NumPy array containing the indices of outliers.
IQR is the difference between the first and third quartile of a set of data. You can think of IQR as the spread of the bulk of the data, with outliers being observations far from the main concentration of data. Outliers are commonly defined as any value 1.5 IQRs less than the first quartile, or 1.5 IQRs greater than the third quartile.
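A minimal sketch of the IQR-based detection described in In 39 (the feature values are illustrative; the function name follows the one mentioned above):

import numpy as np

# A single numerical feature containing one extreme value
feature = np.array([10000.0, -2.8, -4.1, -3.2, -3.5, -3.1, -2.7, -3.9, -3.4, -3.0])

def indicies_of_outliers(x):
    # First and third quartiles and the interquartile range
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Fences at 1.5 IQRs below Q1 and above Q3
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    # Indices of values outside the fences
    return np.where((x > upper_bound) | (x < lower_bound))

print(indicies_of_outliers(feature))  # the extreme value at index 0 is flagged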
Note : There is no single best technique for detecting outliers. Instead, we have a collection of techniques all with their own advantages and disadvantages. Our best strategy is often trying multiple techniques (e.g., both EllipticEnvelope and IQR-based detection) and looking at the results as a whole.
4.7 Handling Outliers
This section covers what to do when you have outliers in your data and want to identify them and then reduce their impact on the data distribution.
In 40 we drop observations where Bathrooms is greater than 20, so the outliers no longer appear in the output.
In 41 we can instead mark them as outliers and include “Outlier” as a feature, so the outliers are flagged rather than removed.
In 43 we can transform the feature to dampen the effect of the outlier, so the resulting Log of Square Feet value is no longer extreme.
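A minimal sketch of the three options described in In 40, In 41, and In 43, assuming a small housing DataFrame with Price, Bathrooms, and Square_Feet columns (the values and column names are illustrative):

import numpy as np
import pandas as pd

# Small housing dataset with one extreme observation
houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bathrooms": [2, 3.5, 2, 116],
    "Square_Feet": [1500, 2500, 1500, 48000],
})

# Option 1: drop observations with more than 20 bathrooms
print(houses[houses["Bathrooms"] < 20])

# Option 2: keep them, but flag them with an "Outlier" feature
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)

# Option 3: dampen the effect with a log transformation
houses["Log_Of_Square_Feet"] = [np.log(x) for x in houses["Square_Feet"]]
print(houses)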
Note : Similar to detecting outliers, there is no hard-and-fast rule for handling them. How we handle them should be based on two aspects. First, we should consider what makes them outliers. If we believe they are errors in the data, such as from a broken sensor or a miscoded value, then we might drop the observation or replace outlier values with NaN since we can’t trust those values. However, if we believe the outliers are genuine extreme values (e.g., a house [mansion] with 200 bathrooms), then marking them as outliers or transforming their values is more appropriate.
Second, how we handle outliers should be based on our goal for machine learning. For example, if we want to predict house prices based on features of the house, we might reasonably assume the price for mansions with over 100 bathrooms is driven by a different dynamic than regular family homes. Furthermore, if we are training a model to use as part of an online home loan web application, we might assume that our potential users will not include billionaires looking to buy a mansion.
So what should we do if we have outliers? Think about why they are outliers, have an end goal in mind for the data, and, most importantly, remember that not making a decision to address outliers is itself a decision with implications.
One additional point: if you do have outliers, standardization might not be appropriate because the mean and variance might be highly influenced by the outliers. In this case, use a rescaling method more robust against outliers, like RobustScaler.
4.8 Discretizating Features
This section covers what to do when you have a numerical feature and want to break it up into discrete bins.
In 44 we can binarize the feature according to some threshold: with a threshold of 18, values below 18 are set to 0 and the rest are set to 1.
In 45 we can break up numerical features according to multiple thresholds, splitting the output into four regions: bin 0 for values below 20, bin 1 for values from 20 up to (but not including) 30, bin 2 for values from 30 up to (but not including) 64, and bin 3 for values of 64 and above.
In 46, note that the arguments for the bins parameter denote the left edge of each bin; we can switch this behavior by setting the right parameter to True, in which case 20 falls into bin 0.
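A minimal sketch of binarizing and binning as described in In 44 to In 46 (the ages are illustrative):

import numpy as np
from sklearn.preprocessing import Binarizer

# Ages of several people
age = np.array([[6], [12], [20], [36], [65]])

# Binarize at a threshold of 18: values above the threshold become 1
binarizer = Binarizer(threshold=18)
print(binarizer.fit_transform(age))

# Multiple thresholds: bins are <20, [20, 30), [30, 64), >=64
print(np.digitize(age, bins=[20, 30, 64]))

# With right=True the thresholds become right edges, so 20 falls into bin 0
print(np.digitize(age, bins=[20, 30, 64], right=True))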
4.9 Grouping Observations Using Clustering
This section will cluster observations so that similar observations are grouped together.
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd'), K-Means clustering. [9] [10]
In 47 firstly, the make_blobs function generates a simulated dataset with 50 samples and 2 features. These samples are divided into 3 clusters, and these clusters are generated around some central points, with each cluster's distribution being distinct from others.
Next, the feature matrix is converted into a Pandas DataFrame using the pd.DataFrame function for ease of further operations.
Then, a KMeans object is created, with the number of clusters set to 3. The random seed (random_state) is set to 0 to ensure reproducibility of results.
Then, the fit method is used to fit the KMeans object to the feature matrix to find the centroids of the clusters.
Finally, the predict method is used to predict the cluster each sample belongs to, and these predicted results are stored in a new column "group" in the DataFrame.
Lastly, we view the first 5 observations in the DataFrame, displaying the feature values of each sample along with the cluster label it belongs to.
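A minimal sketch of the clustering step described above (the random seed for make_blobs is an assumption; n_init is set explicitly to keep behavior stable across scikit-learn versions):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Simulated data: 50 observations, 2 features, 3 clusters
features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# Put the features into a DataFrame for convenience
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Fit k-means with 3 clusters and assign each observation to a group
clusterer = KMeans(n_clusters=3, random_state=0, n_init=10)
clusterer.fit(features)
dataframe["group"] = clusterer.predict(features)

# View the first five observations with their cluster labels
print(dataframe.head(5))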
In 58 first utilizes t-SNE to reduce the feature space to two dimensions. Subsequently, it plots the data points from the reduced feature space on a scatter plot, employing different colors to represent distinct clustering groups.
Note : we can use clustering as a preprocessing step. Specifically, we use unsupervised learning algorithms like k-means to cluster observations into groups. The result is a categorical feature, with similar observations being members of the same group. Ch19 discusses clustering in detail.
4.10 Deleting Observations with Missing Values
This section will delete observations containing missing values.
numpy.isnan(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj]) = <ufunc 'isnan'> , Test element-wise for NaN and return result as a boolean array. [11]
DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False) , Remove missing values. [12]
In 59, deleting observations with missing values is easy with a clever line of NumPy: the rows containing missing values are removed.
In 60 we can also drop observations with missing values using pandas dropna.
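A minimal sketch of both deletion approaches (the feature matrix is illustrative):

import numpy as np
import pandas as pd

# Feature matrix containing one missing value
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])

# NumPy: keep only the rows with no NaN values
print(features[~np.isnan(features).any(axis=1)])

# pandas: drop rows containing missing values
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
print(dataframe.dropna())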
Note : Most machine learning algorithms cannot handle any missing values in the target and feature arrays. For this reason, we cannot ignore missing values in our data and must address the issue during preprocessing.
Just as important, depending on the cause of the missing values, deleting observations can introduce bias into our data. There are three types of missing data:
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
It is sometimes acceptable to delete observations if they are MCAR or MAR. However, if the value is MNAR, the fact that a value is missing is itself information. Deleting MNAR observations can inject bias into our data because we are removing observations produced by some unobserved systematic effect.
4.11 Imputing Missing Values
This section covers what to do when you have missing values in your data and want to impute them via a generic method or prediction.
class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False, keep_empty_features=False), Imputation for completing missing values using k-Nearest Neighbors. [13]
class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False), Univariate imputer for completing missing values with simple strategies. [14]
In 61 first generates a simulated feature matrix using the make_blobs function, consisting of 1000 samples and 2 features. The data is generated with a specified random seed (random_state) to ensure reproducibility of results.
Next, the features are standardized, meaning each feature's mean is scaled to 0 and variance to 1. This ensures that each feature has the same scale, which can aid in training models and performance.
Then, the first value of the first feature in the feature matrix is replaced with a missing value (NaN), simulating a scenario where missing values might exist in real data.
Subsequently, the KNNImputer class is used to predict missing values in the feature matrix. The KNNImputer fills missing values by looking at the values of the nearest neighbors. Here, we specify the number of neighbors to be 5.
Finally, we compare the true value with the imputed value and print the results to the console.
Note : If you have a small amount of data, predict and impute the missing values using k-nearest neighbors.
In 62, a SimpleImputer object is created using the "mean" strategy. This means that missing values will be replaced with the mean of that feature.
Finally, the fit_transform method is used to impute missing values in the feature matrix, and the true value is compared with the imputed value, with the results printed to the console.
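A minimal sketch of both imputation strategies described in In 61 and In 62 (the random seed is an assumption):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler

# Simulated feature matrix: 1000 observations, 2 features
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)

# Standardize the features
standardized_features = StandardScaler().fit_transform(features)

# Replace the first value with a missing value
true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan

# Impute using the 5 nearest neighbors
knn_imputed = KNNImputer(n_neighbors=5).fit_transform(standardized_features)

# Impute using the feature's mean instead
mean_imputed = SimpleImputer(strategy="mean").fit_transform(standardized_features)

print("True value:", true_value)
print("KNN imputed value:", knn_imputed[0, 0])
print("Mean imputed value:", mean_imputed[0, 0])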
Note : we can use scikit-learn’s SimpleImputer class from the impute module to fill in missing values with the feature’s mean, median, or most frequent value. However, we will typically get worse results than with KNN.
There are two main strategies for replacing missing data with substitute values, each of which has strengths and weaknesses. The downside to KNN is that in order to know which observations are the closest to the missing value, it needs to calculate the distance between the missing value and every single observation. An alternative and more scalable strategy than KNN is to fill in the missing values of numerical data with the mean, median, or mode.
5.0 Introduction
It is often useful to measure objects not in terms of their quantity but in terms of some quality. We frequently represent qualitative information in categories such as gender, colors, or brand of car. Sets of categories with no intrinsic ordering are called nominal. When a set of categories has some natural ordering we refer to it as ordinal. In this chapter we will cover techniques for making this transformation as well as overcoming other challenges often encountered when handling categorical data.
5.1 Encoding Nominal Categorical Features
This section covers what to do when you have a feature with nominal classes that have no intrinsic ordering (e.g., apple, pear, banana) and you want to encode the feature into numerical values.
class sklearn.preprocessing.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False) Binarize labels in a one-vs-all fashion. [16]
class sklearn.preprocessing.MultiLabelBinarizer(*, classes=None, sparse_output=False) Transform between iterable of iterables and a multilabel format. [17]
In 63 we can encode the feature using scikit-learn’s LabelBinarizer. In 64 we can use the classes_ attribute to output the classes. In 65, if we want to reverse the one-hot encoding, we can use inverse_transform. In 66 we can even use pandas to one-hot encode the feature. In 67 shows a helpful capability of scikit-learn: handling a situation where each observation lists multiple classes, using MultiLabelBinarizer. In 68, once again, we can see the classes with the classes_ attribute.
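A minimal sketch of the encodings described in In 63 to In 68 (the class names are illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Nominal feature with three classes
feature = np.array([["Texas"], ["California"], ["Texas"], ["Delaware"], ["Texas"]])

# One-hot encode with LabelBinarizer
one_hot = LabelBinarizer()
encoded = one_hot.fit_transform(feature)
print(encoded)
print(one_hot.classes_)                    # view the classes
print(one_hot.inverse_transform(encoded))  # reverse the encoding

# One-hot encoding with pandas
print(pd.get_dummies(feature[:, 0]))

# Observations that each list multiple classes
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida")]
one_hot_multiclass = MultiLabelBinarizer()
print(one_hot_multiclass.fit_transform(multiclass_feature))
print(one_hot_multiclass.classes_)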
Note : We might think the proper strategy is to assign each class a numerical value (e.g., Texas = 1, California = 2). However, when our classes have no intrinsic ordering (e.g., Texas isn’t “less” than California), our numerical values erroneously create an ordering that is not present.
The proper strategy is to create a binary feature for each class in the original feature. This is often called one-hot encoding (in machine learning literature) or dummying (in statistical and research literature). Our solution’s feature was a vector containing three classes (i.e., Texas, California, and Delaware). In one-hot encoding, each class becomes its own feature with 1s when the class appears and 0s otherwise. Because our feature had three classes, one-hot encoding returned three binary features (one for each class). By using one-hot encoding we can capture the membership of an observation in a class while preserving the notion that the class lacks any sort of hierarchy.
Finally, it is often recommended that after one-hot encoding a feature, we drop one of the one-hot encoded features in the resulting matrix to avoid linear dependence.
5.2 Encoding Ordinal Categorical Features
This section covers what to do when you have an ordinal categorical feature (e.g., high, medium, low) and you want to transform it into numerical values.
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method=_NoDefault.no_default), Replace values given in to_replace with value. [18]
In 70 we can use the pandas DataFrame replace method to transform string labels to numerical equivalents, so the output is 1 (low), 1 (low), 2 (medium), 2 (medium), 3 (high). In 71: often we have a feature with classes that have some kind of natural ordering; a famous example is the Likert scale. It is important that our choice of numeric values is based on our prior information about the ordinal classes. In our solution, high is literally three times larger than low. This is fine in many instances but can break down if the assumed intervals between the classes are not equal. In 72, the distance between Low and Medium would be the same as the distance between Medium and Barely More Than Medium, which is almost certainly not accurate. The best approach is to be conscious about the numerical values mapped to classes.
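A minimal sketch of the ordinal mapping described above (the class names and chosen numeric values are illustrative):

import pandas as pd

# Ordinal feature with a natural ordering
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# Map each class to a numeric value that respects the ordering
scale_mapper = {"Low": 1, "Medium": 2, "High": 3}
# Recent pandas versions may emit a FutureWarning here;
# dataframe["Score"].map(scale_mapper) is an alternative
print(dataframe["Score"].replace(scale_mapper))

# When the intervals between classes are unequal, choose values that reflect it,
# e.g. "Barely More Than Medium" sits just above "Medium"
dataframe2 = pd.DataFrame(
    {"Score": ["Low", "Medium", "Barely More Than Medium", "High"]})
scale_mapper2 = {"Low": 1, "Medium": 2, "Barely More Than Medium": 2.1, "High": 3}
print(dataframe2["Score"].replace(scale_mapper2))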
Note : The warning that appears in the output is a FutureWarning, indicating that this usage of replace will change or be removed in a future version of pandas, and it provides some suggested alternatives.
5.3 Encoding Dictionaries of Features
This section covers what to do when you have a dictionary and want to convert it into a feature matrix.
class sklearn.feature_extraction.DictVectorizer(*, dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True), Transforms lists of feature-value mappings to vectors. [19]
In 79 Firstly, we create a list data_dict containing multiple dictionaries, where each dictionary represents a sample, with keys being the names of features and values being the feature values.
Next, we create a DictVectorizer object and apply it to data_dict. We set the parameter sparse=False to get a dense feature matrix.
Then, we use the fit_transform method to transform data_dict into the feature matrix features, where each row represents a sample and each column represents a feature, with the values in the matrix being the corresponding feature values.
Finally, we view the transformed feature matrix features.
By default DictVectorizer outputs a sparse matrix that only stores elements with a value other than 0. This can be very helpful when we have massive matrices (often encountered in natural language processing) and want to minimize the memory requirements. We can force DictVectorizer to output a dense matrix using sparse=False.
In 84 we can get the name of each generated feature using the get_feature_names method. In 85, while not necessary, for the sake of illustration we can create a pandas DataFrame to view the output better. The output shows that each row corresponds to a sample, each column corresponds to a feature, and the values in the matrix represent the values of the features.
In 86 This is a common situation when working with natural language processing. For example, we might have a collection of documents and for each document we have a dictionary containing the number of times every word appears in the document. Using DictVectorizer, we can easily create a feature matrix where every feature is the number of times a word appears in each document.
Note : In 84, get_feature_names_out is the method name used in newer scikit-learn versions (get_feature_names is deprecated).
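A minimal sketch of the DictVectorizer workflow described in this section (the word-count dictionaries are illustrative):

from sklearn.feature_extraction import DictVectorizer

# Word-count dictionaries, one per document
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# sparse=False returns a dense NumPy array instead of a sparse matrix
dictvectorizer = DictVectorizer(sparse=False)
features = dictvectorizer.fit_transform(data_dict)
print(features)

# Names of the generated features (use get_feature_names on older versions)
print(dictvectorizer.get_feature_names_out())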
5.4 Imputing Missing Class Values
This section covers what to do when you have a categorical feature containing missing values that you want to replace with predicted values.
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None), Classifier implementing the k-nearest neighbors vote. [20]
The ideal solution is to train a machine learning classifier algorithm to predict the missing values, commonly a k-nearest neighbors (KNN) classifier.
In 87 first, we create a feature matrix X with a categorical feature and another feature matrix X_with_nan with missing values in the categorical feature. These feature matrices contain numerical features along with one categorical feature.
Next, we train a KNN classifier. Here, we set the parameters of KNeighborsClassifier, with K value as 3 and weight set to 'distance'. Then, we use the numerical features (i.e., X[:,1:]) as the training data for the model, and the categorical feature (i.e., X[:,0]) as the target variable.
Then, we use the trained model to predict the class of the features with missing values. For each row in X_with_nan, we extract the numerical features (i.e., X_with_nan[:,1:]), and use the trained model to predict its corresponding class.
Next, we combine the predicted class values with their other features, forming a feature matrix X_with_imputed with the missing values imputed. Thus, we obtain a complete feature matrix.
Finally, we vertically stack the imputed feature matrix X_with_imputed with the original feature matrix X, forming the final feature matrix. The red circle is the predicted value.
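A minimal sketch of predicting the missing class values with a KNN classifier (the numbers are illustrative; the first column holds the categorical feature):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Complete observations: column 0 is the categorical feature (0 or 1)
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])

# Observations whose categorical feature is missing
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])

# Train a KNN classifier: numerical features predict the class
clf = KNeighborsClassifier(3, weights="distance")
trained_model = clf.fit(X[:, 1:], X[:, 0])

# Predict the missing classes and reattach them to their numerical features
imputed_values = trained_model.predict(X_with_nan[:, 1:])
X_with_imputed = np.hstack((imputed_values.reshape(-1, 1), X_with_nan[:, 1:]))

# Join the imputed observations with the original ones
print(np.vstack((X_with_imputed, X)))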
An alternative solution is to fill in missing values with the feature’s most frequent value
In 88 we create a SimpleImputer object with the strategy set to 'most_frequent', meaning missing values will be replaced with the most frequent value for that feature.
Next, we use the fit_transform method to impute the complete feature matrix X_complete, replacing missing values with the most frequent values for their corresponding features.
Finally, we obtain a feature matrix with missing values imputed, where all missing values have been replaced with the most frequent values for their respective features. The red circle is the predicted value.
Note : When we have missing values in a categorical feature, our best solution is to open our toolbox of machine learning algorithms to predict the values of the missing observations. We can accomplish this by treating the feature with the missing values as the target vector and the other features as the feature matrix. KNN and imputing with the most frequent class are the commonly used approaches.
5.5 Handling Imbalanced Classes
This section covers what to do when you have a target vector with highly imbalanced classes and you want to make adjustments so that you can handle the class imbalance.
Collect more data. If that isn’t possible, change the metrics used to evaluate your model. If that doesn’t work, consider using a model’s built-in class weight parameters (if available), downsampling, or upsampling. We cover evaluation metrics in a later chapter, so for now let’s focus on class weight parameters, downsampling, and upsampling.
In 89 To demonstrate our solutions, we need to create some data with imbalanced classes. Fisher’s Iris dataset contains three balanced classes of 50 observations, each indicating the species of flower (Iris setosa, Iris virginica, and Iris versicolor). To unbalance the dataset, we remove 40 of the 50 Iris setosa observations and then merge the Iris virginica and Iris versicolor classes. The end result is a binary target vector indicating if an observation is an Iris setosa flower or not. The result is 10 observations of Iris setosa (class 0) and 100 observations of not Iris setosa (class 1).
In 90 Many algorithms in scikit-learn offer a parameter to weight classes during training to counteract the effect of their imbalance. While we have not covered it yet, RandomForestClassifier is a popular classification algorithm and includes a class_weight parameter; learn more about the RandomForestClassifier in Recipe 14.4. You can pass an argument explicitly specifying the desired class weights, In 91 Or you can pass balanced, which automatically creates weights inversely proportional to class frequencies.
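A minimal sketch of passing class weights, either explicitly or with "balanced" (the specific weights are illustrative):

from sklearn.ensemble import RandomForestClassifier

# Explicit class weights: make class 0 (the minority) count more
weights = {0: 0.9, 1: 0.1}
RandomForestClassifier(class_weight=weights)

# Or let scikit-learn weight classes inversely to their frequencies
RandomForestClassifier(class_weight="balanced")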
In 92 Alternatively, we can downsample the majority class or upsample the minority class. In downsampling, we randomly sample without replacement from the majority class (i.e., the class with more observations) to create a new subset of observations equal in size to the minority class. For example, if the minority class has 10 observations, we will randomly select 10 observations from the majority class and use those 20 observations as our data. Here we do exactly that using our unbalanced iris data. In 93 we use the np.vstack function to vertically stack the feature matrix of class 0 with the downsampled feature matrix of class 1. This results in a new feature matrix containing observations from both classes. Finally, we display the first five rows of the concatenated feature matrix.
In 94 Our other option is to upsample the minority class. In upsampling, for every observation in the majority class, we randomly select an observation from the minority class with replacement. The result is the same number of observations from the minority and majority classes. Upsampling is implemented very similarly to downsampling, just in reverse,
In 95 we use the np.vstack function to vertically stack the upsampled feature matrix of class 0 with the feature matrix of class 1. This results in a new feature matrix containing observations from both classes. Finally, we display the first five rows of the concatenated feature matrix.
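A minimal sketch of downsampling and upsampling on the unbalanced iris data described in In 89 (index handling follows the description above; no random seed is set, so results vary between runs):

import numpy as np
from sklearn.datasets import load_iris

# Load iris and unbalance it: 10 setosa (class 0) versus 100 non-setosa (class 1)
iris = load_iris()
features, target = iris.data[40:, :], iris.target[40:]
target = np.where(target == 0, 0, 1)

# Indices of each class
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]
n_class0, n_class1 = len(i_class0), len(i_class1)

# Downsample class 1 without replacement to the size of class 0
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)
downsampled = np.vstack((features[i_class0, :], features[i_class1_downsampled, :]))
print(downsampled[:5])

# Upsample class 0 with replacement to the size of class 1
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)
upsampled = np.vstack((features[i_class0_upsampled, :], features[i_class1, :]))
print(upsampled[:5])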
Note : Handling imbalanced classes is a common activity in machine learning. Our best strategy is simply to collect more observations—especially observations from the minority class. However, often this is just not possible, so we have to resort to other options.
A second strategy is to use a model evaluation metric better suited to imbalanced classes. Accuracy is often used as a metric for evaluating the performance of a model, but when imbalanced classes are present, accuracy can be ill suited.
A third strategy is to use the class weighting parameters included in implementations of some models. This allows the algorithm to adjust for imbalanced classes. Fortunately, many scikit-learn classifiers have a class_weight parameter, making it a good option.
The fourth and fifth strategies are related: downsampling and upsampling. The decision between using downsampling and upsampling is context-specific, and in general we should try both to see which produces better results.
[1] Machine Learning with Python Cookbook 2nd, by Kyle Gallatin and Chris Albon, O'Reilly, 2023. Chapter 4.
[2] sklearn.preprocessing.MinMaxScaler , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[3] sklearn.preprocessing.StandardScaler , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[4] sklearn.preprocessing.RobustScaler , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[5] sklearn.preprocessing.Normalizer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[6] sklearn.preprocessing.PolynomialFeatures , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[7] sklearn.preprocessing.FunctionTransformer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[8] sklearn.covariance.EllipticEnvelope , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[9] sklearn.cluster.KMeans , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[10] [Algorithm] K-means Clustering, ramonliao, 2018.
[11] numpy.isnan , © 2008-2022, NumPy Developers.
[12] pandas.DataFrame.dropna , © 2024, pandas via NumFOCUS, Inc. Hosted by OVHcloud.
[13] sklearn.impute.KNNImputer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[14] sklearn.impute.SimpleImputer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[15] Machine Learning with Python Cookbook 2nd, by Kyle Gallatin and Chris Albon, O'Reilly, 2023. Chapter 5.
[16] sklearn.preprocessing.LabelBinarizer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[17] sklearn.preprocessing.MultiLabelBinarizer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[18] pandas.DataFrame.replace , © 2024, pandas via NumFOCUS, Inc. Hosted by OVHcloud.
[19] sklearn.feature_extraction.DictVectorizer , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[20] sklearn.neighbors.KNeighborsClassifier , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.