GOAL : Learn about logistic regression and how to implement logistic regression with scikit-learn.
Learning experience: Through this assignment, I gained a deeper understanding of logistic regression. Whether the task is binary or multiclass classification, logistic regression can handle it effectively. This exercise made it evident how powerful logistic regression is, and I learned that the lbfgs solver performs well on face datasets. The assignment familiarized me with the application of logistic regression and deepened my understanding of image classification tasks.
Working environment:
OS: Windows 11 home
CPU : intel i9-13900k
GPU : Nvidia RTX 4090
Python Version : 3.12.2
Development environment: Jupyter Notebook
16.0 Introduction
Despite being called a regression, logistic regression is actually a widely used supervised classification technique. Logistic regression (and its extensions, like multinomial logistic regression) is a straightforward, well-understood approach to predicting the probability that an observation is of a certain class. In this chapter, we will cover training a variety of classifiers using logistic regression in scikit-learn.
What is logistic regression?
Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given data set of independent variables.
This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds, and this logistic function is represented by the following formulas:
p_i = 1/(1 + exp(-(Beta_0 + Beta_1*X_1 + … + Beta_k*X_k)))
logit(p_i) = ln(p_i/(1-p_i)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k
In this logistic regression equation, logit(pi) is the dependent or response variable and x is the independent variable. The beta parameter, or coefficient, in this model is commonly estimated via maximum likelihood estimation (MLE). This method tests different values of beta through multiple iterations to optimize for the best fit of log odds. All of these iterations produce the log likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimate. Once the optimal coefficient (or coefficients if there is more than one independent variable) is found, the conditional probabilities for each observation can be calculated, logged, and summed together to yield a predicted probability. For binary classification, a probability less than .5 will predict 0 while a probability greater than .5 will predict 1. After the model has been computed, it’s best practice to evaluate how well the model predicts the dependent variable, which is called goodness of fit. The Hosmer–Lemeshow test is a popular method to assess model fit.
Log odds can be difficult to make sense of within a logistic regression data analysis. As a result, exponentiating the beta estimates is common to transform the results into an odds ratio (OR), easing the interpretation of results. The OR represents the odds that an outcome will occur given a particular event, compared to the odds of the outcome occurring in the absence of that event. If the OR is greater than 1, then the event is associated with higher odds of generating a specific outcome. Conversely, if the OR is less than 1, then the event is associated with lower odds of that outcome occurring. Based on the equation above, the interpretation of an odds ratio can be stated as follows: the odds of success change by a factor of exp(c*Beta_1) for every c-unit increase in x. To use an example, let’s say that we were to estimate the odds of survival on the Titanic given that the person was male, and the odds ratio for males was .0810. We’d interpret the odds ratio as the odds of survival for males decreasing by a factor of .0810 when compared to females, holding all other variables constant.
Linear regression vs logistic regression
Both linear and logistic regression are among the most popular models within data science, and open-source tools, like Python and R, make the computation for them quick and easy.
Linear regression models are used to identify the relationship between a continuous dependent variable and one or more independent variables. When there is only one independent variable and one dependent variable, it is known as simple linear regression, but as the number of independent variables increases, it is referred to as multiple linear regression. For each type of linear regression, it seeks to plot a line of best fit through a set of data points, which is typically calculated using the least squares method.
Similar to linear regression, logistic regression is also used to estimate the relationship between a dependent variable and one or more independent variables, but it is used to make a prediction about a categorical variable versus a continuous one. A categorical variable can be true or false, yes or no, 1 or 0, et cetera. The unit of measure also differs from linear regression as it produces a probability, but the logit function transforms the S-curve into a straight line.
While both models are used in regression analysis to make predictions about future outcomes, linear regression is typically easier to understand. Linear regression also does not require as large a sample size as logistic regression, which needs an adequate sample to represent values across all the response categories. Without a larger, representative sample, the model may not have sufficient statistical power to detect a significant effect.
Types of logistic regression
There are three types of logistic regression models, which are defined based on categorical response.
Binary logistic regression: In this approach, the response or dependent variable is dichotomous in nature—i.e. it has only two possible outcomes (e.g. 0 or 1). Some popular examples of its use include predicting if an e-mail is spam or not spam or if a tumor is malignant or not malignant. Within logistic regression, this is the most commonly used approach, and more generally, it is one of the most common classifiers for binary classification.
Multinomial logistic regression: In this type of logistic regression model, the dependent variable has three or more possible outcomes; however, these values have no specified order. For example, movie studios want to predict what genre of film a moviegoer is likely to see to market films more effectively. A multinomial logistic regression model can help the studio to determine the strength of influence a person's age, gender, and dating status may have on the type of film that they prefer. The studio can then orient an advertising campaign of a specific movie toward a group of people likely to go see it.
Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes, but in this case, these values do have a defined order. Examples of ordinal responses include grading scales from A to F or rating scales from 1 to 5. [2]
Model
The logistic function is of the form:
p(x) = 1/(1 + exp(-(x - μ)/s))
where μ is a location parameter (the midpoint of the curve, where p(μ) = 1/2) and s is a scale parameter. This expression may be rewritten as:
p(x) = 1/(1 + exp(-(β0 + β1*x)))
where β0 = −μ/s and is known as the intercept (it is the vertical intercept or y-intercept of the line y = β0 + β1*x), and β1 = 1/s (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, μ = −β0/β1 and s = 1/β1.
Fit
The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given x_k and y_k, write p_k = p(x_k). The p_k are the probabilities that the corresponding y_k will equal one and 1 − p_k are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of β0 and β1 which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (y_k), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when that function is minimized. The log loss for the k-th point, ℓ_k, is:
ℓ_k = −ln(p_k) if y_k = 1, and ℓ_k = −ln(1 − p_k) if y_k = 0.
These can be combined into a single expression:
ℓ_k = −y_k ln(p_k) − (1 − y_k) ln(1 − p_k)
This expression is more formally known as the cross-entropy of the predicted distribution (𝑝𝑘,(1−𝑝𝑘)) from the actual distribution (𝑦𝑘,(1−𝑦𝑘)), as probability distributions on the two-element space of (pass, fail).
The sum of these, the total loss −ℓ = Σ_k ℓ_k, is the overall negative log-likelihood, and the best fit is obtained for those choices of β0 and β1 for which −ℓ is minimized.
Alternatively, instead of minimizing the loss, one can maximize its negation, the (positive) log-likelihood:
ℓ = Σ_k (y_k ln(p_k) + (1 − y_k) ln(1 − p_k))
or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:
L = Π_k p_k^(y_k) * (1 − p_k)^(1 − y_k)
This method is known as maximum likelihood estimation.
The standard logistic function σ(t) satisfies σ(t) ∈ (0, 1) for all t. [3]
scikit-learn provides this model as sklearn.linear_model.LogisticRegression: [4]
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None), Logistic Regression (aka logit, MaxEnt) classifier.
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization is only supported by the ‘saga’ solver.
There are some important parameters :
penalty : {'l1', 'l2', 'elasticnet', None}, default='l2' : Specify the norm of the penalty:
None: no penalty is added;
'l2': add a L2 penalty term and it is the default choice;
'l1': add a L1 penalty term;
'elasticnet': both L1 and L2 penalty terms are added.
C : float, default=1.0 : Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
solver : {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' : Algorithm to use in the optimization problem. To choose a solver, you might want to consider the following aspects:
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
‘liblinear’ is limited to one-versus-rest schemes.
‘newton-cholesky’ is a good choice for n_samples >> n_features, especially with one-hot encoded categorical features with rare categories. Note that it is limited to binary classification and the one-versus-rest reduction for multiclass classification. Be aware that the memory usage of this solver has a quadratic dependency on n_features because it explicitly computes the Hessian matrix.
max_iter : int, default=100 : Maximum number of iterations taken for the solvers to converge.
multi_class : {'auto', 'ovr', 'multinomial'}, default='auto' : If the option chosen is 'ovr', then a binary problem is fit for each label. For 'multinomial' the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. 'multinomial' is unavailable when solver='liblinear'. 'auto' selects 'ovr' if the data is binary, or if solver='liblinear', and otherwise selects 'multinomial'.
And some commonly used methods (a short usage sketch follows this list):
decision_function(X) : Predict confidence scores for samples.
fit(X, y[, sample_weight]) : Fit the model according to the given training data.
predict(X) : Predict class labels for samples in X.
score(X, y[, sample_weight]) : Return the mean accuracy on the given test data and labels.
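To illustrate how these parameters and methods fit together, here is a minimal sketch on the Iris data (the object names and parameter values are illustrative, not taken from the assignment's own cells):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load a small example dataset
X, y = load_iris(return_X_y=True)

# A classifier combining a few of the parameters described above
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)

# fit, predict, decision_function, and score in action
clf.fit(X, y)
print(clf.predict(X[:2]))            # predicted class labels
print(clf.decision_function(X[:2]))  # confidence scores per class
print(clf.score(X, y))               # mean accuracy on (X, y)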
Finally, we briefly introduce the logistic regression algorithm's workflow (a minimal from-scratch sketch follows the list):
Data Preparation: Prepare a labeled training dataset containing independent variables (features) and corresponding dependent variables (target variables).
Parameter Initialization: Initialize model parameters, including the intercept term and weights for the independent variables.
Compute Predicted Probabilities: Use the logistic function to compute the probability of each sample belonging to the positive class based on the linear combination of independent variables.
Calculate Loss Function: Use the logarithmic loss function to measure the difference between the model's predicted values and the actual values. This quantifies the performance of the model and is used to update the model parameters.
Gradient Descent: Utilize gradient descent or its variants to update the model parameters based on the gradient of the loss function, aiming to minimize the loss. This improves the model's ability to accurately predict the class of samples.
Iterative Updates: Repeat steps 3 to 5 until the model converges or reaches a specified stopping condition.
Model Evaluation: Evaluate the performance of the trained model using a test dataset. This involves comparing the model's predictions with the actual observed values.
Model Application: Apply the trained model to new unseen data for prediction or classification purposes.
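As a sketch of steps 2 through 6 above, here is a minimal from-scratch implementation for the binary case using plain NumPy and batch gradient descent (the function and variable names, learning rate, and toy data are all illustrative assumptions; scikit-learn's own solvers are more sophisticated):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    # Step 2: initialize intercept and weights to zero
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):  # Step 6: iterate until the stopping condition
        # Step 3: predicted probabilities from the linear combination
        p = sigmoid(X @ w + b)
        # Step 4: log loss (tracked for monitoring)
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        # Step 5: gradient descent update of the parameters
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss

# Toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w, b, final_loss = train_logistic_regression(X, y)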
16.1 Training a Binary Classifier
This section will train a simple classifier model.
In 1 implements a logistic regression model using Scikit-learn.
Firstly, we import necessary modules and dataset from Scikit-learn library. In this example, we are using the Iris dataset, but only selecting the first 100 samples and keeping only two classes.
Next, we standardize the features to ensure that they have the same scale. This is done using the StandardScaler class.
Then, we create a LogisticRegression object, setting a seed value (random_state=0) to ensure consistency of results.
Finally, we fit the model to the standardized features and target variables using the fit method, in order to train the model.
The main functionality of this code snippet is to train a logistic regression model using the Iris dataset, which can predict whether a sample belongs to one of the two classes.
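The notebook cell itself is not reproduced here, but a sketch consistent with the description above would look like this (standard scikit-learn API; the variable names are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load data with only two classes (the first 100 iris samples)
iris = datasets.load_iris()
features = iris.data[:100, :]
target = iris.target[:100]

# Standardize features so they share the same scale
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create the logistic regression object with a fixed seed
logistic_regression = LogisticRegression(random_state=0)

# Train the model
model = logistic_regression.fit(features_standardized, target)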
In a logistic regression, a linear model (e.g., β0 + β1x) is included in a logistic (also called sigmoid) function, 1/(1 + exp(−z)), such that:
P(y_i = 1 | X) = 1/(1 + exp(−(β0 + β1*x_i)))
where P(y_i = 1 | X) is the probability of the ith observation’s target value, y_i, being class 1; X is the training data; β0 and β1 are the parameters to be learned; and e is Euler’s number. The effect of the logistic function is to constrain the value of the function’s output to between 0 and 1, so that it can be interpreted as a probability. If P(y_i = 1 | X) is greater than 0.5, class 1 is predicted; otherwise, class 0 is predicted.
In scikit-learn, we can train a logistic regression model using LogisticRegression. Once it is trained, we can use the model to predict the class of new observations. In 2 creates a new observation, new_observation, which contains four features: [0.5, 0.5, 0.5, 0.5].
Then, it uses the predict method of the model to make a class prediction for this new observation.
In this example, our observation was predicted to be class 1.
Additionally, In 3 we can see the probability that an observation is a member of each class.
The output means that our observation had a 17.7% chance of being class 0 and an 82.2% chance of being class 1.
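A sketch of the corresponding prediction cells, reusing the model object from the training sketch above (the values in the comments are the rounded probabilities reported in the text):

# Create a new observation with four feature values
new_observation = [[0.5, 0.5, 0.5, 0.5]]

# Predict its class and the per-class probabilities
model.predict(new_observation)        # e.g., array([1])
model.predict_proba(new_observation)  # e.g., array([[0.177, 0.823]])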
16.2 Training a Multiclass Classifier
This section covers training a classifier when you are given more than two classes.
In 8 implements a multi-class logistic regression model using Scikit-learn.
Firstly, we import necessary modules and dataset from Scikit-learn library. In this example, we are using the Iris dataset, which contains 150 samples and 4 features.
Next, we standardize the features to ensure that they have the same scale. This is done using the StandardScaler class.
Then, we create a logistic regression object. In this example, we set the parameters random_state=0 and multi_class="ovr". The random_state=0 ensures the consistency of results, while multi_class="ovr" indicates that we are using the one-vs-rest strategy to handle multi-class classification problems.
Finally, we fit the model to the standardized features and target variables using the fit method, in order to train the model.
The main functionality of this code snippet is to train a multi-class logistic regression model using the Iris dataset, which can predict the class of samples based on their features.
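A sketch of the multiclass training cell, consistent with the description above (variable names are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load the full iris data (three classes, 150 samples, 4 features)
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Standardize features
features_standardized = StandardScaler().fit_transform(features)

# One-vs-rest logistic regression
logistic_regression = LogisticRegression(random_state=0, multi_class="ovr")
model = logistic_regression.fit(features_standardized, target)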
In 9 & 10 we also create a new observation and use the predict method of the model to make a class prediction for this new observation.
In this example, our observation was predicted to be class 2.
Additionally, In 10 we can see the probability that an observation is a member of each class.
The output means that our observation had a 3% chance of being class 0, a 40.6% chance of being class 1, and a 55.4% chance of being class 2.
On their own, logistic regressions are only binary classifiers, meaning they cannot handle target vectors with more than two classes. However, two clever extensions to logistic regression do just that. First, in one-vs-rest logistic regression (OvR), a separate model is trained for each class, predicting whether an observation is that class or not (thus making it a binary classification problem). It assumes that each classification problem (e.g., class 0 or not) is independent.
Alternatively, in multinomial logistic regression (MLR), the logistic function we saw in Recipe 16.1 is replaced with a softmax function:
P(y_i = k | X) = exp(β_k * x_i) / Σ_{j=1}^{K} exp(β_j * x_i)
where P(y_i = k | X) is the probability of the ith observation’s target value, y_i, being in class k, and K is the total number of classes. One practical advantage of MLR is that its predicted probabilities using the predict_proba method are more reliable (i.e., better calibrated).
When using LogisticRegression we can select which of the two techniques we want via the multi_class argument: 'ovr' or 'multinomial' (the default, 'auto', picks between them as described earlier). We can switch to MLR explicitly by setting the argument to 'multinomial'.
Note: the one-vs.-rest (OvR, also called one-vs.-all, OvA, or one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued score for their decision (see also scoring rule), rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample. [5]
16.3 Reducing Variance Through Regularization
This section will reduce the variance of your logistic regression model.
In 11 utilizes the LogisticRegressionCV class from the Scikit-learn library, which is the cross-validation (CV) version of the logistic regression model.
Firstly, we import necessary modules and dataset from the Scikit-learn library. In this example, we use the Iris dataset, which contains 150 samples and 4 features.
Next, we standardize the features to ensure they have the same scale. This is done using the StandardScaler class.
Then, we create a LogisticRegressionCV object. This object uses cross-validation to select the best regularization parameter. In this example, we set penalty='l2' (using L2 regularization) and Cs=10 (testing 10 candidate regularization parameter values), along with other parameters like random_state=0 and n_jobs=-1 (using all available CPU cores for parallel computation).
Finally, we fit the model to the standardized features and target variables using the fit method, to train the model.
The main functionality of this code snippet is to train a logistic regression model using the Iris dataset while using cross-validation to select the best regularization parameter. This helps improve the model's generalization ability and reduces overfitting.
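A sketch of the cross-validated version, consistent with the description above (variable names are assumptions):

from sklearn.linear_model import LogisticRegressionCV
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load and standardize the iris data
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Cross-validated search over 10 candidate values of C with an L2 penalty,
# using all available CPU cores
logistic_regression = LogisticRegressionCV(
    penalty="l2", Cs=10, random_state=0, n_jobs=-1)
model = logistic_regression.fit(features_standardized, target)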
Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize, typically the L1 or L2 penalty. The L1 penalty is:
α Σ_{j=1}^{p} |β̂_j|
where β̂_j is the parameter of the jth of p features being learned, and α is a hyperparameter denoting the regularization strength. The L2 penalty is:
α Σ_{j=1}^{p} β̂_j²
Higher values of α increase the penalty for larger parameter values (i.e., more complex models). scikit-learn follows the common convention of using C instead of α, where C is the inverse of the regularization strength: C = 1/α. To reduce variance while using logistic regression, we can treat C as a hyperparameter to be tuned to find the value of C that creates the best model. In scikit-learn we can use the LogisticRegressionCV class to efficiently tune C. LogisticRegressionCV’s parameter Cs can either accept a range of values for C to search over (if a list of floats is supplied as an argument) or, if supplied an integer, will generate a list of that many candidate values drawn from a logarithmic scale between 1e−4 and 1e4.
16.4 Training a Classifier on Very Large Data
This section will train a simple classifier model on a very large set of data.
In 14 utilizes the LogisticRegression class from Scikit-learn to create a logistic regression model.
Firstly, we import the LogisticRegression class, datasets module, and StandardScaler class from Scikit-learn library. Then, we load the Iris dataset, which contains features and target variables.
Next, we standardize the features using StandardScaler to ensure they have the same scale.
Then, we create a LogisticRegression object with random_state=0 and solver="sag" parameters. Here, random_state=0 ensures the consistency of results, while solver="sag" specifies the use of the stochastic average gradient descent (SAG) optimization algorithm.
Finally, we use the fit method to fit the model to the standardized features and target variables to train the model.
The main functionality of this code is to train a logistic regression model using the Iris dataset and optimize the model's parameters using the SAG optimization algorithm. This helps improve the model's performance and convergence speed.
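A sketch of this cell, consistent with the description above (variable names are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load and standardize the iris data; scaling matters for the sag solver
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Stochastic average gradient descent solver
logistic_regression = LogisticRegression(random_state=0, solver="sag")
model = logistic_regression.fit(features_standardized, target)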
scikit-learn’s LogisticRegression offers a number of techniques for training a logistic regression, called solvers. Most of the time scikit-learn will select the best solver automatically for us or warn us that we cannot do something with that solver. However, there is one particular case we should be aware of.
Stochastic average gradient descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important. We can set our learning algorithm to use this solver by setting solver="sag".
16.5 Handling Imbalanced Classes
This section covers handling imbalanced classes.
In 17 utilizes the LogisticRegression class from Scikit-learn to create a logistic regression model.
Firstly, we import the LogisticRegression class, datasets module, and StandardScaler class from Scikit-learn library. Then, we load the Iris dataset, which contains features and target variables.
Then, we make the target classes highly imbalanced by removing the first 40 observations. Next, we transform the target variable into a binary class, where the label for class 0 is 0 and the label for other classes is 1.
Next, we standardize the features using StandardScaler to ensure they have the same scale.
Then, we create a LogisticRegression object with random_state=0 and class_weight="balanced" parameters. Here, random_state=0 ensures the consistency of results, while class_weight="balanced" specifies adjusting the class weights to balance the classes, addressing the issue of class imbalance.
Finally, we use the fit method to fit the model to the standardized features and target variables to train the model.
The main functionality of this code is to train a logistic regression model using the Iris dataset while handling the issue of class imbalance. This helps improve the model's predictive performance for minority classes.
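A sketch of this cell, consistent with the steps described above (variable names are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load the iris data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Make the classes highly imbalanced by removing the first 40 observations,
# then collapse the target to a binary problem (class 0 vs. everything else)
features = features[40:, :]
target = target[40:]
target = np.where(target == 0, 0, 1)

# Standardize features
features_standardized = StandardScaler().fit_transform(features)

# Weight classes inversely proportional to their frequency
logistic_regression = LogisticRegression(random_state=0, class_weight="balanced")
model = logistic_regression.fit(features_standardized, target)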
Like many other learning algorithms in scikit-learn, LogisticRegression comes with a built-in method of handling imbalanced classes. If we have highly imbalanced classes and have not addressed it during preprocessing, we have the option of using the class_weight parameter to weight the classes to make certain we have a balanced mix of each class. Specifically, the balanced argument will automatically weight classes inversely proportional to their frequency:
w_j = n / (k * n_j)
where w_j is the weight for class j, n is the number of observations, n_j is the number of observations in class j, and k is the total number of classes.
First, we introduce the Olivetti faces dataset.
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. The sklearn.datasets.fetch_olivetti_faces function is the data fetching / caching function that downloads the data archive from AT&T.
As described on the original website:
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
Data Set Characteristics:
Classes : 40
Samples total : 400
Dimensionality : 4096
Features : real, between 0 and 1
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms.
The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised perspective.
The original dataset consisted of 92 × 112 images, while the version available here consists of 64 × 64 images.
When using these images, please give credit to AT&T Laboratories Cambridge. [6]
The figure below shows a face for each of the 40 people in the dataset.
Next we plot the first 10 images for the first two persons (face id=0 and face id=1) from the Olivetti Faces dataset.
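The exact plotting cell is not shown; a sketch that produces an equivalent figure might look like this (the images attribute holds the 64 × 64 faces, with ten consecutive images per person; figure size and titles are illustrative):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces

# Download (or load the cached copy of) the Olivetti faces
faces = fetch_olivetti_faces()

# Plot the first 10 images for face id 0 and face id 1
fig, axes = plt.subplots(2, 10, figsize=(15, 3))
for person in range(2):
    for i in range(10):
        ax = axes[person, i]
        ax.imshow(faces.images[person * 10 + i], cmap="gray")
        ax.set_title(f"id={person}")
        ax.axis("off")
plt.show()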
In 2 implements a logistic regression classifier using the Olivetti Faces dataset.
Firstly, the fetch_olivetti_faces function from the Scikit-learn library is utilized to load the Olivetti Faces dataset. This dataset comprises 400 grayscale images of human faces, each of size 64x64 pixels, with 40 distinct individuals, each having 10 different pictures.
Next, the dataset is split into features (X) and target labels (y). X contains the pixel values of each facial image, while y contains the corresponding labels indicating the person's identity.
Subsequently, the train_test_split function is employed to partition the dataset into training and testing sets, with the testing set constituting 20% of the total dataset. This ensures that different data are used for training and evaluating the model.
Following that, the features are standardized to ensure that each feature contributes equally to the model. This is achieved using the StandardScaler to standardize the features.
Then, an instance of the logistic regression classifier, LogisticRegression, is created with default parameters.
The model is then trained using the training set, i.e., the standardized features X_train_scaled and their corresponding target labels y_train.
Next, the trained model is used to make predictions on the testing set, and the predicted labels are stored in y_pred.
Finally, the performance of the model is evaluated using the accuracy_score and classification_report functions. The accuracy_score function calculates the accuracy of the model, while the classification_report function provides a detailed classification report, including precision, recall, F1-score, and other metrics.
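A sketch consistent with the pipeline described above (the 80/20 split seed and the variable names are assumptions):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the data: 400 images of 40 people, flattened to 4096 pixel features
faces = fetch_olivetti_faces()
X, y = faces.data, faces.target

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize the pixel features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic regression with default parameters (lbfgs solver)
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))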
For the accuracy we use accuracy_score from scikit-learn. If ŷ_i is the predicted value of the ith sample and y_i is the corresponding true value, then the fraction of correct predictions over n_samples is defined as:
accuracy(y, ŷ) = (1/n_samples) Σ_{i=0}^{n_samples−1} 1(ŷ_i = y_i)
where 1(x) is the indicator function. [7]
The table below shows the results we tested; logistic regression with the lbfgs solver gives the best accuracy.
Let's try different methods! The left table below shows results for SVM with GridSearchCV over different parameters; the right table shows KNN with different values of n.
Finally, we compare all the methods; the table below shows the results. Logistic regression gives the best result on this Olivetti faces dataset.
According to this paper [8], I think our results are correct.
Now let's change the dataset to LFW People. First, we introduce it: this dataset is a collection of JPEG pictures of famous people collected over the internet; all details are available on the official website:
http://vis-www.cs.umass.edu/lfw/
Each picture is centered on a single face. The typical task is called Face Verification: given a pair of two pictures, a binary classifier must predict whether the two images are from the same person.
An alternative task, Face Recognition or Face Identification is: given the picture of the face of an unknown person, identify the name of the person by referring to a gallery of previously seen pictures of identified persons.
Both Face Verification and Face Recognition are tasks that are typically performed on the output of a model trained to perform Face Detection. The most popular model for Face Detection is called Viola-Jones and is implemented in the OpenCV library. The LFW faces were extracted by this face detector from various online websites.
Data Set Characteristics:
Classes : 5749
Samples total : 13233
Dimensionality : 5828
Features : real, between 0 and 255
The figure below shows some images from the dataset.
We follow the usage section from the scikit-learn documentation [9].
In 25 loads the face recognition dataset via the fetch_lfw_people function.
Firstly, it imports necessary functions and classes from sklearn.datasets.
Then, it loads the data from the Labeled Faces in the Wild (LFW) dataset using the fetch_lfw_people function. This dataset has been appropriately preprocessed and scaled, with a fixed size. In this code, parameters min_faces_per_person=70 and resize=0.4 are set, indicating that each person has at least 70 images, and each image is scaled to 40% of its original size.
Next, the dataset is split into features (X) and target labels (y). Then, the dataset is further divided into training and testing sets, with 20% of the data reserved for testing and the rest used for training the model.
Standardization of features is performed to ensure consistent scaling across features, aiding faster convergence of the model. Here, StandardScaler is used to standardize the features.
Subsequently, an instance of logistic regression classifier is created. The classifier utilizes the sag solver (Stochastic Average Gradient descent) to optimize the model. During the training of the model, a progress bar is displayed to show the progress of training, iterating the training process 10 times.
Finally, the model performs predictions, and the performance of the model is evaluated by calculating the accuracy and generating a classification report.
This code is designed to classify the images in the face dataset, utilizing logistic regression classifier for model training and evaluation.
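A sketch consistent with the description above (the split seed and variable names are assumptions; the progress bar from the original cell is omitted):

from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Keep only people with at least 70 images; scale each image to 40% size
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = lfw.data, lfw.target

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize the pixel features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SAG solver with a small iteration budget, as described above
clf = LogisticRegression(solver="sag", max_iter=10)
clf.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))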
The table below shows the results we tested; solver="saga" with max_iter=10 gives the best model.
Next we also try other methods. The left table below shows results for SVM with GridSearchCV over different parameters; the right table shows KNN with different values of n.
Finally, we compare all the methods we have tried. We can again conclude that logistic regression achieves the best accuracy on the LFW People dataset.
Full code : github
[1] Chapter 16. Logistic Regression, Machine Learning with Python - Theory and Implementation.
[2] What is logistic regression?, IBM.
[3] Logistic regression, Wikipedia.
[4] sklearn.linear_model.LogisticRegression , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[5] Multiclass classification, Wikipedia.
[6] The Olivetti faces dataset , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[7] Accuracy score , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[8] Mittal, Harshit, Evaluating The Performance of Feature Extraction Techniques Using Classification Techniques (August 19, 2023). Computer Science & Information Technology (CS & IT), Volume 13, Number 14, August 2023, 4th International Conference on NLP Trends & Technologies (NLPTT 2023), August 19 ~ 20, 2023, Chennai, India. Volume Editors : David C. Wyld, Dhinaharan Nagamalai (Eds), ISBN : 978-1-923107-0, Available at SSRN: https://ssrn.com/abstract=4550494
[9] The Labeled Faces in the Wild face recognition dataset , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.