Samira Samadi - Fair-PCA homepage

Fair-PCA homepage

( Read the full paper, access the code, watch the 3-min video, and read more in the press.)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique which projects the high dimensional data into a low dimensional subspace with the goal of minimizing the reconstruction error between the original data points and the projected ones. Many scientific disciplines have adopted PCA as a default preprocessing step, both to avoid the curse of dimensionality and also to do exploratory/explanatory data analysis (projecting the data into a number of dimensions humans can more easily visualize). PCA is perhaps the most fundamental dimensionality reduction technique in the sciences.

Assume now that the data set consists of two protected populations, such as male records and female records. We show on real-world data sets that PCA can incur much higher reconstruction error for one population than another, even when the populations are of similar size. As an example, consider the Labeled Faces in the Wild (LFW) data set [2]. Figures below show that PCA has higher reconstruction error for women than men faces even if men and women faces are sampled with equal weight.

The left figure below shows that the average reconstruction error of male vs female faces incurred by PCA. For the right figure, we sampled 1000 faces with male and female faces equiprobably. As it can be seen, the average reconstruction error for the female population is still considerably higher than for the male population (up to 10% higher error).

Fair-PCA [1] defines a linear dimensionality reduction technique which aims to represent different populations with similar fidelity -- measured in terms of marginal error or loss. The loss of a population A is defined as the deviation from its optimal low-dimensional approximation :

Given an n-dimensional data set with populations A and B, Fair PCA aims to find a rank-d approximation of the data such that it is the solution to the following optimization:

Our first observation is that an optimal solution to the Fair PCA problem incurs the same average loss for two populations. Moreover, we give a polynomial-time algorithm that finds a matrix U* with rank at most d+1 and prove that the Fair PCA objective value at U* is at most the optimal objective value for d dimensions. The algorithm is based on first solving the Semidefinite Program (SDP) relaxation of the above optimization and then finding an extreme point solution of a derived Linear Program (LP). Finally, we propose an efficient implementation of our algorithm using a Multiplicative Weights Update method.

The figures below show the results of running our algorithm on the LFW [2] and the Credit Data set [3]. As we expect, as the number of dimensions increases, the average reconstruction error of every population decreases. Note that for LFW, the original data is in 1764 dimensions (42×42 pixels), therefore, at 20 dimensions we still see a considerable reconstruction error.

In order to see how fair are each of these methods, we need to zoom in further and look at the average loss of populations. Figures below show the average loss of each population as the result of applying vanilla PCA and Fair PCA on both data sets. Note that at the optimal solution of Fair PCA, the average loss is the same for both populations, therefore we have only one line for “Fair loss”. We observe that vanilla PCA suffers much higher average loss for female faces than male faces. After running fair PCA, we observe that the average loss for fair PCA is relatively in the middle of the average loss for male and female. So, there is improvement in terms of the female average loss which comes with a cost in terms of male average loss. Similar observations holds for the Credit data set.

References:

[1] Samira Samadi, Uthaipon Tantipongpipat, Jamie Morgenstern, Mohit Singh, and Santosh Vempala. "The Price of Fair PCA: One Extra Dimension". In 32nd Conference on Neural Information Processing Systems (NIPS) 2018.

Access the code and data used in the paper at https://github.com/samirasamadi/Fair-PCA .

[2] Huang, Gary B., Marwan Mattar, Tamara Berg, and Eric Learned-Miller. "Labeled faces in the wild: A database forstudying face recognition in unconstrained environments." In Workshop on faces in real-life images: detection, alignment, and recognition. 2008.

This data set consists of 13232 images of dimension 1764 (size 42*42) including 2962 female and 10270 male faces.

[3] I-Cheng Yeh and Che-hui Lien. "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients". Expert Systems with Applications, 36(2): 2473–2480, 2009.

This data set consists of records of 30000 individuals with 21 features.