CelebA is a large, public face image dataset. It contains 202,599 images of celebrities, and each image is labeled with the presence or absence of 40 attributes. In this post we look at the kinds of attributes present in the dataset, the correlations between them, and whether we can learn to predict an attribute given the others. The analysis is done using Spark (Python) in Databricks.
The CelebA attribute data CSV can be found here. It has 41 columns. The first column contains the image names, while the rest contain one attribute each. Presence of an attribute is indicated by '1' and absence by '-1'. CelebA has the following 40 attributes:
5_o_Clock_Shadow,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,Blurry,Brown_Hair,Bushy_Eyebrows,Chubby,Double_Chin,Eyeglasses,Goatee,Gray_Hair,Heavy_Makeup,High_Cheekbones,Male,Mouth_Slightly_Open,Mustache,Narrow_Eyes,No_Beard,Oval_Face,Pale_Skin,Pointy_Nose,Receding_Hairline,Rosy_Cheeks,Sideburns,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young
Databricks is an accessible and easy-to-use platform for getting started with Spark (Python). Databricks Community Edition offers a free 6 GB cluster.
CSV files can be loaded into Databricks: from 'Tables', click 'Create Table', then drag and drop the CSV to upload it. Once it is uploaded, note its location, which will be used to read it in.
Now create a new notebook in the workspace, and we are ready to start. We read in the data as follows:
attr = sqlContext.read.format("csv").load("/FileStore/tables/ide6zy7i1487648448792/list_attr_celeba.csv", header='true')
for a in attr.columns[1:]: #convert the attribute columns (all but the first, which holds image names) from string to int
    attr = attr.withColumn(a, attr[a].cast("int"))
First we calculate the relative abundance of each attribute:
listOfAttributes = attr.columns[1:] #drop the first column as it holds the image names
numRows = attr.count()
print('number of faces:', numRows) #number of images (202599)
print('list of attributes:', listOfAttributes) #40 attributes
for a in listOfAttributes:
    numOnes = attr.filter(attr[a] == 1).count()
    print(a, numOnes/float(numRows)) #fraction of images showing each attribute
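For readers without a Spark cluster, the same per-attribute count can be sketched in plain Python on a toy sample. The rows and attribute values below are made up for illustration; they are not the real CelebA figures.

```python
# Toy stand-in for the attribute table: rows of {-1, 1} values (invented data).
rows = [
    {"Smiling": 1, "Male": -1},
    {"Smiling": 1, "Male": 1},
    {"Smiling": -1, "Male": 1},
    {"Smiling": 1, "Male": -1},
]

def abundance(rows, attribute):
    """Fraction of rows where the attribute is present (== 1)."""
    ones = sum(1 for r in rows if r[attribute] == 1)
    return ones / float(len(rows))

print("Smiling", abundance(rows, "Smiling"))  # 0.75
print("Male", abundance(rows, "Male"))        # 0.5
```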
Relative abundance of each attribute
We observe that almost half the attributes are present in fewer than 20% of the images. Only 3 attributes (No_Beard, Young and Attractive) are above 50%.
Some of the attributes might be correlated. Let us now find the correlations between each pair.
for a in range(len(listOfAttributes)):
    for b in range(a+1, len(listOfAttributes)):
        print((listOfAttributes[a], listOfAttributes[b]), attr.stat.corr(listOfAttributes[a], listOfAttributes[b]))
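attr.stat.corr computes the Pearson correlation, which on ±1-valued columns coincides with the phi coefficient for binary variables. A minimal plain-Python sketch of the same quantity, on toy ±1 columns rather than the CelebA data:

```python
import math

def pearson(xs, ys):
    """Plain-Python Pearson correlation (the default for attr.stat.corr)."""
    n = len(xs)
    mx = sum(xs) / float(n)
    my = sum(ys) / float(n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy +-1 columns: two perfectly anti-correlated attributes.
a = [1, 1, -1, -1]
b = [-1, -1, 1, 1]
print(pearson(a, b))  # -1.0
```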
The most positively correlated attribute pairs are:
The most negatively correlated attribute pairs are:
We find the expected patterns. The positive correlations include makeup-related attributes that usually go together, and, unsurprisingly, makeup is associated with Attractive. The negative correlations reflect males rarely wearing makeup and some facial hairstyles excluding others.
Correlation heat map between different pairs of attributes
Given the high correlations observed between certain attributes, we might ask: is it possible to predict an attribute given the others? To find out, we first split the data into training (80%) and testing (20%) sets, and then iterate over the attributes, training a logistic regression model to predict each attribute from all the others.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.regression import LabeledPoint
trainingData, testingData = attr.randomSplit([.8, .2], seed=1234)
for i in range(1, len(listOfAttributes)+1):
    #label: column i mapped from -1/1 to 0/1; features: all the other attribute columns
    pts = trainingData.rdd.map(lambda row: LabeledPoint(0.0 if row[i]==-1 else 1.0, [row[j] for j in range(1, len(row)) if j!=i]))
    model = LogisticRegressionWithLBFGS.train(pts, iterations=20) #logistic regression model
    model.clearThreshold() #default threshold is 0.5; clearing it makes predict() return probabilities
    predictionAndLabels = testingData.rdd.map(lambda row: (model.predict([row[j] for j in range(1, len(row)) if j!=i]), 0.0 if row[i]==-1 else 1.0)) #test phase
    metrics = BinaryClassificationMetrics(predictionAndLabels)
    print(listOfAttributes[i-1])
    print("Area under PR = %s" % metrics.areaUnderPR)
    print("Area under ROC = %s" % metrics.areaUnderROC)
Area under PR and ROC curves when predicting an attribute given the others. Attributes are sorted by increasing ROC area
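As a sanity check on what the ROC number means: the area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A plain-Python sketch of that quantity on toy (probability, label) pairs, invented here for illustration and shaped like the predictionAndLabels pairs above:

```python
def auc_roc(scored):
    """AUC-ROC as the probability that a random positive outranks a random
    negative. `scored` is a list of (score, label) pairs, labels 0.0/1.0."""
    pos = [s for s, l in scored if l == 1.0]
    neg = [s for s, l in scored if l == 0.0]
    # Count a full win when the positive scores higher, half a win for ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scored examples (not actual model output).
scored = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.3, 1.0), (0.2, 0.0)]
print(auc_roc(scored))  # 5 of 6 positive/negative pairs ranked correctly
```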
We see that, given the others, it is very easy to predict attributes such as Male, Wearing_Lipstick, No_Beard, Goatee and Heavy_Makeup. Attributes like Narrow_Eyes, Bangs, Pale_Skin, Pointy_Nose, Oval_Face and Big_Lips prove difficult to predict. These attributes do not have strong positive or negative correlations with the others, and hence carry independent pieces of information.
Given the correlation heatmap and the results above, it seems that any model (say, a deep neural network) seeking to predict all 40 attributes should keep these correlations in mind. It might be enough to focus a complex classifier such as a deep network on some of the attributes, and then predict the rest from its output using simple techniques like logistic regression. Attributes such as Narrow_Eyes, Bangs, Pale_Skin, Pointy_Nose, Oval_Face and Big_Lips, which are difficult to predict from the other attributes, should be treated with care, and classifiers detecting them have to be trained carefully.
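The second-stage idea can be sketched with synthetic data: a tiny plain-Python logistic regression that predicts one attribute from another, strongly anti-correlated one. Everything here is invented for illustration (the single Male feature, the 10% noise rate); it is a toy stand-in for predicting an attribute from a DCNN's other outputs, not the pipeline itself.

```python
import math
import random

random.seed(0)

def make_example():
    male = random.choice([1.0, -1.0])
    # Wearing_Lipstick is strongly anti-correlated with Male in CelebA;
    # the 10% flip rate below is an invented noise level.
    lipstick = -male if random.random() < 0.9 else male
    return [male], 1.0 if lipstick == 1.0 else 0.0

data = [make_example() for _ in range(1000)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression trained by batch gradient descent.
w, b = [0.0], 0.0
for _ in range(200):
    gw, gb = [0.0], 0.0
    for x, y in data:
        err = sigmoid(w[0] * x[0] + b) - y
        gw[0] += err * x[0]
        gb += err
    w[0] -= 0.1 * gw[0] / len(data)
    b -= 0.1 * gb / len(data)

accuracy = sum(1.0 for x, y in data
               if (sigmoid(w[0] * x[0] + b) > 0.5) == (y == 1.0)) / len(data)
print("accuracy:", accuracy)  # close to 0.9, the noise-free fraction
```

Even this one-feature model recovers most of the signal, which is exactly why strongly correlated attributes are cheap to predict once a few anchor attributes are known.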