CelebA is a large, public face image dataset. It contains 202,599 images of celebrities, and each image is labeled with the presence or absence of 40 attributes. In this post we look at the kinds of attributes present in the dataset, the correlations between them, and whether we can learn to predict an attribute given the others. The analysis is done using Spark (Python) in Databricks.
The CelebA attribute data CSV can be found here. It has 41 columns. The first column contains the image names, while the rest contain one attribute each. Presence of an attribute is indicated by '1' and absence by '-1'. CelebA has the following 40 attributes:
5_o_Clock_Shadow,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,Blurry,Brown_Hair,Bushy_Eyebrows,Chubby,Double_Chin,Eyeglasses,Goatee,Gray_Hair,Heavy_Makeup,High_Cheekbones,Male,Mouth_Slightly_Open,Mustache,Narrow_Eyes,No_Beard,Oval_Face,Pale_Skin,Pointy_Nose,Receding_Hairline,Rosy_Cheeks,Sideburns,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young
Databricks is an accessible and easy-to-use platform for getting started with Spark (Python). Databricks Community Edition offers a free 6 GB cluster.
CSV files can be loaded into Databricks: from 'Tables', click 'Create Table', then drag and drop the CSV to upload it. Once it is uploaded, note its location, which will be used to read it in.
Now create a new notebook in the workspace, and we are ready to start. We read in the data as follows:
attr = sqlContext.read.format("csv").load("/FileStore/tables/ide6zy7i1487648448792/list_attr_celeba.csv", header='true')
for a in attr.columns[1:]: #convert the attribute columns (all but the first, which holds image names) from string to int
    attr = attr.withColumn(a, attr[a].cast("int"))
First we calculate the relative abundance of each attribute:
listOfAttributes = attr.columns[1:] #drop the first column as it holds the image names
numRows = attr.count()
print('number of faces:', numRows) #number of images (202599)
print('list of attributes:', listOfAttributes) #40 attributes
for a in listOfAttributes:
    numOnes = attr.filter(attr[a] == 1).count()
    print(a, numOnes/float(numRows)) #fraction of images showing each attribute
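For readers without a Spark cluster, the same per-attribute count can be sketched in plain Python on a toy sample. The rows and attribute values below are made up for illustration; they are not the real CelebA figures.

```python
# Toy stand-in for the attribute table: rows of {-1, 1} values (invented data).
rows = [
    {"Smiling": 1, "Male": -1},
    {"Smiling": 1, "Male": 1},
    {"Smiling": -1, "Male": 1},
    {"Smiling": 1, "Male": -1},
]

def abundance(rows, attribute):
    """Fraction of rows where the attribute is present (== 1)."""
    ones = sum(1 for r in rows if r[attribute] == 1)
    return ones / float(len(rows))

print("Smiling", abundance(rows, "Smiling"))  # 0.75
print("Male", abundance(rows, "Male"))        # 0.5
```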
Relative abundance of each attribute
We observe that almost half the attributes are present in fewer than 20% of the images. Only 3 attributes (No_Beard, Young and Attractive) are above 50%.
Some of the attributes might be correlated. Let us now find the correlations between each pair.
for a in range(len(listOfAttributes)):
    for b in range(a+1, len(listOfAttributes)):
        print((listOfAttributes[a], listOfAttributes[b]), attr.stat.corr(listOfAttributes[a], listOfAttributes[b]))
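attr.stat.corr computes the Pearson correlation, which on ±1-valued columns coincides with the phi coefficient for binary variables. A minimal plain-Python sketch of the same quantity, on toy ±1 columns rather than the CelebA data:

```python
import math

def pearson(xs, ys):
    """Plain-Python Pearson correlation (the default for attr.stat.corr)."""
    n = len(xs)
    mx = sum(xs) / float(n)
    my = sum(ys) / float(n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy +-1 columns: two perfectly anti-correlated attributes.
a = [1, 1, -1, -1]
b = [-1, -1, 1, 1]
print(pearson(a, b))  # -1.0
```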
The most positively correlated attribute pairs are:
The most negatively correlated attribute pairs are:
We find the expected patterns. The positive correlations include makeup-related attributes that usually go together, and, unsurprisingly, makeup is associated with Attractive. The negative correlations reflect males rarely wearing makeup and some facial hairstyles excluding others.
Correlation heat map between different pairs of attributes
Given the high correlations observed between certain attributes, we might ask: is it possible to predict an attribute given the others? To find out, we first split the data into training (80%) and testing (20%) sets, and then iterate over the attributes, training a logistic regression model to predict each attribute from all the others.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.regression import LabeledPoint
trainingData, testingData = attr.randomSplit([.8, .2], seed=1234)
for i in range(1, len(listOfAttributes)+1):
    #label: column i mapped from -1/1 to 0/1; features: all the other attribute columns
    pts = trainingData.rdd.map(lambda row: LabeledPoint(0.0 if row[i]==-1 else 1.0, [row[j] for j in range(1, len(row)) if j!=i]))
    model = LogisticRegressionWithLBFGS.train(pts, iterations=20) #logistic regression model
    model.clearThreshold() #default threshold is 0.5; clearing it makes predict() return probabilities
    predictionAndLabels = testingData.rdd.map(lambda row: (model.predict([row[j] for j in range(1, len(row)) if j!=i]), 0.0 if row[i]==-1 else 1.0)) #test phase
    metrics = BinaryClassificationMetrics(predictionAndLabels)
    print(listOfAttributes[i-1])
    print("Area under PR = %s" % metrics.areaUnderPR)
    print("Area under ROC = %s" % metrics.areaUnderROC)
Area under PR and ROC curves when predicting an attribute given the others. Attributes are sorted by increasing ROC area
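As a sanity check on what the ROC number means: the area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A plain-Python sketch of that quantity on toy (probability, label) pairs, invented here for illustration and shaped like the predictionAndLabels pairs above:

```python
def auc_roc(scored):
    """AUC-ROC as the probability that a random positive outranks a random
    negative. `scored` is a list of (score, label) pairs, labels 0.0/1.0."""
    pos = [s for s, l in scored if l == 1.0]
    neg = [s for s, l in scored if l == 0.0]
    # Count a full win when the positive scores higher, half a win for ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scored examples (not actual model output).
scored = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.3, 1.0), (0.2, 0.0)]
print(auc_roc(scored))  # 5 of 6 positive/negative pairs ranked correctly
```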
We see that, given the others, it is very easy to predict attributes such as Male, Wearing_Lipstick, No_Beard, Goatee and Heavy_Makeup. Attributes like Narrow_Eyes, Bangs, Pale_Skin, Pointy_Nose, Oval_Face and Big_Lips prove difficult to predict. These attributes do not have strong positive or negative correlations with the others, and hence carry independent pieces of information.
Given the correlation heatmap and the results above, it seems that any model (say, a deep neural network) seeking to predict all 40 attributes should keep these correlations in mind. It might be enough to focus a complex classifier such as a deep network on some of the attributes, and then predict the rest from its output using simple techniques like logistic regression. Attributes such as Narrow_Eyes, Bangs, Pale_Skin, Pointy_Nose, Oval_Face and Big_Lips, which are difficult to predict from the other attributes, should be treated with care, and classifiers detecting them have to be trained carefully.
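The second-stage idea can be sketched with synthetic data: a tiny plain-Python logistic regression that predicts one attribute from another, strongly anti-correlated one. Everything here is invented for illustration (the single Male feature, the 10% noise rate); it is a toy stand-in for predicting an attribute from a DCNN's other outputs, not the pipeline itself.

```python
import math
import random

random.seed(0)

def make_example():
    male = random.choice([1.0, -1.0])
    # Wearing_Lipstick is strongly anti-correlated with Male in CelebA;
    # the 10% flip rate below is an invented noise level.
    lipstick = -male if random.random() < 0.9 else male
    return [male], 1.0 if lipstick == 1.0 else 0.0

data = [make_example() for _ in range(1000)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression trained by batch gradient descent.
w, b = [0.0], 0.0
for _ in range(200):
    gw, gb = [0.0], 0.0
    for x, y in data:
        err = sigmoid(w[0] * x[0] + b) - y
        gw[0] += err * x[0]
        gb += err
    w[0] -= 0.1 * gw[0] / len(data)
    b -= 0.1 * gb / len(data)

accuracy = sum(1.0 for x, y in data
               if (sigmoid(w[0] * x[0] + b) > 0.5) == (y == 1.0)) / len(data)
print("accuracy:", accuracy)  # close to 0.9, the noise-free fraction
```

Even this one-feature model recovers most of the signal, which is exactly why strongly correlated attributes are cheap to predict once a few anchor attributes are known.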