We re-implemented the model used by VPG to serve as a benchmark. Once this was complete, we split up to work on two separate models. The first was primarily algorithmic, built largely as an extension of VPG's existing work. The second leaned more heavily on machine learning, using a CNN to perform the entire analysis process.
Benchmark
We built the benchmark algorithm based on the information provided to us by Alex Page. Once an image has been read into the program, it works as a four-step process. First, we use an existing face detection algorithm to find faces in the image. We started with OpenCV's built-in Haar cascades, which proved inaccurate and quite slow on the personal laptops we used for development. We then transitioned to YuNet, a small but accurate face detection model that can run on a CPU instead of a GPU. We also kept only the largest face the model found, to avoid accidentally detecting anything else in the scene that looked somewhat like a face.
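As an illustration, the detection step might look like the following sketch using OpenCV's FaceDetectorYN wrapper for YuNet; the weights filename and input size here are placeholders, not our exact configuration.

```python
import cv2
import numpy as np

# Load YuNet through OpenCV's FaceDetectorYN wrapper. The ONNX weights
# path is illustrative; the input size is reset per image below.
detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx",  # hypothetical local weights file
    "",                                   # no separate config file
    (320, 320),                           # placeholder input size
)

def largest_face(image: np.ndarray):
    """Return the bounding box (x, y, w, h) of the largest detected face."""
    h, w = image.shape[:2]
    detector.setInputSize((w, h))
    _, faces = detector.detect(image)
    if faces is None:
        return None
    # Each row is [x, y, w, h, five landmark pairs..., score]; keeping the
    # face with the largest area filters out spurious detections.
    x, y, bw, bh = max(faces, key=lambda f: f[2] * f[3])[:4]
    return int(x), int(y), int(bw), int(bh)
```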
Once we've found the face in the scene, we isolate the desired region of interest (ROI). For the benchmark, we simply used the full face's bounding box. Then, we convert the image to the TSL color space and average the lightness (L) value of the pixels contained in the bounding box. If this average falls below a chosen threshold, we guess that the person in the video has dark skin; otherwise, we guess they have light skin.
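Because the L channel of TSL is the standard luma (L = 0.299R + 0.587G + 0.114B), the thresholding step reduces to a short computation. A minimal sketch follows; the threshold value is illustrative, not the tuned one we used.

```python
import numpy as np

# Illustrative threshold on normalized [0, 1] lightness; the real value
# was tuned on the dataset.
LIGHTNESS_THRESHOLD = 0.5

def classify_skin_tone(roi_bgr: np.ndarray) -> str:
    """Average the TSL lightness over the ROI and threshold it."""
    roi = roi_bgr.astype(np.float32) / 255.0
    b, g, r = roi[..., 0], roi[..., 1], roi[..., 2]  # OpenCV stores BGR
    lightness = 0.299 * r + 0.587 * g + 0.114 * b    # TSL's L channel
    return "dark" if lightness.mean() < LIGHTNESS_THRESHOLD else "light"
```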
Color Spaces
The color space algorithm was inspired by, and implemented as an extension of, the benchmark. Its main goal was to experiment with improving the model in a few ways: we tried different color spaces and different ROIs from which to sample pixels, and we implemented a CLI to make the program easier to use. Of the color spaces we tried (TSL, L*a*b*, and YCrCb), TSL worked the best. We also found that the full-face bounding box was causing problems, since it often included hair, background, and clothing. After testing several regions of the face, we found that a diamond-shaped region over the cheeks performed the best. We located this diamond using landmarks on the nose and mouth and extrapolated the cheeks' position from there. Using this ROI led to more accurate predictions.
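One way to construct such a region from YuNet's five landmarks (eyes, nose tip, mouth corners) is sketched below; the center and radius heuristics are assumptions for illustration, not the exact extrapolation we used.

```python
import cv2
import numpy as np

def cheek_diamond_mask(image, nose, mouth_corner, eye):
    """Build a binary mask for a diamond-shaped region over one cheek."""
    nose, mouth_corner, eye = map(np.asarray, (nose, mouth_corner, eye))
    center = (nose + mouth_corner + eye) / 3.0       # rough cheek center
    half = np.linalg.norm(nose - mouth_corner) / 2   # rough diamond radius
    diamond = np.array([
        center + (0, -half),   # top vertex
        center + (half, 0),    # right vertex
        center + (0, half),    # bottom vertex
        center + (-half, 0),   # left vertex
    ], dtype=np.int32)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, diamond, 255)
    return mask
```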
In the future, we would like to add three things to this model. First, we need to account for background lighting: it regularly caused misclassifications, and being able to estimate the scene's lighting from the whole image would yield a considerable boost in accuracy. Second, we would like to experiment with other ROIs - namely, the forehead and a combination of the forehead and cheeks. Finally, we would like to use the most common pixel lightness rather than the average of all pixel lightnesses. Ideally, we want to find the densest band of lightness values within a window of 5-10 units and base the guess on that band, as in the sketch below.
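One way to realize that idea is a sliding-window histogram mode; the following is a rough sketch, with the bin count and window width as assumptions within the 5-10 unit range mentioned above.

```python
import numpy as np

def modal_lightness(lightness_values: np.ndarray, window: int = 8) -> float:
    """Return the center of the most populated lightness band."""
    # Histogram over 8-bit lightness; one bin per unit.
    hist, edges = np.histogram(lightness_values, bins=256, range=(0, 256))
    # Sum counts over every run of `window` consecutive bins and keep
    # the densest one.
    sums = np.convolve(hist, np.ones(window, dtype=int), mode="valid")
    start = int(np.argmax(sums))
    return (edges[start] + edges[start + window]) / 2.0
```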
CNN
The Convolutional Neural Network (CNN) algorithm leverages a more machine-learning-based approach to classifying skin tones. Instead of outputting the skin tone in a particular color space and then thresholding it as light or dark, the model classifies the skin tone directly to a binary result of 0 for light and 1 for dark. In particular, images are resized to 128x128 pixels and passed to an input layer of shape (3, 128, 128), one channel per RGB color. The input then passes through three Convolutional Layers with 16, 32, and 64 filters respectively, each with a ReLU activation function and each followed by a MaxPool Layer with a 2x2 kernel. Lastly, the results are flattened into one dimension and passed through two Fully Connected Layers, ending in a sigmoid activation that produces the binary output. This architecture was chosen because it follows common rules of thumb in the current literature on image classification with CNNs, and because of the resources available to us in the scope of this project. Preprocessing frames to the shape (3, 128, 128) follows standard practice and could allow future versions of the model to integrate more easily with open-source pre-trained layers such as ResNet and DenseNet implementations. In addition, because the dataset provided by VPGTechnologies contains sensitive data about the study's participants, we wanted to train the model locally on our own personal computing resources, making a smaller, simpler model more practical.
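For concreteness, the architecture could be expressed in PyTorch roughly as follows. The 3x3 kernels, padding, hidden-layer width, and the activation between the two Fully Connected Layers are not specified above and are assumptions in this sketch.

```python
import torch.nn as nn

class SkinToneCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 128x128 -> 64x64
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),  # hidden width assumed
            nn.Linear(128, 1), nn.Sigmoid(),          # 0 = light, 1 = dark
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```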
To train the model, we created a dataset from frames of videos of 50 light-skinned and dark-skinned patients provided by VPGTechnologies. To keep the model from "memorizing" the patients themselves instead of their skin tones, we placed all of the frames from 2 randomly selected dark-skinned patients and 2 randomly selected light-skinned patients in the validation dataset only (3006 frames in total). We then added 2989 more randomly selected frames from the remaining patients to bring the validation dataset up to 20% of all the data provided, leaving the other 80% for training. Finally, we trained the model with a binary cross-entropy loss function, a batch size of 32, a learning rate of 0.001, and the Adam optimizer.
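A minimal training-loop sketch with those hyperparameters is shown below; `train_dataset` is assumed to yield (frame tensor, label) pairs with labels 0.0 (light) or 1.0 (dark), and the epoch count is illustrative.

```python
import torch
from torch.utils.data import DataLoader

model = SkinToneCNN()
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.BCELoss()  # binary cross-entropy on sigmoid outputs

for epoch in range(10):  # epoch count not specified in the report
    for frames, labels in loader:
        optimizer.zero_grad()
        preds = model(frames).squeeze(1)
        loss = criterion(preds, labels.float())
        loss.backward()
        optimizer.step()
```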
Results
In order to get a general sense of each model's efficacy on the dataset provided by VPGTechnologies, we created a validation testing program that tests each model on 3 frames from each of 5 measurements for 8 light-skinned and 8 dark-skinned patients (120 frames per skin-tone group). The results are shown in the table.
Clearly, the models have much better accuracy in classifying light skin tones than dark skin tones. For the Benchmark and Color Space models, this could be due to imperfectly chosen threshold values, difficult lighting conditions, or color spaces that are ineffective indicators of skin tone. For the CNN, it is likely due to the relative lack of dark-skinned patients in the provided dataset. Even so, the CNN predicted dark skin tones better than the other models, which shows promise for this method of skin tone detection in VPGTechnologies' HealthKam.
With regard to runtime, the Color Space algorithm and the CNN performed comparably, and both improved on the Benchmark algorithm. However, future versions of either could run slower: the Color Space algorithm if it adopts a more computationally expensive ROI, and the CNN if it adopts a more complex architecture.
Overall, both of these algorithms could certainly enhance the efficacy of VPGTechnologies' HealthKam. In addition to the future steps mentioned above for the Color Space method, future steps for the CNN include training the model in an environment with more compute power (allowing for more epochs and a more complex architecture), gathering more data, and further fine-tuning the hyperparameters. With the results from our tool, VPGTechnologies engineers can better understand how to improve skin-tone classification in their HealthKam app.