During the first months of the COVID pandemic, it was very important not to touch our faces (at least, that's what we were told). But at the time, I was still in prep school (10 hours of work a day, scratching my head over maths, physics and engineering sciences problems), which means I was touching my face a lot without being aware of it.
So, since I was getting interested in AI in my (rare) free time, I tried to solve the problem with technology: the computer would film me all day long, and notify me whenever I touched my face.
If you search for "do not touch your face" on Google today, you will find plenty of websites and apps doing exactly this (https://donottouchyourface.com/, https://facetouch.app/, http://www.stoptouchingyourface.tech/), but note how they were all created in 2020!
At the time, I researched the existing algorithms, and found only https://donottouchyourface.com/. This website uses a standard CNN in a few-shot learning setup (learning from 3 pictures of you touching and not touching your face).
Because of the few-shot learning setup, this works very badly, especially if lighting conditions, your position, etc. change in the image.
So my solution was to reuse existing neural networks for hand-pose detection and face detection, and to detect when there was an overlap between the hand points and the face. I used Mediapipe pre-trained neural networks, and made 2 versions of the algorithm: one in JavaScript (a web browser extension, easy for anyone to install), and one in Python (the code was a bit cleaner and less hacky in Python, because browser extensions are not supposed to access your camera in the background...).
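To give an idea of how simple the core check is, here is a minimal sketch of the Python version, assuming Mediapipe's `solutions` API; the helper name `is_touching` and the thresholds are mine, not the exact code I shipped.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=2)
face = mp.solutions.face_detection.FaceDetection(min_detection_confidence=0.5)

def is_touching(frame_bgr) -> bool:
    """True if any detected hand landmark falls inside the face bounding box."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    hand_results = hands.process(rgb)
    face_results = face.process(rgb)
    if not hand_results.multi_hand_landmarks or not face_results.detections:
        return False
    # Mediapipe outputs coordinates normalized to [0, 1], so the two models
    # can be compared directly without any pixel arithmetic.
    box = face_results.detections[0].location_data.relative_bounding_box
    for hand in hand_results.multi_hand_landmarks:
        for lm in hand.landmark:
            if (box.xmin <= lm.x <= box.xmin + box.width
                    and box.ymin <= lm.y <= box.ymin + box.height):
                return True
    return False
```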
To further improve my algorithm, I reworked the "overlap detection". Indeed, it sometimes mistook my hand being in front of my face (when eating, etc.) for me actually touching it. Or sometimes, the hand points were recognized weirdly and triggered a false match.
So I designed a very small neural network (2 Dense layers) to predict the overlap, taking into account the depth in z (estimated by Mediapipe), the amount of area overlap between the hand and the face, the "certainty" of the 2 Mediapipe neural nets, etc. I trained it on data from a few of my family members (we were confined), and found the predictions to be very consistent across individuals.
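For reference, here is roughly what such a classifier looks like in Keras; the exact feature list below is illustrative, not my original one.

```python
import tensorflow as tf

# Hypothetical per-frame feature vector: [hand depth (z), face depth (z),
# hand/face overlap area ratio, hand detection confidence, face detection
# confidence].
NUM_FEATURES = 5

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(really touching)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(features, labels, epochs=20)  # labeled frames from family members
```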
My algorithm was finally finished; I used it throughout the COVID years, and still do sometimes.
I tuned many aspects of the algorithm.
For example, there is a trade-off between detection quality and resource consumption, so I tested different FPS (frames per second) values. The optimum was... 1 FPS, because detection accuracy stayed high (you often touch your face for more than a second) with a minimal CPU footprint (less than 5%)! But when the user activates the camera feedback (disabled by default), I switch to maximum FPS (~30-50% CPU) for a better user experience.
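The throttling itself is trivial; here is a sketch of the capture loop, where `show_feedback` stands in for the (hypothetical) user toggle:

```python
import time
import cv2

cap = cv2.VideoCapture(0)
show_feedback = False  # the user's camera-feedback toggle (off by default)

while cap.isOpened():
    start = time.monotonic()
    ok, frame = cap.read()
    if not ok:
        break
    # ... run the detection pipeline on `frame` here ...
    if not show_feedback:
        # Background mode: sleep out the rest of the second for ~1 FPS.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))
    # In feedback mode, skip the sleep and grab frames as fast as possible.
```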
I also ran a hyper-parameter search over the Mediapipe settings to optimize the accuracy of the downstream task. For example, I don't care if both hands are not perfectly recognized at the same time, because you rarely touch your face with both hands at once :), and the touch will in any case be detected by the single-hand detector. This prevents cases where a ghost hand is detected on the face while the other hand is clearly visible. A sketch of this configuration follows.
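Concretely, in Mediapipe this boils down to constructor parameters like the ones below; the confidence values are illustrative, not the result of my actual search:

```python
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=False,
    max_num_hands=1,               # one confident hand beats two noisy ones
    min_detection_confidence=0.6,  # illustrative values; in practice these
    min_tracking_confidence=0.6,   # were tuned for the downstream accuracy
)
```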