Live Video Emotion Detector
Clemson University - Gianluca Buonanno
April 17, 2025
Estimated Read: 25min
Introduction:
The ability to interpret human emotions and demographic characteristics, such as age, from visual data is a complex task that has long fascinated researchers in computer science and psychology alike. For humans, reading emotions or estimating age from a face involves a nuanced interplay of contextual cues, subtle facial muscle movements, and experiential knowledge. However, replicating this capability in machines presents significant challenges, particularly when processing live video feeds in real time. Variations in facial features, expressions, lighting conditions, and camera angles introduce substantial complexity, while the demand for low-latency performance adds further constraints. My project aimed to address these challenges by developing a robust system capable of detecting emotions and estimating age from live video streams with high accuracy and efficiency. By leveraging advanced computer vision techniques and pre-trained deep learning models, I sought to create a solution that not only performs reliably under diverse conditions but also provides transparent insights into its decision-making process.
To achieve this, I utilized DeepFace, a powerful open-source framework that encapsulates pre-trained convolutional neural networks (CNNs) for facial analysis. DeepFace is built upon a vast dataset comprising thousands of facial images annotated with emotional and demographic attributes, enabling it to recognize patterns associated with emotions such as happiness, sadness, anger, and surprise, as well as estimate age. My system integrates DeepFace with real-time face detection and tracking mechanisms to monitor multiple individuals in a video feed. Each detected face is assigned a unique identifier, allowing the system to maintain continuity as faces move across frames. Beyond simply identifying emotions, the system quantifies its confidence in each prediction through confidence scores, providing users with a measure of reliability. Additionally, it estimates the age of each individual and tracks emotional trends over time, offering a dynamic view of affective states.
The project culminated in the development and comparison of two distinct implementations: vision_live_5.py and vision_live_6.py. These systems were designed to balance accuracy, speed, and computational efficiency, with each taking a slightly different approach to facial analysis. Vision_live_5.py prioritizes lightweight processing, analyzing frames at regular intervals in a separate thread to ensure smooth real-time performance. It employs dlib’s frontal face detector for efficient face detection and incorporates preprocessing techniques, such as histogram equalization, to enhance image quality under varying lighting conditions. In contrast, vision_live_6.py explores more advanced features, including 3D facial reconstruction and face recognition, inspired by cutting-edge research in facial analysis. However, these additional capabilities increase computational complexity, potentially compromising performance in resource-constrained environments.
Related Work:
Facial Action Coding System (FACS): The Facial Action Coding System (FACS), developed by Paul Ekman and Wallace V. Friesen, is a systematic approach to categorize all visually discernible facial movements. It decomposes expressions into Action Units (AUs), where each AU corresponds to a specific muscle movement, such as AU2 for raising the inner brow or AU12 for a lip corner pull (smiling). This system is grounded in anatomical and psychological research, providing a standardized framework for annotating facial expressions. While FACS is highly detailed and accurate, it is a manual process requiring trained coders, making it labor-intensive and unsuitable for real-time applications. For instance, analyzing a single video frame can take minutes, which is impractical for live video streams. However, FACS has been instrumental in creating annotated datasets, such as the Cohn-Kanade Extended (CK+) dataset, which are crucial for training machine learning models. FACS indirectly supports my work done on this project by providing the foundation for dataset annotations, ensuring the labeled data used in DeepFace is reliable. By leveraging datasets annotated with FACS, such as CK+, my project benefits from a standardized and validated approach to emotion labeling. This reduces the need for manual annotation, allowing most of my efforts to be focused on model integration rather than data preparation, since I'm using a pre-trained dataset.
Convolutional Neural Networks (CNNs): A class of deep learning models designed for processing grid-like data, such as images. In facial emotion recognition (FER), CNNs automatically learn hierarchical features from facial images, starting with low-level features like edges and progressing to high-level features like facial landmarks. Popular industry-aligned architectures include:
VGGNet: Known for its depth and simplicity, often used as a baseline.
ResNet: Utilizes residual connections to enable deeper networks, improving training stability.
Inception Networks: Employ multi-scale convolutions to capture features at different scales.
MobileNet: Optimized for efficiency, suitable for real-time applications on mobile devices.
Training CNNs from scratch requires large labeled datasets and is computationally expensive. To address this, researchers often use transfer learning, fine-tuning pre-trained models for FER tasks. Data augmentation techniques, such as random cropping, flipping, and rotation, are also employed to increase dataset diversity and improve generalization. Recent studies, such as "Image-based facial emotion recognition using convolutional neural network on Emognition dataset" (Scientific Reports), demonstrate the use of transfer learning with pre-trained models like Inception-V3 and MobileNet-V2, achieving high accuracy on datasets with ten emotions.
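To make the augmentation step above concrete, here is a minimal sketch of two of the mentioned transforms (horizontal flip and random crop) applied to a dummy 48x48 face image, the resolution used by FER2013. The function name, crop ratio, and dummy data are illustrative assumptions, not taken from any specific paper or from my scripts.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Return simple augmented variants of a face image (H x W x C array)."""
    variants = [image[:, ::-1]]          # horizontal flip (mirror left-right)
    h, w = image.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)  # illustrative 90% random crop
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    variants.append(image[top:top + ch, left:left + cw])
    return variants

# Dummy grayscale-like face at FER2013's 48x48 size (3 channels for generality)
face = rng.integers(0, 256, size=(48, 48, 3), dtype=np.uint8)
flipped, cropped = augment(face)
```

In a real training pipeline these variants would be generated on the fly each epoch, so the model rarely sees the exact same pixels twice.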
Datasets: Several labeled datasets are commonly used to train and benchmark FER models:
Cohn-Kanade Extended (CK+): Contains posed expressions, labeled with FACS AUs and emotion categories, widely used for benchmarking.
FER2013: Over 35,000 grayscale images, used in the ICML 2013 Challenges in Representation Learning, covering seven emotions.
AffectNet: Over 1 million images, annotated with emotion categories and valence-arousal dimensions, suitable for "in-the-wild" scenarios.
Emotion Recognition in the Wild: Focuses on uncontrolled environments, challenging for robust model training.
Surrey Audio-Visual Expressed Emotion: Audio-visual recordings from male speakers, useful for multimodal FER.
Emognition: A newer dataset with ten emotions, including amusement and awe, used in recent studies for expanded emotion classification.
I also mentioned using DeepFace, a Python library that wraps state-of-the-art models like VGG-Face, FaceNet, OpenFace, DeepID, ArcFace, Dlib, SFace, GhostFaceNet, and Buffalo_L for facial attribute analysis, including emotion recognition. DeepFace employs pre-trained CNNs, likely trained on datasets like FER2013, to classify emotions with high accuracy. This approach saved me countless hours of compute by avoiding training models from scratch, leveraging the "heavy lifting" done by researchers who developed and optimized these models on large datasets. DeepFace's real-time capability and ease of integration align with the speed and simplicity I wanted at the heart of my project.
Uniqueness About My Project: I prioritized real-time performance and simplicity. Unlike systems needing custom training or powerful hardware, I leaned on DeepFace's pre-trained models for quick deployment and solid accuracy. I also added face tracking, age estimation, emotion confidence scores, and a log to review the output, making it more practical and transparent than many prior efforts. By comparing my two program files, "vision_live_5.py" and "vision_live_6.py", I explored the trade-offs between simplicity and complexity, something a lot of research glosses over. Just because a system is more complex does not necessarily mean it will produce better results. That theme of added complexity without added accuracy is another area I wanted to highlight in this project.
Approach & Design:
In this section, I'll describe my design decisions behind the live emotion detection system implemented in vision_live_5.py and vision_live_6.py. The system processes a live webcam feed to detect, track, and analyze faces, outputting a video stream annotated with face IDs, emotions, intensity levels, ages, and emotion trends. Below, I break down each component of the pipeline, explaining the algorithms, tools, inputs, outputs, and rationale, with relevant code snippets from both versions to illustrate the implementation.
Input: A live video feed captured from a webcam.
Output: A real-time video stream with bounding boxes drawn around detected faces, each labeled with a unique ID, the dominant emotion, its intensity (High, Medium, or Low based on confidence scores), an estimated age, and, in some cases, an emotion trend over time. In vision_live_6.py, additional outputs include face recognition results and 3D facial visualizations.
The system is designed for real-time performance, balancing accuracy and computational efficiency. It leverages dlib for face detection, DeepFace for emotion and age analysis, and OpenCV for visualization, with threading to ensure smooth video processing. vision_live_6.py extends the core functionality with face recognition and 3D face reconstruction, adding complexity but also richer outputs.
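The producer/consumer threading design described above can be sketched with Python's standard queue and threading modules. The worker below is a stand-in that echoes a result instead of calling DeepFace, and the queue names and payload shapes are illustrative assumptions, not lifted from either script.

```python
import queue
import threading

def analysis_worker(frame_queue, result_queue):
    # Pull frames submitted by the main loop, push results back, exit on None.
    while True:
        frame = frame_queue.get()
        if frame is None:  # sentinel value signals shutdown
            break
        # A real system would call DeepFace.analyze(frame, ...) here;
        # this stand-in just echoes a summary dict.
        result_queue.put({"frame_id": frame["id"], "status": "analyzed"})

frame_queue = queue.Queue(maxsize=1)   # keep only the freshest frame pending
result_queue = queue.Queue()

worker = threading.Thread(target=analysis_worker,
                          args=(frame_queue, result_queue), daemon=True)
worker.start()

frame_queue.put({"id": 1})             # main loop submits a frame...
result = result_queue.get(timeout=2)   # ...and collects the result when ready
frame_queue.put(None)                  # ask the worker to shut down
worker.join()
```

Bounding the frame queue at one entry means a slow analysis pass drops stale frames instead of building up a backlog, which is why the video stays smooth even when analysis lags.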
Face Detection: The system uses dlib’s frontal face detector, a robust and widely-used tool based on Histogram of Oriented Gradients (HOG) and a linear classifier. Each incoming video frame is converted to grayscale to reduce computational load and improve detection reliability. The dlib detector scans the grayscale frame and returns bounding boxes (x, y, width, height) for each detected face. These boxes serve as the foundation for tracking and analysis. Dlib’s detector is chosen for its excellent balance of speed and accuracy, making it suitable for real-time applications. It performs well under moderate lighting conditions and with frontal or near-frontal faces, which aligns with the system’s assumptions. Compared to alternatives like OpenCV’s Haar cascades, dlib’s HOG-based approach is more robust to variations in pose and expression.
import cv2
import dlib

# Initialize dlib's HOG-based frontal face detector
detector = dlib.get_frontal_face_detector()
# Convert the frame to grayscale to reduce computational load
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Detect faces; returns a list of dlib.rectangle bounding boxes
faces = detector(gray)
Preprocessing: Histogram equalization is applied to enhance frame quality before analysis. Each frame is converted to grayscale, and histogram equalization is used to normalize the brightness and increase contrast. The enhanced grayscale frame is then converted back to a BGR color format for compatibility with DeepFace’s analysis pipeline. Preprocessing improves the robustness of DeepFace’s emotion and age analysis, particularly in challenging lighting conditions. Histogram equalization is lightweight and effective, making it ideal for real-time applications where computational resources are limited.
# Apply lightweight preprocessing to enhance frame quality for analysis.
def preprocess_frame(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    enhanced = cv2.equalizeHist(gray)
    return cv2.cvtColor(enhanced, cv2.COLOR_GRAY2BGR)
Face Tracking: The system assigns a unique ID to each detected face and tracks it across frames using a simple distance-based matching algorithm. For each detected face, the center of its bounding box is computed, and the Euclidean distance to the centers of previously tracked faces is calculated. If the distance is below a 50-pixel threshold, the face is matched to an existing tracked face; otherwise, a new TrackedFace object is created with a new ID. Faces that are not detected for 10 consecutive frames are removed to prevent memory buildup. Tracking ensures continuity of identity in a dynamic video feed, where faces may move, temporarily disappear (due to occlusion), or reappear. The 50-pixel threshold is a heuristic chosen for its simplicity and effectiveness in typical webcam resolutions, assuming moderate movement between frames. The TrackedFace class encapsulates all relevant data (ID, bounding box, emotion history, etc.), making it easy to manage multiple faces.
# Match detected faces to tracked faces
for face in faces:
    x, y, w, h = face.left(), face.top(), face.width(), face.height()
    center = (x + w / 2, y + h / 2)
    matched_tf = None
    min_dist = float('inf')
    # Find the closest tracked face within the distance threshold
    for tf in tracked_faces:
        tf_center = (tf.bbox[0] + tf.bbox[2] / 2, tf.bbox[1] + tf.bbox[3] / 2)
        dist = np.linalg.norm(np.array(center) - np.array(tf_center))
        if dist < min_dist and dist < 50:  # Threshold of 50 pixels
            min_dist = dist
            matched_tf = tf
    if matched_tf:
        # Update the existing tracked face
        matched_tf.bbox = (x, y, w, h)
        matched_tf.last_seen = frame_count
        matched_tf.missed_frames = 0
    else:
        # Register a new tracked face
        # (note: len()-based IDs can repeat after removals; a monotonic counter would avoid this)
        new_id = len(tracked_faces) + 1
        tracked_faces.append(TrackedFace(new_id, (x, y, w, h)))

# Update missed frames and remove lost faces
for tf in tracked_faces[:]:
    if tf.last_seen < frame_count:
        tf.missed_frames += 1
        if tf.missed_frames > 10:  # Remove after 10 missed frames
            tracked_faces.remove(tf)
Emotion & Age Analysis: For emotion and age analysis, I used DeepFace, a deep learning framework that uses pre-trained CNNs for facial analysis. Every 5 frames, a preprocessed frame is sent to a separate thread for analysis to avoid blocking the main video loop. DeepFace analyzes each detected face for its seven emotion classes (angry, disgust, fear, happy, sad, surprise, neutral) and age, using the OpenCV detector backend for consistency with the preprocessing pipeline. The analysis thread returns emotion labels, confidence scores, and age estimates. The enforce_detection=False setting ensures that analysis proceeds even if face detection is imperfect, improving robustness. DeepFace is a powerful, off-the-shelf tool that provides high-quality emotion and age predictions without requiring custom model training. The 5-frame interval balances accuracy with performance, as analyzing every frame would be too slow for real-time use. Threading keeps the video feed smooth, with analysis running in parallel.
def analyze_emotion_thread(frame_queue, result_queue):
    # Perform emotion, age, 3D reconstruction, and embedding analysis in a separate thread.
    while True:
        try:
            frame = frame_queue.get(timeout=1)
            if frame is None:  # Signal to exit thread
                break
            # Analyze emotions and age (no gender)
            results = DeepFace.analyze(frame, actions=['emotion', 'age'],
                                       detector_backend='opencv', enforce_detection=False)
            analysis_results = []
            for result in results:
                region = result['region']              # Bounding box from DeepFace
                dominant = result['dominant_emotion']  # Most prominent emotion
                scores = result['emotion']             # Emotion scores dictionary
                age = result['age']                    # Estimated age
                # Crop face for 3D reconstruction and embedding (vision_live_6.py only)
                x, y, w, h = region['x'], region['y'], region['w'], region['h']
                face_img = frame[y:y+h, x:x+w]
                vertices = reconstruct_face(face_img)  # 3D reconstruction
                embedding = DeepFace.represent(face_img, model_name='Facenet')[0]['embedding']
                analysis_results.append((region, dominant, scores, age, vertices, embedding))
            result_queue.put(analysis_results)
        except queue.Empty:
            continue  # Wait for the next frame if the queue is empty
        except Exception as e:
            result_queue.put([{"error": str(e)}])  # Report errors to the main loop
Intensity & Trends: The system categorizes the confidence score of the dominant emotion into three levels: High (>70%), Medium (>40%), or Low (≤40%). This provides a qualitative measure of how confident the model is in its prediction. Each TrackedFace object maintains a history of the last 10 emotions in a deque. Once at least 5 emotions are recorded, the most common emotion (determined using Counter) is displayed as the trend, offering insight into the face’s emotional state over time. Intensity adds interpretability to the raw confidence scores, making the output more user-friendly. Trends provide temporal context, helping to distinguish fleeting expressions from persistent emotional states. The 10-emotion history and 5-sample threshold are chosen to balance responsiveness with stability.
def get_intensity(score):
    # Map the confidence score of the dominant emotion to a qualitative level.
    if score > 70:
        return "High"
    elif score > 40:
        return "Medium"
    else:
        return "Low"

# In the main loop, for displaying trends:
if len(tf.emotion_history) >= 5:
    # Show a trend once at least 5 samples are recorded
    trend = Counter(tf.emotion_history).most_common(1)[0][0]
    text += f" (Trend: {trend})"
Visualization: OpenCV is used to draw colored bounding boxes around each tracked face, with text overlays showing the face ID, dominant emotion, intensity, age, and trend (if available). Each face is assigned a random color for visual distinction. In vision_live_6.py, the frame is resized to fit a resizable window (800x600 pixels), and 3D face models are displayed in separate matplotlib windows (non-blocking). Face recognition results are also included in the text overlay. Visualization makes the system's outputs intuitive and immediate. Colored boxes and text overlays allow users to quickly understand the analysis results, while the resizable window in vision_live_6.py improves usability. The 3D visualization demonstrates the potential for richer output formats.
Experiments:
Hardware: A laptop with a built-in webcam.
Software: Python, OpenCV, dlib, DeepFace.
Conditions: A well-lit room, 1-2 people.
Criteria:
Speed: Frames per second (FPS).
Accuracy: How well emotions matched my observations.
Efficiency: Handling multiple faces without lag.
Experiment 1: Speed Test
Results:
Vision_live_5.py: ~15 FPS—smooth and snappy.
Vision_live_6.py: ~5 FPS—laggy, thanks to 3D reconstruction.
Learned: Simpler is faster for live use.
Experiment 2: Comparing The Two System Outputs
Results:
Both nailed obvious emotions (happy, angry).
Vision_live_5.py updated faster, feeling more responsive.
Vision_live_6.py occasionally misfired due to delays.
Learned: Extra features didn’t boost accuracy.
Experiment 3: Multi-Face Handling
Tested efficiency with 2+ faces: had two people in frame while I watched for lag.
Results:
Vision_live_5.py: Handled it like a champ.
Vision_live_6.py: Stuttered with multiple faces.
Learned: Complexity kills efficiency. Vision_live_5.py is the optimal program, excelling at speed, efficiency, and accuracy.
Conclusion:
Through rigorous testing on live video feeds, I evaluated both systems based on frame rate (frames per second, FPS), emotion detection accuracy, and overall efficiency. The results revealed that vision_live_5.py outperformed its counterpart, achieving a frame rate of approximately 15 FPS compared to vision_live_6.py’s slower 5 FPS. Furthermore, vision_live_5.py demonstrated superior efficiency and accuracy, making it better suited for real-time applications. While vision_live_6.py’s advanced features offered potential for specialized use cases, they did not significantly enhance core emotion detection performance, highlighting the trade-offs between complexity and practicality. These findings underscore the importance of optimizing for speed and reliability in real-time systems, particularly when deploying solutions in dynamic, real-world settings such as customer service, mental health monitoring, or interactive human-computer interfaces.
This project not only demonstrates the feasibility of real-time emotion and age detection but also contributes to the broader discourse on human-computer interaction. By providing a system that is both accessible—thanks to DeepFace’s pre-trained models—and transparent through confidence scores and trend analysis, it paves the way for applications that enhance AI’s ability to understand and respond to human emotions. The comparison between vision_live_5.py and vision_live_6.py offers valuable insights into the balance between innovation and efficiency, informing future developments in computer vision and affective computing.
Limitations:
Lighting - Fumbles in dim conditions.
Angles - Struggles with side profiles.
Emotions - Stuck to DeepFace’s basic set (no “confused” or “bored”).
Future Work:
Handle low light and weird angles better.
More Emotions: Add nuanced categories.
Optimize vision_live_6.py: If speed improves, 3D and recognition could shine in niche cases.