**All data collection procedures were approved by the Institutional Review Board (IRB) of New York University Abu Dhabi. All participants provided informed consent.**
The dataset was collected during online lectures (30–90 minutes each) on Artificial Intelligence and Mathematics, delivered via Zoom under in-the-wild conditions.
Total Clips: 8,472 (10-second duration each)
Training Set: 4,978 clips
Test Set: 3,494 clips
Note: The splits are participant-independent. No participant appears in both the training and test sets.
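The participant-independence property and the clip counts above can be sanity-checked with a few lines of Python. This is a minimal sketch: the participant IDs and the per-clip ID lists are hypothetical placeholders, since the released metadata format is not described here.

```python
# Clip counts stated in the dataset description.
TOTAL_CLIPS = 8472
TRAIN_CLIPS = 4978
TEST_CLIPS = 3494


def shared_participants(train_ids, test_ids):
    """Return participant IDs present in both splits (empty set if disjoint)."""
    return set(train_ids) & set(test_ids)


# Hypothetical per-clip participant IDs; real IDs would come from the metadata.
train_participants = ["p01", "p01", "p02", "p03"]
test_participants = ["p04", "p05", "p05"]

overlap = shared_participants(train_participants, test_participants)
if overlap:
    raise ValueError(f"participants in both splits: {sorted(overlap)}")

# The two splits should account for every released clip.
assert TRAIN_CLIPS + TEST_CLIPS == TOTAL_CLIPS
```

If a validation set is carved out of the training data, the same check is worth applying there, so that validation scores reflect generalization to unseen participants.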
Each clip includes synchronized multimodal information, so challenge participants can train either unimodal models that use only student behavior or multimodal models that also leverage instructor and lecture signals.
video_cascade: A composite spatial layout providing classroom context at 1280 × 1280 resolution (with letterbox padding). It includes:
- Student video stream
- Instructor video stream
- Screen-shared lecture content stream
- Synchronized audio stream
student_only: The isolated webcam feed of the student, resized to 224 × 224 resolution using letterbox padding.
Personality Metadata: Trait metadata for both the student and the instructor (provided for the training set only).
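Letterbox padding, mentioned for both video_cascade and student_only, scales a frame to fit the target while preserving its aspect ratio and fills the remaining area with padding. The sketch below computes only the geometry. The function name, the source frame sizes, and the choice to center the image are assumptions for illustration, not details confirmed by the dataset description.

```python
def letterbox_geometry(src_w, src_h, dst_w, dst_h):
    """Compute the scaled size and padding needed to fit a frame into a target.

    Assumes the scaled image is centered; the dataset's actual placement
    and padding color are not specified.
    """
    # Use the smaller scale factor so the whole frame fits inside the target.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))

    # Split the leftover space between the two sides.
    pad_x = dst_w - new_w
    pad_y = dst_h - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "size": (new_w, new_h),
        "pad": (left, top, pad_x - left, pad_y - top),  # left, top, right, bottom
    }


# Hypothetical 640x480 webcam frame resized to the 224x224 student_only format.
student = letterbox_geometry(640, 480, 224, 224)

# Hypothetical 1280x720 composite fitted into the 1280x1280 video_cascade canvas.
cascade = letterbox_geometry(1280, 720, 1280, 1280)
```

Knowing the padding offsets is useful when cropping the padded borders away or when mapping detected face coordinates back to the original frame.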
Each 10-second clip is rated by 10 independent crowd-sourced annotators using a 5-point Likert scale (1 = very low engagement, 5 = very high engagement). Annotators rated multiple dimensions (Confused, Bored, Engaged, Focused, Interested), but Engaged is used as the ground-truth target variable for this challenge.
Regression Ground-Truth: The continuous engagement score is computed as the mean of the 10 annotator ratings.
Classification Ground-Truth: Binary labels are obtained by thresholding the continuous engagement score.
Quality Control: Only clips with weighted pairwise agreement > 0.6 (moderate-to-substantial inter-annotator agreement) are retained in the released dataset.
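The regression and classification targets described above can be sketched as follows. The threshold value and the use of `>=` at the boundary are assumptions, since neither is stated here. The function names and example ratings are likewise hypothetical.

```python
def engagement_score(ratings):
    """Mean of the 10 annotator ratings for the Engaged dimension."""
    if len(ratings) != 10:
        raise ValueError("expected exactly 10 annotator ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 Likert scale")
    return sum(ratings) / len(ratings)


def binary_label(score, threshold):
    """Threshold the continuous score; boundary handling (>=) is an assumption."""
    return int(score >= threshold)


# Hypothetical ratings for one clip and a placeholder threshold of 3.0.
ratings = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]
score = engagement_score(ratings)
label = binary_label(score, threshold=3.0)
```

Because the score is an average of ten integer ratings, it moves in steps of 0.1 between 1.0 and 5.0, which may matter when choosing a regression loss or comparing predictions near the classification threshold.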