Project done as part of the course 11-785 (Introduction to Deep Learning) at Carnegie Mellon University (Spring 2023)
The ability to track individuals and their movements has become increasingly important in applications such as video surveillance, human-computer interaction, and robotics. However, simply detecting and tracking individuals is not always enough: collaboration and joint attention are crucial in many scenarios. To address detection in collaborative environments, the goal of this project was to integrate person tracking (locating and tracking individuals), re-identification (Re-ID: recognizing the same individual across frames of the input), and joint attention tracking (monitoring individuals for attention and engagement). The aim was to combine these three technologies (Detection + Re-ID + Attention) into a single system that enables us to better understand how individuals interact with each other and with their environment, and ultimately to improve collaboration and communication in various applications. I was responsible for object detection training and its connection to re-identification.
The overall timeline of the project began with baseline implementations of the individual modules, followed by the creation of a novel dataset, and finally the connection of all modules for end-to-end testing. The first step of this integrated framework was to establish the current benchmark for each individual module, referred to as the baseline implementation. The second step was to create a novel dataset for a classroom environment, including edge-case scenarios that require identification across multiple frames. The final step was to fine-tune the individual modules and integrate them into a single pipeline where data flows Input -> Detection -> Re-identification -> Attention Tracking, as sketched below.
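A minimal sketch of this data flow follows. The module interfaces here (`detector.detect`, `reid.assign_ids`, `attention.estimate`) are illustrative placeholders for the three fine-tuned modules, not the project's actual API:

```python
def process_frame(frame, detector, reid, attention):
    """One pass of the Input -> Detection -> Re-ID -> Attention pipeline.

    `detector`, `reid`, and `attention` stand in for the three modules;
    their method names are hypothetical.
    """
    boxes = detector.detect(frame)           # person bounding boxes
    tracks = reid.assign_ids(frame, boxes)   # [(person_id, bbox), ...] with IDs
                                             # kept consistent across frames
    results = []
    for person_id, bbox in tracks:
        # attention is estimated from the whole body crop, not just the face
        gaze_target = attention.estimate(frame, bbox)
        results.append((person_id, bbox, gaze_target))
    return results
```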
The evaluation phase was carried out alongside experimentation with hyperparameters and model fit for the individual modules. The final working model was assessed in two parts: qualitative and quantitative analysis. For the quantitative side, each of the three modules was tested with widely used standard evaluation metrics. The metrics and results are listed below; every metric improved in the fine-tuned model relative to its baseline:
Detection: mAP (mean Average Precision), Precision, and Recall (see the evaluation sketch after these scores).
Best baseline score after experimentation: [mAP – 92.4%, Precision – 86.0%, Recall – 90.2%]
Best fine-tuned score after experimentation: [mAP – 99.5% (~8% increase w.r.t. baseline), Precision – 97.1% (~13% increase w.r.t. baseline), Recall – 95.5% (~6% increase w.r.t. baseline)]
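For reference, a minimal sketch of how detection mAP could be computed with torchmetrics; the boxes, scores, and labels below are toy stand-ins, not the project's detector output:

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()

# One predicted person box (xyxy) with its confidence, and one ground truth.
preds = [{
    "boxes": torch.tensor([[50.0, 40.0, 200.0, 300.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([0]),   # single 'person' class
}]
targets = [{
    "boxes": torch.tensor([[55.0, 45.0, 198.0, 305.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # mAP averaged over IoU thresholds
```

Precision and recall at a fixed IoU threshold can be derived the same way, from the counts of matched, spurious, and missed boxes.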
Re-Identification: HOTA (Higher Order Tracking Accuracy), measured in the edge-case scenarios of group discussion, quick movement, and slow movement, with the scenario scores averaged into a combined score (see the sketch of how HOTA is composed after these scores).
Best baseline score after experimentation: [Combined HOTA – 82.8%]
Best fine-tuned score after experimentation: [Combined HOTA – 90.8% (~10% increase w.r.t. baseline)]
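For context, HOTA (Luiten et al., 2021) combines a detection accuracy (DetA) and an association accuracy (AssA) at each localization threshold alpha as HOTA_alpha = sqrt(DetA_alpha * AssA_alpha), then averages over thresholds. A toy sketch of that combination, with illustrative DetA/AssA values rather than the project's measurements:

```python
import numpy as np

alphas = np.arange(0.05, 1.0, 0.05)           # 19 localization thresholds
det_a = np.linspace(0.95, 0.70, len(alphas))  # toy detection accuracy per alpha
ass_a = np.linspace(0.92, 0.75, len(alphas))  # toy association accuracy per alpha

# Geometric mean balances detection and association, then average over alpha.
hota_per_alpha = np.sqrt(det_a * ass_a)
hota = hota_per_alpha.mean()
print(f"HOTA = {hota:.3f}")
```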
Joint Attention: AUC (Area Under the Curve), L2 distance, and mAP (mean Average Precision) (see the metric sketch after these scores).
Best baseline score after experimentation: [AUC – 0.8569, L2 Distance – 0.1539, mAP – 98.77%]
Best fine-tuned score after experimentation: [AUC – 0.8942 (~4% increase w.r.t. baseline), L2 Distance – 0.1322 (~14% decrease w.r.t. baseline), mAP – 98.78% (essentially unchanged w.r.t. baseline)]
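For context, a hedged sketch of how AUC and L2 distance are typically computed in gaze-following evaluation: AUC scores the predicted gaze heatmap against a one-hot ground-truth map, and L2 measures the normalized distance between the heatmap peak and the true gaze point. The heatmap and gaze point below are toy data, not the project's outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

H, W = 64, 64
pred_heatmap = np.random.rand(H, W)   # toy predicted gaze heatmap
gt_point = np.array([0.40, 0.55])     # ground-truth gaze, normalized to [0, 1]

# AUC: score every heatmap cell against a one-hot ground-truth map.
gt_map = np.zeros((H, W))
gt_map[int(gt_point[1] * H), int(gt_point[0] * W)] = 1
auc = roc_auc_score(gt_map.flatten(), pred_heatmap.flatten())

# L2: distance between the heatmap argmax and the ground truth, in [0, 1].
py, px = np.unravel_index(pred_heatmap.argmax(), pred_heatmap.shape)
pred_point = np.array([px / W, py / H])
l2 = np.linalg.norm(pred_point - gt_point)
print(f"AUC = {auc:.4f}, L2 = {l2:.4f}")
```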
To assess the qualitative improvements in the final connected framework, we highlight what failed in each baseline implementation and how fine-tuning addressed those drawbacks. In addition, a final demo replicates a typical classroom environment and makes the results visible.
In conclusion, we developed a unique model that integrates person tracking, re-identification, and gaze detection into a single framework. This integration of individual modules enables the model to run inference directly on raw video data without any annotation or pre-processing. Additionally, our model is the first to evaluate attention from body bounding boxes rather than faces alone, increasing its overall usefulness. The model can support qualitative analysis of classroom environments and help professors and educators gain a deeper understanding of the learning process. Our next step is to quantify the complete model's output and reduce the redundancy across the multiple data modules.