TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction
Kojiro Takeyama1,2, Yimeng Liu1, Misha Sra1
1: University of California Santa Barbara, 2: Toyota Motor North America
Accurate prediction of human behavior is crucial for AI systems to effectively support real-world applications, such as autonomous robots anticipating and assisting with human tasks. Real-world scenarios frequently present challenges such as occlusions and incomplete scene observations, which can compromise predictive accuracy; as a result, traditional video-based methods, with their limited temporal and spatial perspectives, often struggle. Large Language Models (LLMs) offer a promising alternative: having been trained on large text corpora describing human behaviors, they likely encode plausible sequences of human actions in a home environment. However, because LLMs are trained primarily on text, they lack inherent spatial awareness and real-time environmental perception, and they struggle to reason about physical constraints and spatial geometry. To make LLM-based prediction effective in real-world spatial scenarios, we propose a multimodal prediction framework that enhances LLM-based action prediction by integrating physical constraints derived from human trajectories. Our experiments demonstrate that combining LLM predictions with trajectory data significantly improves overall prediction performance. This improvement is particularly notable when the LLM receives limited scene information, highlighting the complementary nature of linguistic knowledge and physical constraints in understanding and anticipating human behavior.
Figure 1. Overview of our approach
Figure 1 illustrates an overview of our approach. We propose a multimodal action prediction framework that incorporates both an LLM and human trajectories. The core idea is to integrate two complementary perspectives, physical and semantic, through an object-based action prediction framework. Our framework consists of two primary steps: target object prediction and action prediction.
Target object prediction: In the target object prediction step, we first utilize an LLM to predict a person's target object based on the input scene context, generating a probability distribution over potential objects in the room from a semantic perspective. We then incorporate the person's past trajectory to infer their likely destination, applying physical constraints to refine the prediction of the next target object.
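How exactly the two distributions are combined is detailed in the paper; the snippet below is only a minimal sketch, assuming a simple product-of-experts fusion of the LLM's semantic distribution with a Gaussian trajectory likelihood over object locations. The trajectory heuristic, object names, and all numbers are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def trajectory_likelihood(object_positions, traj, sigma=1.0):
    """Toy physical likelihood: objects closer to the person's linearly
    extrapolated destination score higher (an illustrative heuristic,
    not the paper's exact trajectory model)."""
    heading = traj[-1] - traj[-2]                 # most recent motion direction
    predicted_dest = traj[-1] + heading           # naive linear extrapolation
    dists = np.linalg.norm(object_positions - predicted_dest, axis=1)
    scores = np.exp(-dists**2 / (2 * sigma**2))   # Gaussian falloff with distance
    return scores / scores.sum()

def fuse(p_llm, p_traj):
    """Combine the semantic (LLM) and physical (trajectory) distributions.
    A simple product-of-experts fusion is assumed here."""
    p = p_llm * p_traj
    return p / p.sum()

# Hypothetical example: three candidate objects in the room.
objects = ["refrigerator", "sofa", "sink"]
object_positions = np.array([[4.0, 1.0], [1.0, 3.0], [4.5, 1.5]])  # object locations (m)
traj = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.2]])              # recent person positions (m)

p_llm = np.array([0.5, 0.3, 0.2])            # hypothetical LLM output distribution
p_traj = trajectory_likelihood(object_positions, traj)
p_fused = fuse(p_llm, p_traj)
print(dict(zip(objects, np.round(p_fused, 3))))
```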
Action prediction: At this stage, we employ the LLM to predict the action corresponding to the target object identified in the previous step. The input to the LLM includes the scene context and the predicted target object, and the LLM outputs the most plausible action the person is likely to perform within the given scene context.
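As a sketch of this step, the snippet below assembles an action-prediction prompt from the scene context and the predicted target object. The prompt wording and the example inputs are illustrative assumptions, not the paper's actual prompt.

```python
def build_action_prompt(scene_context: str, target_object: str) -> str:
    """Assemble the action-prediction prompt (wording is illustrative)."""
    return (
        "You are predicting a person's next action in a home environment.\n"
        f"Scene context:\n{scene_context}\n\n"
        f"The person's predicted target object is: {target_object}.\n"
        "Output the single most plausible action the person will perform "
        "with this object, as a short verb phrase."
    )

# Hypothetical scene context and predicted target object.
prompt = build_action_prompt(
    scene_context="The person has just finished cooking and mentioned being thirsty.",
    target_object="refrigerator",
)
print(prompt)  # the prompt can then be sent to any chat-completion LLM
```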
We employed ChatGPT-4o to evaluate prediction performance, using three distinct types of input data to assess the model's robustness under degraded input conditions. Specifically, the inputs included: (1) full scene context from the evaluation data, (2) scene context with conversation excluded, and (3) scene context with conversation and past action history excluded. These input variations were designed to simulate real-world scenarios where certain observable information may be unavailable.
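A minimal sketch of how the three input conditions could be derived from a full scene description is shown below; the field names and example values are hypothetical, for illustration only.

```python
def make_ablated_inputs(scene_context: dict) -> dict:
    """Derive the three evaluation conditions from a full scene description."""
    full = dict(scene_context)                                                   # (1)
    no_conv = {k: v for k, v in full.items() if k != "conversation"}             # (2)
    no_conv_hist = {k: v for k, v in no_conv.items() if k != "action_history"}   # (3)
    return {
        "full_context": full,
        "no_conversation": no_conv,
        "no_conversation_or_history": no_conv_hist,
    }

# Hypothetical scene description.
scene = {
    "room_layout": "kitchen with refrigerator, sink, and sofa",
    "conversation": "Person says they are thirsty.",
    "action_history": ["finished cooking", "washed hands"],
}
for condition, ctx in make_ablated_inputs(scene).items():
    print(condition, "->", list(ctx.keys()))
```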
Quantitative results:
Figure 2 illustrates the accuracy of target object prediction. While the accuracy of LLM-based predictions decreases as input scene information is reduced, our method significantly mitigates this degradation. Furthermore, our method outperforms both the standalone LLM and the trajectory-based predictions, demonstrating a synergistic improvement from integrating LLM and trajectory data.
Figure 3 presents the accuracy of action prediction based on the predicted target object. A similar trend is observed as in target object prediction. Note that overall accuracy is lower than for target object prediction because action prediction is a subsequent process that depends on the target object prediction.
Figure 2. Target object prediction
Figure 3. Action prediction
Qualitative results:
Figure 4 presents the results of target object prediction using our proposed method. In contrast to predictions using either the LLM or trajectory data alone, which retain some uncertainty, our method narrows down the target object faster and more accurately. This indicates that by leveraging the complementary perspectives of the LLM's semantic understanding and the trajectory's physical context, the ground-truth target object is identified more effectively.
Figure 4. Qualitative result
- B.1 LLM-based Prediction
- B.2 Trajectory-based Prediction
- C.1 Scenes
- C.2 Trajectory Generation
@article{takeyama2024tr,
title={TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction},
author={Takeyama, Kojiro and Liu, Yimeng and Sra, Misha},
journal={arXiv preprint arXiv:2410.03993},
year={2024}
}