TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction
Kojiro Takeyama1,2, Yimeng Liu1, Misha Sra1
1: University of California Santa Barbara, 2: Toyota Motor North America
External links: [github repo] [arxiv]
Accurate prediction of human behavior is crucial for AI systems to effectively support real-world applications, such as autonomous robots anticipating and assisting with human tasks. Real-world scenarios frequently present challenges such as occlusions and incomplete scene observations, which can compromise predictive accuracy. As a result, traditional video-based methods often struggle due to their limited temporal and spatial perspectives. Large Language Models (LLMs) offer a promising alternative. Having been trained on a large text corpus describing human behaviors, LLMs likely encode plausible sequences of human actions in a home environment. However, because LLMs are trained primarily on text data, they lack inherent spatial awareness and real-time environmental perception, and they struggle with understanding physical constraints and spatial geometry. To make LLM-based prediction effective in real-world spatial scenarios, we propose a multimodal prediction framework that enhances LLM-based action prediction by integrating physical constraints derived from human trajectories. Our experiments demonstrate that combining LLM predictions with trajectory data significantly improves overall prediction performance. This enhancement is particularly notable in situations where the LLM receives limited scene information, highlighting the complementary nature of linguistic knowledge and physical constraints in understanding and anticipating human behavior.
Figure 1. Overview of our approach
Figure 1 illustrates an overview of our approach. We propose a multimodal action prediction framework that incorporates both an LLM and human trajectories. The core idea is to integrate two different perspectives, physical and semantic, through an object-based action prediction framework. Our framework consists of two primary steps: target object prediction and action prediction.
Target object prediction: In the target object prediction step, we first use an LLM to predict a person's target object from the input scene context, generating a probability distribution over potential objects in the room from a semantic perspective. We then incorporate the person's past trajectory to infer their likely destination, applying physical constraints to refine the prediction of the next target object. The model that infers the destination area from the past trajectory is trained on the LocoVR dataset.
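This summary does not spell out the exact fusion rule, so the snippet below is only a minimal sketch of one plausible implementation: it assumes the LLM returns a probability per candidate object and the trajectory model returns a goal-likelihood heatmap over the floor map, and all function and variable names are illustrative rather than taken from our code.

import numpy as np

def fuse_object_probabilities(llm_probs, goal_heatmap, object_cells):
    """Combine LLM semantic probabilities with a trajectory-based goal heatmap.

    llm_probs:    {object_name: probability from the LLM (semantic cue)}
    goal_heatmap: 2D numpy array over the floor map, predicted from the past
                  trajectory (e.g., by a model trained on LocoVR)
    object_cells: {object_name: (row, col) of the object on the floor map}
    """
    fused = {}
    for name, p_semantic in llm_probs.items():
        row, col = object_cells[name]
        p_physical = float(goal_heatmap[row, col])   # physical cue at the object's location
        fused[name] = p_semantic * p_physical        # simple product-of-experts fusion
    total = sum(fused.values())
    if total == 0.0:                                 # fall back to the LLM distribution
        return dict(llm_probs)
    return {name: p / total for name, p in fused.items()}

Under this reading, the predicted target object is simply the highest-probability entry of the fused distribution.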
Action prediction: At this stage, we employ the LLM to predict the action corresponding to the target object identified in the previous step. The input to the LLM includes the scene context and the predicted target object, and the LLM outputs the most plausible action for the person within the given scene context.
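As a concrete illustration, the call below sketches this step using the OpenAI chat-completions API with gpt-4o-mini (the model used in our evaluation); the prompt wording is a hypothetical example, not the exact prompt used in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_action(scene_context: str, target_object: str) -> str:
    """Ask the LLM for the most plausible action on the predicted target object."""
    prompt = (
        f"Scene context:\n{scene_context}\n\n"
        f"The person is heading toward the {target_object}. "
        "What is the most plausible action they will perform there? "
        "Answer with a short action phrase."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()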
We employed gpt4o-mini to evaluate prediction performance, using three distinct types of input data to assess the model's robustness under degraded input conditions. Specifically, the inputs included: (1) full scene context from the evaluation data, (2) scene context with conversation excluded, and (3) scene context with conversation and past action history excluded. These input variations were designed to simulate real-world scenarios where certain observable information may be unavailable.
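For clarity, the sketch below shows one way these three input variants could be constructed from a scene-context record; the field names ("conversation", "action_history", etc.) are assumptions for illustration, not the dataset's actual schema.

def build_input_variants(scene_context: dict) -> dict:
    """Create the three input conditions used to probe robustness.

    scene_context is assumed to be a dict of text fields, e.g.
    {'room_layout': ..., 'objects': ..., 'conversation': ..., 'action_history': ...}.
    """
    def drop(keys):
        return {k: v for k, v in scene_context.items() if k not in keys}

    return {
        "full": dict(scene_context),                            # (1) full scene context
        "no_conversation": drop({"conversation"}),              # (2) conversation removed
        "no_conversation_no_history": drop(                     # (3) conversation and
            {"conversation", "action_history"}),                #     past actions removed
    }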
Quantitative results:
Figure 2 presents the accuracy of target object prediction. We draw the following conclusions:
(a) Our method outperforms both LLM and VLM across GPT and Llama models. Notably, as the available text-based scene context decreases, the prediction accuracy of both LLM and VLM deteriorates significantly, whereas our method maintains robust performance. This demonstrates that the physical cues extracted from trajectories effectively compensate for the reduced semantic context.
(b) VLM performance is comparable to, or slightly better than, that of the LLM, and both degrade similarly as the available scene context decreases. These results suggest that pre-trained VLMs may struggle to solve physical problems under complex geometric constraints, which could limit their added value when combined with text-based scene contexts. This is likely because pre-trained VLMs are primarily trained to align images with text rather than to predict physical quantities under such constraints.
(c) Notably, our method's performance significantly exceeds the average of the standalone LLM and Trajectory approaches, indicating that integrating LLM and trajectory information effectively compensates for individual weaknesses while leveraging their respective strengths.
Figure 3 displays the performance on action prediction. A similar trend to that observed in target object prediction is evident, emphasizing that the target object is a key cue for forecasting future actions. These results underscore the effectiveness of our approach compared to the other baselines.
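Assuming the accuracy reported in Figures 2 and 3 is top-1 accuracy over each method's predicted distribution (an assumption on our part, not a statement of the exact metric), the computation is straightforward:

def top1_accuracy(predicted_dists, ground_truth):
    """Fraction of samples whose highest-probability prediction matches the label.

    predicted_dists: list of {candidate: probability} dicts, one per sample
    ground_truth:    list of ground-truth labels (target objects or actions)
    """
    hits = sum(
        max(dist, key=dist.get) == label
        for dist, label in zip(predicted_dists, ground_truth)
    )
    return hits / len(ground_truth)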
Figure 2. Performance on target object prediction
Figure 3. Performance on action prediction
Qualitative results:
Figure 4 displays the target object prediction results using the LLM, the VLM, the standalone trajectory-based method (Trajectory), and our proposed method (LLM+Trajectory). The red point indicates the starting position of the trajectory, while the green line and point represent the observed trajectory and the current position, respectively. The yellow distribution illustrates the predicted target area based on the observed trajectory, while the objects, color-coded from blue to red, indicate predicted target object probabilities from low to high.
The figure indicates that the LLM assigns high probabilities to several objects, including the target; however, some mispredictions persist due to the inherent difficulty of inferring a person's intentions solely from the semantic scene context. Similarly, the VLM produces some incorrect predictions because of its limited capability for physically aware prediction. Although the exact mechanism remains unclear for the pre-trained model, we observe that the VLM tends to assign relatively high probabilities to objects near the observed trajectory, suggesting that it does not actually predict the future target area. In contrast, the standalone trajectory-based target object prediction assigns high probabilities broadly around the ground-truth target area, yet it struggles to accurately pinpoint the target object, especially when the trajectory has just begun. By leveraging both semantic and physical cues, our method narrows down the target object more effectively than using either the LLM or trajectory data alone.
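For readers who want to reproduce a similar visualization, here is a rough matplotlib sketch of the color scheme described above; it assumes the trajectory and heatmap share the same grid coordinates and is not the authors' plotting code.

import numpy as np
import matplotlib.pyplot as plt

def plot_prediction(traj, goal_heatmap, object_xy, object_probs):
    """Figure 4 style plot: start (red), trajectory (green),
    predicted target area (yellow), object probabilities (blue to red).

    traj:         (N, 2) array of observed (x, y) positions in grid coordinates
    goal_heatmap: 2D array indexed [y, x] with trajectory-predicted goal likelihoods
    object_xy:    {name: (x, y)}; object_probs: {name: probability}
    """
    fig, ax = plt.subplots()
    ax.imshow(goal_heatmap, origin="lower", cmap="YlOrBr", alpha=0.6)   # predicted target area
    ax.plot(traj[:, 0], traj[:, 1], color="green", linewidth=2)         # observed trajectory
    ax.scatter(*traj[0], color="red", zorder=3)                         # starting position
    ax.scatter(*traj[-1], color="green", zorder=3)                      # current position
    names = list(object_xy)
    xs, ys = zip(*(object_xy[n] for n in names))
    probs = [object_probs[n] for n in names]
    ax.scatter(xs, ys, c=probs, cmap="coolwarm", s=80, zorder=4)        # blue (low) to red (high)
    plt.show()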
Figure 4. Qualitative results
@article{takeyama2024tr,
title={TR-LLM: Integrating Trajectory Data for Scene-Aware LLM-Based Human Action Prediction},
author={Takeyama, Kojiro and Liu, Yimeng and Sra, Misha},
journal={arXiv preprint arXiv:2410.03993},
year={2024}
}