Summary: The experiments are conducted on the ETH Hotel and ETH University (paper, data) and the UCY Zara01 and Zara02 (paper, data) datasets. Each is recorded at 25 frames per second (fps), while annotations are sampled at 2.5 fps. Together, the datasets comprise approximately 4,000 frames and roughly 1,600 agents following both linear and non-linear trajectories, including agents walking alone, in social groups, or passing by one another. We assume that the locations of all agents are known in advance at each time step and that we have access to the video frames in RGB format. Following previous work (paper 1), (paper 2), we combine these datasets and apply a leave-one-out procedure during training for both dynamic components D and B.
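To make the preprocessing concrete, the following is a minimal sketch of the downsampling (25 fps to 2.5 fps) and the leave-one-out combination described above. It assumes each dataset is a list of annotated frames; the dataset names and the `load_annotations` helper are hypothetical placeholders, not the authors' actual pipeline.

```python
# Hedged sketch of the data setup described above; `load_annotations` and the
# dataset names are illustrative assumptions, not the paper's code.
from typing import Dict, List

RAW_FPS = 25         # recording rate of the source videos
SAMPLE_FPS = 2.5     # rate at which annotations are sampled
STRIDE = int(RAW_FPS / SAMPLE_FPS)  # keep every 10th frame

DATASETS = ["eth_hotel", "eth_university", "zara01", "zara02"]

def downsample(frames: List[dict]) -> List[dict]:
    """Subsample 25 fps frames down to 2.5 fps by keeping every STRIDE-th frame."""
    return frames[::STRIDE]

def leave_one_out_splits(data: Dict[str, List[dict]]):
    """Yield (train, test, held_out_name): train on three datasets, test on the fourth."""
    for held_out in data:
        train = [f for name, frames in data.items() if name != held_out for f in frames]
        yield train, data[held_out], held_out

# Usage: combine the four datasets and iterate over leave-one-out folds.
# data = {name: downsample(load_annotations(name)) for name in DATASETS}
# for train, test, held_out in leave_one_out_splits(data):
#     ...  # train on `train`, evaluate on the held-out dataset
```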
Appendix
A comparison of model performance in street scenes. The first two rows show results on ETH University and ETH Hotel respectively (paper, data), while the remaining rows are from UCY Zara01 and Zara02 (paper, data). Each agent's trajectory is observed for 4 steps (light blue) and then predicted for the next 12 (ground truth depicted in yellow). Each row corresponds to a different dataset and each column to a different model: red depicts a simple LSTM with a stochastic output, magenta shows the results of modelling only the social interactions between agents (Social LSTM), and green shows the results of this work. The proposed approach takes environmental cues into consideration and produces more consistent predictions.
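For reference, a hedged plotting sketch of the caption's protocol: a trajectory is split into 4 observed and 12 future steps and drawn with the colour scheme above. The `trajectory` array and the per-model `predictions` are placeholder inputs, not data or outputs from the paper.

```python
# Illustrative sketch mirroring the figure's colour scheme; inputs are
# placeholder arrays, not the paper's data or model outputs.
import numpy as np
import matplotlib.pyplot as plt

OBS_STEPS, PRED_STEPS = 4, 12  # observe 4 steps, predict the next 12

def plot_comparison(trajectory: np.ndarray, predictions: Dict[str, np.ndarray]):
    """trajectory: (16, 2) ground-truth positions; predictions: model name -> (12, 2)."""
    observed, future = trajectory[:OBS_STEPS], trajectory[OBS_STEPS:]
    plt.plot(*observed.T, color="lightblue", label="observed (4 steps)")
    plt.plot(*future.T, color="gold", label="ground truth (12 steps)")
    colors = {"LSTM": "red", "Social LSTM": "magenta", "ours": "green"}
    for name, pred in predictions.items():
        plt.plot(*pred.T, color=colors.get(name, "gray"), label=name)
    plt.legend()
    plt.show()

from typing import Dict  # placed here for clarity; in practice import at the top
```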