Crowds Prediction

Summary: The experiment is conducted using the ETH Hotel, ETH University (paper, data) and Zara01, Zara02 (paper, data) datasets. Each of these is recorded at 25 frames per second (fps) while the data acquisition is applied at 2.5fps. The datasets comprise approximately 4000 frames and contain approximately 1600 agents that follow both linear and non-linear trajectories. These include agents walking on their own, in social groups or passing by one another. We assume that we know in advance the locations of all agents at each time step and that we have access to the video arrays in an RGB format. We combine these datasets, following previous work (paper 1), (paper 2) and apply a leave-one-out procedure during training for both dynamic components D and B.

Architectural Details

Appendix

Component R

batch size: 50
number of epochs: 1200
input: batch_size x W=64 x H=64 x 3
VAE Embedding size: 96
learning rate: 0.001
optimiser: Adam
KL tolerance: 0.5

Component D

batch_size: 10
number of epochs: 100
RNN output size: 128
cell size: 768
Sequence length: 1 (10 during training D)
maximum number of agents in 1 frame: 40
Decay rate: 0.95
Learning rate: 0.003
Gradient clip: 10
temperature 𝜏 : training: 1.0, inference: 0.21
Mixture of Gaussians: 5
optimiser: Adam
VAE Embedding size: 96
l_t: sampled from mu and sigma obtained from a pre-trained R
input:
- batch_size x max_num_agents x 2; and
- batch_size x l_t

Component B

batch_size: 10
number of epochs: 100
LSTM output size: 128
cell size: 40x256
Sequence length at training: 8
maximum number of agents in 8 consecutive frames: 40
λ for L2 regularisation: 0.0005
Learning rate: 0.0005
Gradient clip: 10
optimiser: RMSProp
Embedding size: 64
input:
- batch_size x sequence_length x max_num_agents x 2; and
- batch_size x sequence_length x [l_t, h_t]

Visual Results

A comparison of model performance in street scenes. The first two rows show results on ETH University and ETH Hotel respectively (paper, data), while the remainder are from UCY Zara01 and Zara02 (paper, data). An agent's trajectory is tracked for 4 steps (light blue) and is then predicted for the next 12 (ground truth depicted in yellow). Each row represents a different dataset while each column is a different model. Red depicts a simple LSTM with a stochastic output, in magenta we show results of modelling only the social interactions between agents (Social LSTM) and green are the results from this work. The proposed approach takes environmental cues into consideration and maintains more consistent predictions.

Page updated

Google Sites

Report abuse