Quickly loses any moving agents and maintains only the static, discrete cues.
Keeps more information for longer and is less prone to generating agents in random places.
Predicts 5 steps after observing only 3 on ETH Hotel.
Keeps some of the existing agents but is more likely to generate random agents (upper-right corner).
Here, we provide an illustration of the most frequently visited regions in the crowded-scene datasets, showing that Univ consists entirely of agents moving in a direction orthogonal to that of the training data.
This makes the challenge harder for conventional approaches unless they have been pre-trained on simulated or perturbed data. Alternatively, such examples can be used to examine the ability of a network to adapt to unseen data.
This can bring benefits in contexts where motion is determined by a specific plan that requires a drastic change in direction (see the robot gripper example) or in scenarios with adversaries or other out-of-distribution examples.
The InfoVAE is better at reconstructing the individual agents in a scene.
The results are less accurate; in fact, little to no motion is detected over time with the β-VAE.
Can’t deal with background
Can’t deal with shadows
Individual background subtraction omits many of the agents and cannot fully remove the individual noise. Additionally, its quality varies considerably across datasets.
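For reference, a minimal sketch of the background-subtraction variants compared here (a KNN subtractor, with and without a morphological opening step), using OpenCV. The kernel size and video path are illustrative assumptions, not the paper's settings.

```python
import cv2

# KNN background subtractor, as in the "original KNN" variant.
subtractor = cv2.createBackgroundSubtractorKNN()

# Small structuring element for the "opening KNN" variant (size is an assumption).
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

cap = cv2.VideoCapture("scene.avi")  # illustrative path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Morphological opening removes isolated noise pixels, but can also
    # erode small or partially occluded agents, as noted above.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cap.release()
```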
Evaluation of the generalisation performance of the proposed solution. In this experiment we trained on 3 out of the 4 crowd prediction datasets and rotated the 4th one by 0, 90, 180 and 270 degrees. Results for each orientation are plotted on a new row. Each column corresponds to the held-out 4th dataset, namely (in order) ETH Hotel, ETH Univ, UCY Zara01, UCY Zara02. For each plot, we observe 1.6 seconds and predict between 3.2 and 9.6 seconds ahead. Lower is better. RDB in red, LSTM in yellow, Social LSTM in blue. Overall, RDB performs consistently well over time.
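A minimal sketch of how the held-out dataset could be rotated for this experiment, assuming agent positions in the normalised [0, 1] pixel space mentioned later and rotation about the frame centre; both assumptions are ours, for illustration.

```python
import numpy as np

def rotate_trajectories(xy, degrees):
    """Rotate (N, 2) agent positions in normalised [0, 1] pixel space
    about the frame centre by a multiple of 90 degrees."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    centred = xy - 0.5            # move the rotation origin to the frame centre
    return centred @ rot.T + 0.5  # rotate, then shift back into [0, 1]

# The four held-out orientations used in the figure.
orientations = [0, 90, 180, 270]
```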
R:
Latent space: [8, 16, 32, 64, 96, 128, 256, 512]
batch size: [10, 25, 50, 100, 150]
Learning rate: [0.0001, 0.001, 0.01, 0.1]
KL tolerance: [0, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
beta term for beta-VAE: [1, 3, 5, 7, 9, 11]
number of epochs: [10, 100, 250, 500, 1000, 1200, 1500, 3000]
autoencoders: [convolutional autoencoder, VAE, convVAE, beta-VAE, infoVAE, condVAE]
alternative Rs: [optical flow, background subtraction with original KNN, background subtraction with opening KNN]
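As a sketch, the sweep over R listed above can be encoded as a search space and sampled from; the values are copied from the lists (KL tolerance omitted for brevity), while the uniform random-sampling scheme is an assumption rather than the search procedure actually used.

```python
import random

# Search space for the representation module R, taken from the lists above.
R_SPACE = {
    "latent_size":   [8, 16, 32, 64, 96, 128, 256, 512],
    "batch_size":    [10, 25, 50, 100, 150],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "beta":          [1, 3, 5, 7, 9, 11],   # only used by the beta-VAE
    "epochs":        [10, 100, 250, 500, 1000, 1200, 1500, 3000],
}

def sample_config(space, seed=None):
    """Draw one configuration uniformly at random from the grid (an assumed
    sampling scheme; the exact search procedure may differ)."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in space.items()}
```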
D:
RNN size: [64, 128, 256, 512, 768, 1024, 2048]
batch size: [10, 50, 100]
sequence length: [8, 10, 12]
learning rate: [0.001, 0.005, 0.009, 0.0001, 0.0003, 0.003]
number of mixtures: [1, 2, 3, 4, 5, 10, 15]
alternative predictors: [direct image prediction without agent positions, simple Gaussian prediction without agent positions, same RNN-MDN without agent positions in the input]
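To make the "number of mixtures" sweep for the RNN-MDN concrete, here is a minimal NumPy sketch of turning a flat MDN output vector into a sampled 2-D position for K mixture components. The parameterisation (mixture logits, means, log standard deviations) is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def split_mdn_params(output, k):
    """Split a flat RNN output of length 5*k into K mixture components for a
    2-D position: mixture logits, means and log standard deviations."""
    logits  = output[:k]
    means   = output[k:3 * k].reshape(k, 2)
    log_std = output[3 * k:5 * k].reshape(k, 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over mixture weights
    return weights, means, np.exp(log_std)

def sample_position(output, k, rng=None):
    """Draw one predicted (x, y) position from the mixture."""
    rng = rng or np.random.default_rng()
    weights, means, stds = split_mdn_params(output, k)
    j = rng.choice(k, p=weights)             # pick a mixture component
    return rng.normal(means[j], stds[j])     # sample a 2-D position from it
```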
B:
LSTM size: [32, 64, 128, 256, 512, 768, 1024]
input embedding size: [32, 64, 96, 128, 256]
sequence length: [8, 10, 12, 15, 20]
batch size: [1, 10, 25, 50, 100]
learning rate: [0.005, 0.0005, 0.05, 0.01, 0.001, 0.0001]
optimisers: [Adam, RMSProp]
regularisation: [none, L2, L1+L2, L1, dropout]
lambda: [0.005, 0.0005, 0.05, 0.01, 0.001, 0.0001]
activation functions: [relu, selu, sigmoid, n/a, tanh, elu, leaky relu]
number of epochs: [10, 50, 100, 150]
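A sketch of how one B configuration from the swept values above might be instantiated with Keras; the layer arrangement (input embedding, single LSTM, linear output over the next position) and the use of L2 regularisation on the LSTM kernel are assumptions for illustration only.

```python
import tensorflow as tf

def build_b(lstm_size=128, embed_size=64, seq_len=8,
            learning_rate=0.001, l2_lambda=0.0005, activation="relu"):
    """Assemble one B configuration from the swept hyperparameters above
    (illustrative layer arrangement, not the paper's exact architecture)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 2)),                  # (x, y) per step
        tf.keras.layers.Dense(embed_size, activation=activation),   # input embedding
        tf.keras.layers.LSTM(
            lstm_size,
            kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
        tf.keras.layers.Dense(2),                                    # next (x, y)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model
```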
We reuse the models learned in the other experiments and consider several ways of extracting unsupervised and weakly supervised environmental models. To reiterate, each global dynamics module was trained on 4/5 of the datasets for a given task, while each R was assumed to be environment-specific. Training D requires labels for each of those 4/5 datasets, which makes it weakly supervised. To remove this dependence on labels, we consider substituting them with random uniform noise bounded by the range of allowed agent positions, which lie between 0 and 1 in the assumed normalised pixel space. This yields the fully unsupervised D*. We report our findings in Table III in the paper. Transfer to the robot task was performed from the HOTEL model, which we found to be the average performer, with ETH UNIV worse and UCY UNIV best. Transfer to the crowd task was performed by applying the B learned from the robot to all five datasets and reporting the average results.
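A minimal sketch of the label substitution that yields D*: the agent-position labels used to train D are replaced with uniform noise in [0, 1], the range of the normalised pixel space described above. The array shape (frames, agents, 2) is an assumption for illustration.

```python
import numpy as np

def unsupervised_labels(num_frames, max_agents, rng=None):
    """Replace the agent-position labels used to train D with uniform noise
    in [0, 1], giving the fully unsupervised D* variant described above
    (the (frames, agents, 2) shape is an assumption)."""
    rng = rng or np.random.default_rng()
    return rng.uniform(0.0, 1.0, size=(num_frames, max_agents, 2))
```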