Quickly loses any moving agents and maintains only the static, discrete cues.
Keeps more information for longer and is less prone to generating agents in random places.
Predicts 5 steps after observing only 3 on ETH Hotel.
Keeps some of the existing agents but is more likely to generate random agents (upper-right corner).
Here, we provide an illustration of the most frequently visited regions in the crowded-scene datasets, showing that Univ consists entirely of agents moving in a direction orthogonal to that of the training data.
This makes the challenge harder for conventional approaches unless they have been pre-trained on simulated or perturbed data. Alternatively, such examples can be used to examine the ability of a network to adapt to unseen data.
This can bring benefits in contexts where motion is determined by a specific plan that requires a drastic change in direction (see the robot gripper example) or in scenarios with adversaries or other out-of-distribution examples.
The InfoVAE is better at reconstructing the individual agents in a scene.
The results are less accurate; in fact, little to no motion is detected over time with the β-VAE.
Can’t deal with background
Can’t deal with shadows
Individual background subtraction omits many of the agents and cannot fully remove the individual noise. Additionally, its quality varies considerably across datasets.
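For reference, a minimal sketch of the background-subtraction variants compared here (a KNN subtractor, with and without a morphological opening step), using OpenCV. The kernel size and video path are illustrative assumptions, not the paper's settings.

```python
import cv2

# KNN background subtractor, as in the "original KNN" variant.
subtractor = cv2.createBackgroundSubtractorKNN()

# Small structuring element for the "opening KNN" variant (size is an assumption).
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

cap = cv2.VideoCapture("scene.avi")  # illustrative path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Morphological opening removes isolated noise pixels, but can also
    # erode small or partially occluded agents, as noted above.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cap.release()
```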
Evaluation of the generalisation performance of the proposed solution. In this experiment we trained on 3 out of the 4 crowd prediction datasets and rotated the 4th one by 0, 90, 180 and 270 degrees. Results for each orientation are plotted on a new row. Each column corresponds to the held-out 4th dataset, namely (in order) ETH Hotel, ETH Univ, UCY Zara01, UCY Zara02. For each plot, we observe 1.6 seconds and predict between 3.2 and 9.6 seconds ahead. Lower is better. RDB in red, LSTM in yellow, Social LSTM in blue. Overall, RDB performs consistently well over time.
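A minimal sketch of how the held-out dataset could be rotated for this experiment, assuming agent positions in the normalised [0, 1] pixel space mentioned later and rotation about the frame centre; both assumptions are ours, for illustration.

```python
import numpy as np

def rotate_trajectories(xy, degrees):
    """Rotate (N, 2) agent positions in normalised [0, 1] pixel space
    about the frame centre by a multiple of 90 degrees."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    centred = xy - 0.5            # move the rotation origin to the frame centre
    return centred @ rot.T + 0.5  # rotate, then shift back into [0, 1]

# The four held-out orientations used in the figure.
orientations = [0, 90, 180, 270]
```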
R:
Latent space: [8, 16, 32, 64, 96, 128, 256, 512]
batch size: [10, 25, 50, 100, 150]
Learning rate: [0.0001, 0.001, 0.01, 0.1]
KL tolerance: [0, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
beta term for beta-VAE: [1, 3, 5, 7, 9, 11]
number of epochs: [10, 100, 250, 500, 1000, 1200, 1500, 3000]
autoencoders: [convolutional autoencoder, VAE, convVAE, beta-VAE, infoVAE, condVAE]
alternative Rs: [optical flow, background subtraction with original KNN, background subtraction with opening KNN]
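As a sketch, the sweep over R listed above can be encoded as a search space and sampled from; the values are copied from the lists (KL tolerance omitted for brevity), while the uniform random-sampling scheme is an assumption rather than the search procedure actually used.

```python
import random

# Search space for the representation module R, taken from the lists above.
R_SPACE = {
    "latent_size":   [8, 16, 32, 64, 96, 128, 256, 512],
    "batch_size":    [10, 25, 50, 100, 150],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "beta":          [1, 3, 5, 7, 9, 11],   # only used by the beta-VAE
    "epochs":        [10, 100, 250, 500, 1000, 1200, 1500, 3000],
}

def sample_config(space, seed=None):
    """Draw one configuration uniformly at random from the grid (an assumed
    sampling scheme; the exact search procedure may differ)."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in space.items()}
```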
D:
RNN size: [64, 128, 256, 512, 768, 1024, 2048]
batch size: [10, 50, 100]
sequence length: [8, 10, 12]
learning rate: [0.001, 0.005, 0.009, 0.0001, 0.0003, 0.003]
number of mixtures: [1, 2, 3, 4, 5, 10, 15]
alternative predictors: [direct image prediction without agent positions, simple Gaussian prediction without agent positions, same RNN-MDN without agent positions in the input]
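To make the "number of mixtures" sweep for the RNN-MDN concrete, here is a minimal NumPy sketch of turning a flat MDN output vector into a sampled 2-D position for K mixture components. The parameterisation (mixture logits, means, log standard deviations) is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def split_mdn_params(output, k):
    """Split a flat RNN output of length 5*k into K mixture components for a
    2-D position: mixture logits, means and log standard deviations."""
    logits  = output[:k]
    means   = output[k:3 * k].reshape(k, 2)
    log_std = output[3 * k:5 * k].reshape(k, 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over mixture weights
    return weights, means, np.exp(log_std)

def sample_position(output, k, rng=None):
    """Draw one predicted (x, y) position from the mixture."""
    rng = rng or np.random.default_rng()
    weights, means, stds = split_mdn_params(output, k)
    j = rng.choice(k, p=weights)             # pick a mixture component
    return rng.normal(means[j], stds[j])     # sample a 2-D position from it
```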
B:
LSTM size: [32, 64, 128, 256, 512, 768, 1024]
input embedding size: [32, 64, 96, 128, 256]
sequence length: [8, 10, 12, 15, 20]
batch size: [1, 10, 25, 50, 100]
learning rate: [0.005, 0.0005, 0.05, 0.01, 0.001, 0.0001]
optimisers: [Adam, RMSProp]
regularisation: [none, L2, L1+L2, L1, dropout]
lambda: [0.005, 0.0005, 0.05, 0.01, 0.001, 0.0001]
activation functions: [relu, selu, sigmoid, n/a, tanh, elu, leaky relu]
number of epochs: [10, 50, 100, 150]
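A sketch of how one B configuration from the swept values above might be instantiated with Keras; the layer arrangement (input embedding, single LSTM, linear output over the next position) and the use of L2 regularisation on the LSTM kernel are assumptions for illustration only.

```python
import tensorflow as tf

def build_b(lstm_size=128, embed_size=64, seq_len=8,
            learning_rate=0.001, l2_lambda=0.0005, activation="relu"):
    """Assemble one B configuration from the swept hyperparameters above
    (illustrative layer arrangement, not the paper's exact architecture)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 2)),                  # (x, y) per step
        tf.keras.layers.Dense(embed_size, activation=activation),   # input embedding
        tf.keras.layers.LSTM(
            lstm_size,
            kernel_regularizer=tf.keras.regularizers.l2(l2_lambda)),
        tf.keras.layers.Dense(2),                                    # next (x, y)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model
```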
We reuse the models learned in the other experiments and consider several ways of extracting unsupervised and weakly supervised environmental models. To reiterate, each global dynamics module was trained on 4/5 of the datasets for a given task, while each R was assumed to be environment-specific. Training D requires labels for each of those 4/5 datasets, which makes it weakly supervised. To remove this dependence on labels, we consider substituting them with random uniform noise bounded by the range of allowed agent positions, which lie between 0 and 1 in the assumed normalised pixel space. This yields the fully unsupervised D*. We report our findings in Table III in the paper. Transfer to the robot task was performed from the HOTEL model, which we found to be the average performer, with ETH UNIV worse and UCY UNIV best. Transfer to the crowd task was performed by applying the B learned from the robot to all five datasets and reporting the average results.
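A minimal sketch of the label substitution that yields D*: the agent-position labels used to train D are replaced with uniform noise in [0, 1], the range of the normalised pixel space described above. The array shape (frames, agents, 2) is an assumption for illustration.

```python
import numpy as np

def unsupervised_labels(num_frames, max_agents, rng=None):
    """Replace the agent-position labels used to train D with uniform noise
    in [0, 1], giving the fully unsupervised D* variant described above
    (the (frames, agents, 2) shape is an assumption)."""
    rng = rng or np.random.default_rng()
    return rng.uniform(0.0, 1.0, size=(num_frames, max_agents, 2))
```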