
Object Trajectory Prediction

In the following video, we show object trajectory prediction test results on the Stanford Drone Dataset.
The input is N concatenated glimpses (glimpse bounding boxes are shown in red) centered at the object of interest, and the output is a motion trajectory or samples of motion trajectories.
Along the rows, from top to bottom: N=1 -- N=5 -- N=8
Along the columns, from left to right: regression -- variational approximation (VAE) -- K-best-loss (MCmin)
Ground-truth future trajectories are shown in green, regression predictions in red, and samples from K-best-loss and the variational approximation in blue.
Notice that a longer visual history leads to sharper and more accurate trajectories for all three models. Notice also the nice multimodal trajectory distribution from K-best-loss, especially near intersections or when N=1, where the uncertainty is very high. Interestingly, the position of the object (on the left or right part of the road) often conveys enough information to determine its future direction.
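
For reference, the K-best-loss can be sketched in a few lines: draw K trajectory samples, score each against the ground truth, and backpropagate only through the best one. Below is a minimal PyTorch sketch with tensor shapes of our own choosing, not the actual training code.

    import torch

    def k_best_loss(preds, target):
        """Best-of-K ("MCmin") loss. preds: (K, B, T, 2) sampled
        trajectories; target: (B, T, 2) ground truth. Only the closest
        sample per example receives gradient."""
        errs = ((preds - target.unsqueeze(0)) ** 2).mean(dim=(2, 3))  # (K, B)
        best, _ = errs.min(dim=0)  # error of the best sample, per example
        return best.mean()

Because only the closest sample is penalized, the remaining samples are free to cover other plausible futures, which is where the multimodality above comes from.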



Frame Prediction through Warping with the Predicted Motion Field
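
The warping step itself is a standard differentiable bilinear sampler: the network predicts a flow field and the next frame is obtained by resampling the current frame along that flow. A minimal PyTorch sketch of backward warping (our own shapes and names, not the authors' code):

    import torch
    import torch.nn.functional as F

    def warp(frame, flow):
        """Warp frame (B, C, H, W) with a flow field (B, 2, H, W) given
        in pixels, via differentiable bilinear sampling."""
        B, _, H, W = frame.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
        coords = base.unsqueeze(0) + flow  # where to sample each output pixel
        # grid_sample expects coordinates normalized to [-1, 1]
        gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
        gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
        return F.grid_sample(frame, grid, align_corners=True)

Since the loss is placed on the warped frame, the flow field can be learned without direct flow supervision.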



Human3.6M dataset
We use all the videos contained in the dataset, depicting various activities such as Walking, Waiting, Discussing, Eating, Posing, Sitting, Purchasing, Smoking, etc. We train our model from scratch on all but the 5th actor and test on the 5th actor of the dataset. We show below randomly chosen samples from the test set of the VA (variational approximation) model, which takes 4 concatenated frames as input.
We show with a green border the conditioning input frame history and with a red border the predicted frame and its corresponding optical flow field.




We show below randomly chosen samples from the test set of the MCbest model, which takes 1 frame as input (high uncertainty). In the 1st, 3rd, and 5th rows, we show with a green border the conditioning input frame and with a red border a predicted frame and its corresponding optical flow field. In the 2nd, 4th, and 6th rows, we show multiple samples from our model concatenated as a GIF, to demonstrate the diversity of its predictions. Notice the variations in the person's movements, particularly of the arms and legs.
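
To give a sense of how such diverse samples are drawn at test time, here is a hypothetical sketch: encode and decode_flow are stand-ins for the model's recognition and decoding components, warp is the sampler sketched above, and the 256-dim prior sample is an assumption.

    import torch

    def sample_futures(frame, encode, decode_flow, warp, n_samples=4):
        """Sample several futures for one conditioning frame by decoding
        different latent codes into flow fields and warping."""
        feats = encode(frame)
        futures = []
        for _ in range(n_samples):
            z = torch.randn(frame.shape[0], 256)  # draw from the N(0, I) prior
            flow = decode_flow(feats, z)          # (B, 2, H, W), hypothetical
            futures.append(warp(frame, flow))
        return futures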








MoSeg dataset
We use the designated train and test split of the dataset. 
We show below random samples from the test set using our variational approximation model with convolutional stochastic sampling, late fusion of features in the recognition network, and a code length of 256. Despite the small size of the training set (~5000 frames), frame prediction generalizes well, thanks to learning the motion transformation rather than generating pixels directly. Stabilizing the video beforehand to eliminate camera motion would likely improve the results considerably; we did not do so, to keep the method clean. We show with a green border the conditioning input frame history (2 frames) and with a red border the predicted frame and its corresponding optical flow field. Notice that the flow field delineates the objects and that the predicted frame is a natural future evolution of the sequence.
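
The stochastic part of the variational model is the usual reparameterization trick; with convolutional stochastic sampling, the mean and log-variance produced by the recognition network are feature maps rather than flat vectors (our reading of the setup). A minimal sketch under these assumptions:

    import torch

    def sample_latent(mu, logvar):
        """Reparameterized draw from a diagonal Gaussian. mu and logvar
        can be feature maps, e.g. (B, 256, h, w) for a 256-channel code
        sampled at every spatial location."""
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(mu)

    def kl_to_standard_normal(mu, logvar):
        """KL(q(z|x) || N(0, I)), summed over all latent dimensions."""
        return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(),
                                dim=list(range(1, mu.dim())))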



Forecasting motion of human body joints
We use videos of the Walking activity. We train our model from scratch on all but the 5th actor and test on the 5th actor of the dataset. As input we use the last three video frames. We visualize only one ankle and one wrist, for clarity.
We show, from left to right: ground truth -- regression -- MCmin -- VAE
Notice that K-best-loss (MCmin) often predicts two modes (walking forward and walking backward), while the variational relaxation correctly predicts the unambiguous future motion. This is a limitation of forecasting on a budget: as long as the correct answer is contained in the budget (15 samples here), the model is not encouraged to improve its predictions further. In contrast, the variational approximation assigns the probability mass to the correct future outcome.
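
The budget effect is easy to see in a toy example of our own (not from the paper): with a min-over-K loss, gradient flows only through the closest sample, so the remaining samples are never pushed toward the target.

    import torch

    samples = torch.tensor([0.9, -1.2, 3.0], requires_grad=True)  # K = 3 guesses
    target = torch.tensor(1.0)
    loss = ((samples - target) ** 2).min()  # penalize only the best guess
    loss.backward()
    print(samples.grad)  # tensor([-0.2000, 0.0000, 0.0000]): only the best moves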