Predicted point trajectories from TRAJAN, colored by their reconstruction error with respect to BootsTAPIR (distinguishing well reconstructed from poorly reconstructed tracks). The video on the left has plausible motion but implausible frame-level appearance, while the video on the right has plausible frame-level appearance but implausible motion (the fingers morph and disappear between frames). TRAJAN correctly identifies this discrepancy by focusing on motion irrespective of appearance.
Three examples of localizing motion errors in space and time using TRAJAN. In each example we show the full video on the left. In the middle, we show the Average Jaccard (AJ) across all points for each frame in the video. This allows us to localize motion errors in time, i.e. the frames for which the AJ is lowest. On the right we show a subset of the video around the frame with the lowest AJ. Here we overlay the point tracks and color code them as before (distinguishing well reconstructed from poorly reconstructed tracks). This lets us localize motion errors in space. With this analysis, we can see that poorly reconstructed tracks occur on objects that morph in appearance.
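The per-frame localization described above can be sketched as follows. This is a simplified, hypothetical implementation, not the paper's code: it computes a per-frame Average Jaccard between reference and reconstructed tracks (using the standard TAP-style pixel thresholds of 1, 2, 4, 8, and 16) and picks the frame where AJ is lowest. The array shapes and the exact aggregation are assumptions.

```python
import numpy as np

def per_frame_average_jaccard(gt_tracks, pred_tracks, gt_vis, pred_vis,
                              thresholds=(1, 2, 4, 8, 16)):
    """Per-frame Average Jaccard between reference and reconstructed tracks.

    gt_tracks, pred_tracks: (num_points, num_frames, 2) pixel coordinates.
    gt_vis, pred_vis:       (num_points, num_frames) boolean visibility.
    Returns an array of shape (num_frames,) with the AJ of each frame.
    """
    # Euclidean distance between corresponding points at every frame.
    dist = np.linalg.norm(gt_tracks - pred_tracks, axis=-1)  # (P, T)
    ajs = []
    for thr in thresholds:
        within = dist < thr
        # A true positive is a point visible in both and within threshold.
        tp = (gt_vis & pred_vis & within).sum(axis=0)
        fp = (pred_vis & ~(gt_vis & within)).sum(axis=0)
        fn = (gt_vis & ~(pred_vis & within)).sum(axis=0)
        ajs.append(tp / np.maximum(tp + fp + fn, 1))
    return np.mean(ajs, axis=0)  # (T,)
```

Given the per-frame AJ curve, `int(np.argmin(aj))` gives the frame around which to inspect the video, mirroring the middle and right panels of the figure.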
We filter our human study on the EvalCrafter dataset to obtain subsets of videos that either have high camera speed and low object speed (left) or high object speed and low camera speed (right), as indicated by human annotators. We correlate metrics with human ratings for each subset separately. In both cases, TRAJAN outperforms all other metrics. TRAJAN is more highly correlated with human annotations for high camera speed and low object speed, and this setting also has better inter-rater agreement (lower inter-rater sigma).
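The per-subset analysis above can be sketched as follows. This is an illustrative, hypothetical helper rather than the paper's evaluation code: it computes the Pearson correlation between a metric's scores and mean human ratings on one subset, along with the inter-rater sigma (per-video standard deviation across raters, averaged over videos). The array layout and the choice of Pearson correlation are assumptions.

```python
import numpy as np

def metric_human_correlation(metric_scores, human_ratings):
    """Correlate a video-quality metric with human ratings on one subset.

    metric_scores: (num_videos,) metric value for each video.
    human_ratings: (num_videos, num_raters) per-rater scores.
    Returns (pearson_corr, inter_rater_sigma).
    """
    mean_rating = human_ratings.mean(axis=1)
    # Pearson correlation between the metric and the mean human rating.
    corr = np.corrcoef(metric_scores, mean_rating)[0, 1]
    # Inter-rater sigma: rater spread per video, averaged over videos.
    sigma = human_ratings.std(axis=1).mean()
    return corr, sigma
```

Running this separately on the high-camera-speed and high-object-speed subsets yields one (correlation, sigma) pair per subset, as compared in the figure.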