Zihui Xue1*,2, Kristen Grauman2, Dima Damen1, Andrew Zisserman1, Tengda Han1
1Google DeepMind, 2The University of Texas at Austin
*Work done during internship at Google DeepMind
arXiv | code (coming soon)
Teaser: match each camera trajectory (1)–(3) to its action.
Options: (a) basketball layup; (b) walk down a long hallway; (c) move the tire from left.
Answers: (1) → (c) move the tire from left; (2) → (a) basketball layup; (3) → (b) walk down a long hallway.
One sees the environment not with the eyes but with the eyes-in-the-head-on-the-body-resting-on-the-ground.
—James J. Gibson
Animation credit: Ralph Ammer's Blog
(a) We propose CamFormer to project camera trajectories into a joint embedding space with text.
(b) CamFormer Architecture (with Contextualized Trajectory Encoding)
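The page describes CamFormer as projecting camera trajectories into a joint embedding space with text. As a rough illustration of what such trajectory-text alignment could look like, here is a minimal numpy sketch of a symmetric contrastive (InfoNCE-style) objective over paired trajectory and caption features. The dimensions, linear projection heads, and temperature are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not from the paper.
TRAJ_DIM, TEXT_DIM, EMBED_DIM, BATCH = 12, 32, 8, 4

# Stand-in features for a batch of encoded camera trajectories
# and their paired text captions (row i of each is a matched pair).
traj_feats = rng.normal(size=(BATCH, TRAJ_DIM))
text_feats = rng.normal(size=(BATCH, TEXT_DIM))

# Linear projection heads into the shared embedding space.
W_traj = rng.normal(size=(TRAJ_DIM, EMBED_DIM))
W_text = rng.normal(size=(TEXT_DIM, EMBED_DIM))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

z_traj = l2_normalize(traj_feats @ W_traj)
z_text = l2_normalize(text_feats @ W_text)

# Cosine-similarity logits; matched pairs lie on the diagonal.
temperature = 0.07
logits = z_traj @ z_text.T / temperature

def cross_entropy(logits, targets):
    # Row-wise log-softmax, then pick out the target entries.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(BATCH)
# Symmetric InfoNCE: trajectory->text plus text->trajectory.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print(loss)
```

Minimizing such a loss pulls each trajectory embedding toward its paired caption and pushes it away from the other captions in the batch, which is what makes text-based retrieval and multiple-choice selection (as in the examples below) possible from trajectories alone.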
A 5-minute silent video designed to supplement the paper
Results Overview
CamFormer achieves substantial performance gains over baseline models and methods on 10 downstream tasks across 5 datasets.
Embedding Space Visualization
PCA projection of CamFormer embeddings on unseen Ego-Exo4D camera trajectories, colored by the 8 activity labels. CamFormer groups trajectories that share action semantics, even when the specific motion patterns within an activity differ.
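A PCA view like the one above can be produced by mean-centering the embedding matrix and projecting onto its top principal components. The sketch below shows this with a plain SVD on stand-in data; the embedding count and dimensionality are made-up placeholders, not the actual CamFormer embedding sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in embeddings: 100 trajectories x 64 dims (hypothetical sizes).
emb = rng.normal(size=(100, 64))

# PCA via SVD on the mean-centered matrix.
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the top-2 principal directions for a 2-D scatter plot,
# which can then be colored by activity label.
coords_2d = centered @ Vt[:2].T
print(coords_2d.shape)
```

The rows of `Vt` are the principal directions ordered by explained variance, so `coords_2d[:, 0]` always carries at least as much variance as `coords_2d[:, 1]`.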
Egocentric Ego-Exo4D Example
Options:
A. C holds the orange hold with his right hand.
B. C rises towards the bouldering wall.
C. C observes the bouldering wall.
D. C touches the yellow hold with his left hand.
E. C lands on the mat.
CamFormer prediction (camera trajectory as input): E ✅
Video baseline prediction (ego video as input): D ❌
Egocentric Nymeria Example
A. C moves her left foot to the right, taps the floor and steps to the left with her left foot.
B. C raises her left foot behind her and slightly raises her right leg on her side.
C. C is stepping in place from side to side with her left and right legs alternately.
D. C bends both of her legs alternately.
E. C walks backwards with her left and right leg.
CamFormer prediction (camera trajectory as input): C ✅
Video baseline prediction (ego video as input): B ❌
Exocentric DynPose-100K Example
A. A finger pointing at a cutting board with vegetables on it.
B. A man is wearing a cap and talking to the camera in front of a backdrop of mountains.
C. There is a woman standing in front of a garden filled with vegetables, she is smiling and holding a camera.
D. A veterinarian is using a syringe to draw blood from a newborn puppy.
E. A man is hiking in a forest and taking a selfie with his phone.
CamFormer prediction (camera trajectory as input): A ✅
UT Austin is supported in part by the IFML NSF AI Institute. We thank Junyu Xie for assistance and valuable suggestions on the camera pose estimation pipeline; Jean-Baptiste Alayrac, Carl Doersch and Ignacio Rocco for constructive feedback; and Ang Cao, Yue Zhao and Lorenzo Torresani for helpful discussions.
If you find our project helpful to your research, you can cite us with:
@article{xue25seewithoutpixels,
title={Seeing without Pixels: Perception from Camera Trajectories},
author={Xue, Zihui and Grauman, Kristen and Damen, Dima and Zisserman, Andrew and Han, Tengda},
journal={arXiv preprint arXiv:2511.21681},
year={2025}
}