Zihui Xue1*,2, Kristen Grauman2, Dima Damen1, Andrew Zisserman1, Tengda Han1
1Google DeepMind, 2The University of Texas at Austin
*Work done during internship at Google DeepMind
arXiv | code (coming soon)
Teaser: match each camera trajectory (1)–(3) to its action.
Options: (a) basketball layup; (b) walk down a long hallway; (c) move the tire from left.
Answers: (1) → (c) move the tire from left; (2) → (a) basketball layup; (3) → (b) walk down a long hallway.
One sees the environment not with the eyes but with the eyes-in-the-head-on-the-body-resting-on-the-ground.
—James J. Gibson
Animation credit: Ralph Ammer's Blog
(a) We propose CamFormer to project camera trajectories into a joint embedding space with text.
(b) CamFormer Architecture (with Contextualized Trajectory Encoding)
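The page describes CamFormer as projecting camera trajectories into a joint embedding space with text. As a rough illustration of what such trajectory-text alignment could look like, here is a minimal numpy sketch of a symmetric contrastive (InfoNCE-style) objective over paired trajectory and caption features. The dimensions, linear projection heads, and temperature are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not from the paper.
TRAJ_DIM, TEXT_DIM, EMBED_DIM, BATCH = 12, 32, 8, 4

# Stand-in features for a batch of encoded camera trajectories
# and their paired text captions (row i of each is a matched pair).
traj_feats = rng.normal(size=(BATCH, TRAJ_DIM))
text_feats = rng.normal(size=(BATCH, TEXT_DIM))

# Linear projection heads into the shared embedding space.
W_traj = rng.normal(size=(TRAJ_DIM, EMBED_DIM))
W_text = rng.normal(size=(TEXT_DIM, EMBED_DIM))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

z_traj = l2_normalize(traj_feats @ W_traj)
z_text = l2_normalize(text_feats @ W_text)

# Cosine-similarity logits; matched pairs lie on the diagonal.
temperature = 0.07
logits = z_traj @ z_text.T / temperature

def cross_entropy(logits, targets):
    # Row-wise log-softmax, then pick out the target entries.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(BATCH)
# Symmetric InfoNCE: trajectory->text plus text->trajectory.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print(loss)
```

Minimizing such a loss pulls each trajectory embedding toward its paired caption and pushes it away from the other captions in the batch, which is what makes text-based retrieval and multiple-choice selection (as in the examples below) possible from trajectories alone.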
A 5-minute silent video designed to supplement the paper
Results Overview
CamFormer achieves substantial performance gains over baseline models and methods on 10 downstream tasks across 5 datasets.
Embedding Space Visualization
PCA projection of CamFormer embeddings on unseen Ego-Exo4D camera trajectories, colored by the 8 activity labels. CamFormer groups trajectories that share action semantics, even when the specific motion patterns within an activity differ.
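A PCA view like the one above can be produced by mean-centering the embedding matrix and projecting onto its top principal components. The sketch below shows this with a plain SVD on stand-in data; the embedding count and dimensionality are made-up placeholders, not the actual CamFormer embedding sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in embeddings: 100 trajectories x 64 dims (hypothetical sizes).
emb = rng.normal(size=(100, 64))

# PCA via SVD on the mean-centered matrix.
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the top-2 principal directions for a 2-D scatter plot,
# which can then be colored by activity label.
coords_2d = centered @ Vt[:2].T
print(coords_2d.shape)
```

The rows of `Vt` are the principal directions ordered by explained variance, so `coords_2d[:, 0]` always carries at least as much variance as `coords_2d[:, 1]`.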
Egocentric Ego-Exo4D Example
Options:
A. C holds the orange hold with his right hand.
B. C rises towards the bouldering wall.
C. C observes the bouldering wall.
D. C touches the yellow hold with his left hand.
E. C lands on the mat.
CamFormer prediction (camera trajectory as input): E ✅
Video baseline prediction (ego video as input): D ❌
Egocentric Nymeria Example
A. C moves her left foot to the right, taps the floor and steps to the left with her left foot.
B. C raises her left foot behind her and slightly raises her right leg on her side.
C. C is stepping in place from side to side with her left and right legs alternately.
D. C bends both of her legs alternately.
E. C walks backwards with her left and right leg.
CamFormer prediction (camera trajectory as input): C ✅
Video baseline prediction (ego video as input): B ❌
Exocentric DynPose-100K Example
A. A finger pointing at a cutting board with vegetables on it.
B. A man is wearing a cap and talking to the camera in front of a backdrop of mountains.
C. There is a woman standing in front of a garden filled with vegetables, she is smiling and holding a camera.
D. A veterinarian is using a syringe to draw blood from a newborn puppy.
E. A man is hiking in a forest and taking a selfie with his phone.
CamFormer prediction (camera trajectory as input): A ✅
UT Austin is supported in part by the IFML NSF AI Institute. We thank Junyu Xie for assistance and valuable suggestions on the camera pose estimation pipeline; Jean-Baptiste Alayrac, Carl Doersch and Ignacio Rocco for constructive feedback; and Ang Cao, Yue Zhao and Lorenzo Torresani for helpful discussions.
If you find our project helpful to your research, you can cite us with:
@article{xue25seewithoutpixels,
title={Seeing without Pixels: Perception from Camera Trajectories},
author={Xue, Zihui and Grauman, Kristen and Damen, Dima and Zisserman, Andrew and Han, Tengda},
journal={arXiv preprint arXiv:2511.21681},
year={2025}
}