Research conducted as a member of the Applied Motion Lab and the Vision Lab, tackling problems in quantifying and simulating human behavior.
Advised by Dr. Stephen J. Guy and Dr. Hyun Soo Park.
We leverage Ego-Exo4D demonstrations to augment VLMs in two ways: by understanding spatial task affordances, and by localizing those tasks relative to the egocentric viewer. We then demonstrate this system on a simulated robot.
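As a rough illustration of the localization step, here is a minimal sketch assuming the VLM produces a task-affordance heatmap over the egocentric image and that an aligned depth map and pinhole intrinsics are available; the peak-affordance pixel is back-projected into the viewer's frame to give the robot a 3D task location. All names and shapes here are illustrative, not the actual system interface.

```python
import numpy as np

def localize_task(affordance: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project the most task-affordant pixel into the egocentric camera frame.

    affordance : (H, W) heatmap scored by the VLM (hypothetical output format)
    depth      : (H, W) metric depth image aligned with the heatmap
    K          : (3, 3) pinhole camera intrinsics
    """
    # Pixel with the highest task-affordance score.
    v, u = np.unravel_index(np.argmax(affordance), affordance.shape)
    z = depth[v, u]

    # Standard pinhole back-projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Toy example with synthetic inputs.
H, W = 480, 640
K = np.array([[500.0, 0, W / 2], [0, 500.0, H / 2], [0, 0, 1]])
affordance = np.zeros((H, W)); affordance[300, 400] = 1.0
depth = np.full((H, W), 1.5)
print(localize_task(affordance, depth, K))  # task location in the egocentric frame
```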
A massive-scale exo+egocentric video-language dataset and benchmark suite for skilled activities.
My contributions included assisting with dataset standardization, on-site data collection, and annotation.
(a) Example of paired video data
(b) Example 3D Reconstruction (Basketball)
Using a small dataset of human poses, we learn a geometry-aware pose prediction network that augments the reward function for reinforcement learning. Our system improves robot efficiency over the state of the art on house-cleaning tasks.
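A minimal sketch of the reward-shaping idea, assuming a toy MLP standing in for the geometry-aware pose network and a simple L2 pose comparison; the names `PosePrior`, `augmented_reward`, and the weight `w_pose` are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn

class PosePrior(nn.Module):
    """Toy stand-in for a geometry-aware pose prediction network: given scene
    geometry features, predict where a human would likely place their body."""
    def __init__(self, geom_dim=32, pose_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(geom_dim, 64), nn.ReLU(), nn.Linear(64, pose_dim))

    def forward(self, geom_feat):
        return self.net(geom_feat)

def augmented_reward(task_reward, robot_pose, geom_feat, pose_net, w_pose=0.1):
    """Combine the task reward with a shaping term that rewards the robot for
    matching the human-like pose predicted from the scene geometry
    (assumption: poses are compared with a plain L2 distance)."""
    with torch.no_grad():
        predicted_pose = pose_net(geom_feat)
    shaping = -torch.norm(robot_pose - predicted_pose)
    return task_reward + w_pose * shaping

# Toy example.
pose_net = PosePrior()
r = augmented_reward(task_reward=torch.tensor(1.0),
                     robot_pose=torch.zeros(3),
                     geom_feat=torch.randn(32),
                     pose_net=pose_net)
print(float(r))
```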
A massive-scale egocentric video-language dataset and benchmark suite for everyday activities.
My contributions included 3D reconstruction of first-person walking videos and a benchmark implementation to predict future trajectories given a first-person image.
(a) Necessary Geometry
(b) Future trajectory prediction
Our system jointly predicts the scene's navigational affordances and the observer's future motion as implicit fields aligned with image-space features.
(a) First-Person Image
(b) Inferred Walkability
(c) Goal at House
(d) Goal at Horizon
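The sketch below illustrates the implicit-field idea under simplified assumptions: a toy convolutional backbone produces image-space features, `grid_sample` interpolates them at continuous query coordinates, and a small MLP decodes a walkability score and a future-motion offset per query. The architecture sizes and output parameterization are placeholders, not the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitNavigationField(nn.Module):
    """Sketch of an implicit-field decoder: image-space features are sampled at
    continuous query coordinates and decoded into a walkability score and a
    future-motion offset. Backbone and head sizes are illustrative only."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 2, 128), nn.ReLU(),
            nn.Linear(128, 3),  # [walkability logit, motion dx, motion dy]
        )

    def forward(self, image, queries):
        # image:   (B, 3, H, W) first-person frame
        # queries: (B, N, 2) coordinates in [-1, 1] image space
        feats = self.backbone(image)                               # (B, C, H', W')
        grid = queries.unsqueeze(2)                                # (B, N, 1, 2)
        sampled = F.grid_sample(feats, grid, align_corners=True)   # (B, C, N, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)             # (B, N, C)
        out = self.decoder(torch.cat([sampled, queries], dim=-1))
        walkability = torch.sigmoid(out[..., :1])
        future_motion = out[..., 1:]
        return walkability, future_motion

# Toy query: one image, four points.
model = ImplicitNavigationField()
img = torch.randn(1, 3, 256, 256)
pts = torch.rand(1, 4, 2) * 2 - 1
w, m = model(img, pts)
print(w.shape, m.shape)  # (1, 4, 1), (1, 4, 2)
```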
We optimize differentiable fields based on sparse user-defined rules to represent navigation policies defined over all of space for mobile agents.
Example of mobile robot following a navigation field around obstacles.
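Below is a minimal sketch of optimizing such a field with two assumed example rules ("point toward the goal" and "point away from the obstacle when close"); the rule forms, loss weights, and network size are illustrative only.

```python
import torch
import torch.nn as nn

# Sketch: a differentiable navigation field mapping 2D position -> desired velocity,
# fit by gradient descent to two sparse user-defined rules (assumed forms):
#   1) everywhere, the velocity should point toward the goal;
#   2) near the obstacle, the velocity should point away from it.
field = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))
goal = torch.tensor([4.0, 0.0])
obstacle, obstacle_radius = torch.tensor([2.0, 0.0]), 1.0

optimizer = torch.optim.Adam(field.parameters(), lr=1e-2)
for step in range(500):
    pts = torch.rand(256, 2) * 6 - 1          # sample workspace positions
    vel = field(pts)

    # Rule 1: align the field with the direction toward the goal.
    to_goal = nn.functional.normalize(goal - pts, dim=-1)
    goal_loss = (1 - nn.functional.cosine_similarity(vel, to_goal, dim=-1)).mean()

    # Rule 2: near the obstacle, penalize velocities pointing toward it.
    from_obstacle = pts - obstacle
    near = (from_obstacle.norm(dim=-1) < obstacle_radius).float()
    avoid_loss = (near * nn.functional.relu(
        -nn.functional.cosine_similarity(vel, from_obstacle, dim=-1))).mean()

    loss = goal_loss + 5.0 * avoid_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Query the optimized policy field at an arbitrary position.
print(field(torch.tensor([[0.0, 0.5]])))
```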
Applying a novel deep learning framework to discovering and simulating the equations that govern how agents in a crowd move as a continuum. Images show estimates of crowd flow and density.
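For intuition about the continuum view (not the learned framework itself), the sketch below evaluates the finite-difference residual of the standard continuity equation on density and flow grids; the discovered governing equations play a role analogous to this hand-written conservation law.

```python
import numpy as np

def continuity_residual(rho, vx, vy, dt=1.0, dx=1.0, dy=1.0):
    """Finite-difference residual of the continuity equation
        d(rho)/dt + d(rho*vx)/dx + d(rho*vy)/dy = 0
    for crowd density rho(t, x, y) and flow (vx, vy)(t, x, y).
    A small residual means the estimated density and flow are consistent
    with mass-conserving continuum crowd motion."""
    drho_dt = np.gradient(rho, dt, axis=0)
    dflux_dx = np.gradient(rho * vx, dx, axis=1)
    dflux_dy = np.gradient(rho * vy, dy, axis=2)
    return drho_dt + dflux_dx + dflux_dy

# Toy example: a density bump advected at constant velocity satisfies continuity.
T, H, W = 8, 32, 32
t, x, y = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
rho = np.exp(-((x - 10 - t) ** 2 + (y - 16) ** 2) / 20.0)
vx, vy = np.ones_like(rho), np.zeros_like(rho)
print(np.abs(continuity_residual(rho, vx, vy)).mean())  # small residual
```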
Using a set of images captured at the same moment, we estimate the interactions between people by extracting the socially salient features of the scene. Several people are reconstructed in 3D, and an estimate of each person's gaze direction is used to determine the socially salient features. This is the first step of an ongoing project.
In-depth analysis of the method
Abbreviated version of final project paper
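As a simplified illustration of how gaze convergence can expose socially salient locations, the sketch below scores candidate 3D points by how many reconstructed gaze rays point at them; this aggregation rule is an assumption for illustration, not the project's final feature extractor.

```python
import numpy as np

def social_saliency(head_positions, gaze_directions, grid, threshold=0.95):
    """Score each grid point by how many people's gaze directions point at it
    (a simple assumed proxy for social saliency via gaze convergence).

    head_positions  : (P, 3) reconstructed 3D head positions
    gaze_directions : (P, 3) unit gaze vectors
    grid            : (G, 3) candidate 3D locations
    """
    # Unit vectors from each head to each grid point: (P, G, 3).
    to_grid = grid[None, :, :] - head_positions[:, None, :]
    to_grid /= np.linalg.norm(to_grid, axis=-1, keepdims=True)
    # Cosine alignment between each gaze and the direction to each grid point.
    alignment = np.einsum("pgc,pc->pg", to_grid, gaze_directions)
    return (alignment > threshold).sum(axis=0)

# Toy scene: three people all looking toward the origin.
heads = np.array([[2.0, 0, 1.6], [-2.0, 0, 1.6], [0, 2.0, 1.6]])
gazes = -heads / np.linalg.norm(heads, axis=-1, keepdims=True)
xs, ys = np.meshgrid(np.linspace(-3, 3, 13), np.linspace(-3, 3, 13))
grid = np.stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)], axis=-1)
scores = social_saliency(heads, gazes, grid)
print(grid[scores.argmax()])  # near the origin, where the gazes converge
```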
Using an RGB-D camera, I developed a robot-mounted vision system that interprets point clouds to support localization and planning. The robot can recognize and avoid obstacles in its environment using any planning algorithm that takes a collection of objects to avoid as input.
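A minimal sketch of the point-cloud interpretation step, assuming pinhole intrinsics, a crude height-based floor filter, and voxel-level grouping into obstacle centroids; the real pipeline may segment obstacles differently, so treat the helper names and thresholds as placeholders.

```python
import numpy as np

def depth_to_obstacles(depth, K, min_height=0.05, voxel=0.25):
    """Back-project a depth frame into a point cloud and summarize non-floor
    points as voxel-level obstacle centroids (a simple assumed clustering; any
    segmentation that yields objects to avoid could be substituted).

    depth : (H, W) metric depth image from the RGB-D camera
    K     : (3, 3) pinhole intrinsics
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    points = points[points[:, 2] > 0]                  # drop invalid pixels

    # Crude floor filter: with the camera y-axis pointing down, keep points
    # sufficiently above the camera's horizontal plane (assumes a level mount;
    # a fitted ground plane would be more robust).
    obstacles = points[-points[:, 1] > min_height]

    # Group the remaining points into coarse voxels, one centroid per voxel.
    clusters = {}
    for point, key in zip(obstacles, map(tuple, np.floor(obstacles / voxel).astype(int))):
        clusters.setdefault(key, []).append(point)
    return np.array([np.mean(c, axis=0) for c in clusters.values()])

# Toy frame: a far wall with a closer blob in the middle acting as an obstacle.
depth = np.full((120, 160), 3.0)
depth[40:80, 60:100] = 1.0
K = np.array([[100.0, 0, 80], [0, 100.0, 60], [0, 0, 1]])
print(depth_to_obstacles(depth, K).shape)  # (num_obstacle_voxels, 3)
```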
Dr. Guy and I are using the game Flappy Bird as a testbed for teaching an AI to play a game the way a human would. My play is recorded using only the pixel state of the screen and fed into a neural network, and the resulting weights serve as the policy for an automated controller. The controller then "sees" the game and decides what to do based on the image. The capture framework is written in C++ on Windows, handling OS-level timing, frame capture, and input capture; the machine learning component is implemented separately in Python. This ongoing project is temporarily on hiatus.
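A minimal sketch of the Python learning side under assumed specifics: grayscale 84x84 frames, a binary flap/no-flap action, and a small CNN trained by behavioral cloning on recorded play. The tensors below are placeholders for the data captured by the C++ layer.

```python
import torch
import torch.nn as nn

# Behavioral cloning from recorded (screen frame, flap/no-flap) pairs.
class FlappyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, 2),      # logits for [do nothing, flap]
        )

    def forward(self, frames):      # frames: (B, 1, 84, 84) grayscale screens
        return self.net(frames)

policy = FlappyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch standing in for recorded human play (frames + chosen actions).
frames = torch.rand(64, 1, 84, 84)
actions = torch.randint(0, 2, (64,))

for epoch in range(5):
    logits = policy(frames)
    loss = loss_fn(logits, actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# At play time the controller "sees" the current frame and picks an action.
action = policy(torch.rand(1, 1, 84, 84)).argmax(dim=1).item()
print(action)
```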