Mrinal Verghese and Christopher Atkeson
This study uses various internet data sources to select template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data has typically been challenging because this data lacks physical information. We hypothesize that internet data, and foundation models trained on it, may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: large language models (LLMs), comparison to retrieved human video using features from a pretrained video encoder, and comparison to human video using learned optical flow features. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, that optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and that important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different tool-use cooking skills.
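To make the idea of feature-based template selection concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each template behavior and the retrieved human video have already been encoded as fixed-length feature vectors (for example, by an optical flow encoder), that cosine similarity is the comparison metric, and that an LLM contributes a shortlist of candidate templates that the feature comparison then ranks. This page does not specify any of these details. The template names and feature values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def select_template(human_feature, template_features, shortlist=None):
    """Pick the template whose feature best matches the human video feature.

    human_feature: feature vector of the human demonstration video.
    template_features: dict mapping template name -> feature vector.
    shortlist: optional set of template names (assumed here to come from
        an LLM); when given, only those templates are compared.
    """
    if shortlist is None:
        candidates = dict(template_features)
    else:
        candidates = {
            name: feat
            for name, feat in template_features.items()
            if name in shortlist
        }
    if not candidates:
        raise ValueError("no candidate templates to choose from")
    return max(
        candidates,
        key=lambda name: cosine_similarity(human_feature, candidates[name]),
    )


# Hypothetical template names and toy 3-D features, for illustration only.
templates = {
    "sawing": [1.0, 0.0, 0.2],
    "circular": [0.0, 1.0, 0.1],
    "linear_push": [0.9, 0.1, 0.9],
}
human = [0.8, 0.1, 0.7]

best_overall = select_template(human, templates)
best_in_shortlist = select_template(
    human, templates, shortlist={"sawing", "circular"}
)
```

In this toy setup the shortlist changes the outcome: the unrestricted comparison picks the globally closest feature, while the shortlisted comparison picks the closest feature among the candidates the (assumed) LLM step allowed. This is one simple way two sources of information could be combined.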
Below are videos of templates selected by each of the three template selection methods, as well as a method combining LLM selection with optical flow encoding selection. Note some videos are duplicates where multiple methods selected the same template.
Cut a Bell Pepper with a Knife
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Cut a Carrot with a Knife
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Cut a Cucumber with a Knife
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Cut a Mushroom with a Knife
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Peel a Carrot with a Peeler
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Peel a Cucumber with a Peeler
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Scrape a Cutting Board with a Bench Scraper
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Scrape a Cutting Board with a Knife
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Scrub a Cutting Board with a Sponge
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Scrub a Plate with a Sponge
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Slice a Pizza with a Pizza Cutter
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Spread Sauce with a Spoon
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Stir a Pan with a Spatula (Peppers)
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Stir a Pan with a Spatula (Sauce)
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Wipe a Cutting Board with a Towel
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow
Wipe a Plate with a Towel
LLM Selection
Pretrained Video Encoder
Optical Flow Encoder
LLM + Optical Flow