Semantic Mechanical Search with Large Vision and Language Models

Satvik Sharma*, Huang Huang*, Kaushik Shivakumar, Lawrence Y. Chen, Ryan Hoque, Brian Ichter, Ken Goldberg

*Equal contribution

Paper | BibTeX | Data

TL;DR: Creating semantic distributions with VLMs and LLMs for downstream mechanical search policies

SMS accepts as input a scene image and a desired target object. It applies an object detection or segmentation algorithm, combined with captioning when an object list is unavailable. SMS then uses an LLM to compute affinities between the detected objects and the target object, and it uses these affinities to output a semantic occupancy distribution that can be consumed by downstream mechanical search policies.
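As a rough illustration, the affinity-scoring step can be framed as a single LLM query over the detected labels. The prompt wording, function names, and response format below are our own assumptions, not the prompt used in the paper:

```python
# Illustrative sketch of querying an LLM for object-target affinities.
# The prompt text and the "label: score" response format are assumptions.

def build_affinity_prompt(detected_labels, target_object):
    """Build a prompt asking an LLM to score how likely the target is near each object."""
    lines = [
        f"The target object is '{target_object}'.",
        "For each object below, give a score from 0 to 1 indicating how likely",
        "the target object is to be found on, in, or near it.",
        "Respond with one 'label: score' pair per line.",
        "",
    ]
    lines += [f"- {label}" for label in detected_labels]
    return "\n".join(lines)


def parse_affinity_response(response, detected_labels):
    """Parse 'label: score' lines into a dict, defaulting to 0.0 for missing labels."""
    scores = {label: 0.0 for label in detected_labels}
    for line in response.splitlines():
        if ":" in line:
            label, _, value = line.partition(":")
            label = label.strip().lstrip("- ")
            if label in scores:
                try:
                    scores[label] = float(value.strip())
                except ValueError:
                    pass
    return scores


if __name__ == "__main__":
    labels = ["candles", "flowers", "matches", "soap"]
    print(build_affinity_prompt(labels, "incense sticks"))
    # A hypothetical LLM response to the prompt above:
    fake_response = "candles: 0.8\nflowers: 0.3\nmatches: 0.7\nsoap: 0.1"
    print(parse_affinity_response(fake_response, labels))
```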

We propose SMS, a framework that uses VLMs and LLMs to create a dense semantic distribution relating a scene to the target object, for use in downstream tasks. SMS first uses VLMs to perform scene understanding, producing mask-label pairs that densely describe all portions of the image. It then uses an LLM to generate affinity scores between the labels and the target object, and we spatially ground these affinities using the labels' corresponding masks. In this way, we densely represent the affinities between a target object and all parts of a scene using an LLM. SMS applies to two common situations: 1) a closed world, where all objects in the scene come from a known list, and 2) an open world, where some objects in the scene are previously unseen.
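A minimal sketch of the spatial-grounding step, assuming we already have binary masks and per-label affinity scores; the array shapes and the max-pooling of overlapping masks are our assumptions:

```python
import numpy as np

def ground_affinities(image_shape, masks, affinities, eps=1e-8):
    """Convert mask-label affinities into a normalized 2D semantic distribution.

    image_shape: (H, W)
    masks:       list of boolean arrays of shape (H, W), one per detected region
    affinities:  list of floats in [0, 1], one per region (from the LLM)
    """
    distribution = np.zeros(image_shape, dtype=np.float32)
    for mask, affinity in zip(masks, affinities):
        # Where regions overlap, keep the higher affinity (an assumption, not the paper's rule).
        distribution = np.maximum(distribution, affinity * mask.astype(np.float32))
    # Normalize so the map sums to 1 and can be treated as a probability distribution.
    return distribution / (distribution.sum() + eps)


# Toy usage: two regions with different affinities to the target object.
H, W = 4, 6
mask_a = np.zeros((H, W), dtype=bool); mask_a[:2, :3] = True
mask_b = np.zeros((H, W), dtype=bool); mask_b[2:, 3:] = True
print(ground_affinities((H, W), [mask_a, mask_b], [0.9, 0.2]).round(3))
```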

We investigate two questions: (1) given a downstream search policy, can a semantic distribution improve search performance? and (2) what is the best way to generate a semantic distribution?


We investigate question (2) first, since its answer provides the semantic distribution used by the downstream policy. We evaluate semantic distribution generation in both closed-world and open-world environments. In closed-world environments, we evaluate the quality of the affinity matrix from which the semantic distribution is generated for the given object list. In open-world environments, we evaluate the quality of the semantic distribution on a dataset of real-life scenes.
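In the closed-world setting, affinities for the known object list can be precomputed once as a matrix and reused across scenes. A sketch under our own assumptions (the scoring function is injected; in SMS it would be an LLM query such as the one sketched above):

```python
import numpy as np

def build_affinity_matrix(object_list, target_list, score_fn):
    """Build an |objects| x |targets| affinity matrix for a known object list.

    score_fn(obj, target) -> float in [0, 1]; in SMS this would be backed by an LLM.
    """
    matrix = np.zeros((len(object_list), len(target_list)), dtype=np.float32)
    for i, obj in enumerate(object_list):
        for j, target in enumerate(target_list):
            matrix[i, j] = score_fn(obj, target)
    return matrix


# Toy scoring function standing in for the LLM (illustrative values only).
toy_scores = {("mug", "coffee beans"): 0.8, ("stapler", "coffee beans"): 0.1}
score = lambda o, t: toy_scores.get((o, t), 0.0)
print(build_affinity_matrix(["mug", "stapler"], ["coffee beans"], score))
```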

Generating semantic distributions in open-world environments:

Four examples from the evaluation dataset, showing the 2D probability distributions generated by SMS-E and CLIP. The heatmaps mark high-probability regions for finding the target object in red and low-probability regions in blue (a rendering sketch follows the examples).

Top Left: An example of a grocery store, where the target object is “incense sticks.” CLIP highlights both near the candles and near the flowers as they are somewhat visually similar to sticks, while SMS-E only highlights the candles.

Bottom Left: An example of an office kitchen, where the target object is “cat food.” CLIP gets distracted by the refrigerator and only slightly highlights the cat sign.

Top Right: An example of a house, where the target object is “paddle.” CLIP incorrectly highlights the wooden panels along the walls, while SMS-E highlights the ping pong table.

Bottom Right: For the target word “microphone,” SMS-E highlights the box with the speaker but CLIP struggles as the objects are not visually similar.
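Overlays like the ones above can be reproduced, roughly, by blending the semantic distribution onto the scene image with a blue-to-red colormap; the colormap choice and blending weight below are our assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_heatmap(image, distribution, alpha=0.5):
    """Overlay a 2D probability distribution on an RGB image (red = high, blue = low)."""
    # Normalize the distribution to [0, 1] for colormapping.
    norm = (distribution - distribution.min()) / (np.ptp(distribution) + 1e-8)
    heat = plt.cm.jet(norm)[..., :3]  # RGBA -> RGB; 'jet' runs blue (low) to red (high)
    blended = (1 - alpha) * image.astype(np.float32) / 255.0 + alpha * heat
    plt.imshow(blended)
    plt.axis("off")
    plt.show()

# Toy usage with a random image and a peaked distribution.
rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)
dist = np.zeros((64, 64)); dist[20:30, 40:50] = 1.0
overlay_heatmap(img, dist)
```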

Given a semantic distribution, we investigate question (1) by conducting simulation and physical experiments in closed-world environments to evaluate the improvement in search performance brought by the semantic distribution. We combine the semantic distribution with an existing mechanical search policy, LAX-RAY.
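One plausible way to plug the semantic distribution into an occupancy-based policy such as LAX-RAY is to weight the policy's geometric occupancy distribution by the semantic distribution and renormalize; the elementwise product below is our assumption, not necessarily the exact combination used in the paper:

```python
import numpy as np

def combine_distributions(occupancy, semantic, eps=1e-8):
    """Fuse a geometric occupancy distribution with a semantic distribution.

    Both inputs are non-negative 2D arrays of the same shape; the output sums to 1
    and can be fed to a mechanical search policy in place of the occupancy map alone.
    """
    fused = occupancy * semantic  # elementwise weighting (our assumption)
    total = fused.sum()
    if total < eps:
        # If the supports do not overlap, fall back to the occupancy distribution.
        fused, total = occupancy, occupancy.sum() + eps
    return fused / total


# Toy usage: occupancy says "behind either box", semantics prefers the left box.
occupancy = np.array([[0.5, 0.5, 0.0]])
semantic = np.array([[0.9, 0.1, 0.0]])
print(combine_distributions(occupancy, semantic))  # ~[[0.9, 0.1, 0.0]]
```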

Simulation experiments (top right) and real-world shelf execution (bottom right).

BibTeX

@inproceedings{sharma2023semantic,
  title     = {Semantic Mechanical Search with Large Vision and Language Models},
  author    = {Satvik Sharma and Huang Huang and Kaushik Shivakumar and Lawrence Yunliang Chen and Ryan Hoque and Brian Ichter and Ken Goldberg},
  booktitle = {7th Annual Conference on Robot Learning},
  year      = {2023},
  url       = {https://openreview.net/forum?id=vsEWu6mMUhB}
}