S4: Semantic Spatial Search in Shelves
Satvik Sharma, Kaushik Shivakumar, Huang Huang, Ryan Hoque, Alishba Imran, Brian Ichter, Ken Goldberg
How can a robot efficiently extract a desired object from a shelf when it is fully occluded by other objects? Prior works propose geometric approaches to this problem but do not consider object semantics. Shelves in pharmacies, restaurant kitchens, and grocery stores are often organized so that semantically similar objects are placed close to one another. Can large language models (LLMs) serve as semantic knowledge sources to accelerate robotic mechanical search in semantically arranged environments? With Semantic Spatial Search on Shelves (S^4), we use LLMs to generate affinity matrices, whose entries correspond to the semantic likelihood of physical proximity between objects. We derive semantic spatial distributions by synthesizing semantics with learned geometric constraints. S^4 incorporates Optical Character Recognition (OCR) and semantic refinement with predictions from ViLD, an open-vocabulary object detection model. Simulation experiments suggest that semantic spatial search reduces search time relative to pure spatial search by an average of 24% across three domains: pharmacy, kitchen, and office shelves. A manually collected dataset of 100 semantic scenes suggests that OCR and semantic refinement improve object detection accuracy by 35%. Lastly, physical experiments on a pharmacy shelf suggest a 47.1% improvement over pure spatial search.
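The affinity-matrix idea can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `build_affinity_matrix`, `toy_score`, and the `TOY_CATEGORY` table are hypothetical names, and `toy_score` is a hand-written stand-in for the score an LLM query might return for a pair of object names.

```python
from itertools import combinations
from typing import Callable, List, Sequence

def build_affinity_matrix(
    names: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[List[float]]:
    """Return a symmetric matrix whose (i, j) entry scores how likely
    objects i and j are to be shelved near each other."""
    n = len(names)
    # Assumption: an object has maximal affinity (1.0) with itself.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Average both query orders so the result is symmetric even if
        # the scorer is order-sensitive.
        score = 0.5 * (score_fn(names[i], names[j]) + score_fn(names[j], names[i]))
        matrix[i][j] = matrix[j][i] = score
    return matrix

# Hypothetical stand-in for an LLM: same category -> high affinity.
TOY_CATEGORY = {
    "ibuprofen": "pain relief",
    "aspirin": "pain relief",
    "toothpaste": "oral care",
}

def toy_score(a: str, b: str) -> float:
    return 0.9 if TOY_CATEGORY[a] == TOY_CATEGORY[b] else 0.1

names = list(TOY_CATEGORY)
affinity = build_affinity_matrix(names, toy_score)
for name, row in zip(names, affinity):
    print(name, row)
```

Because the matrix depends only on object names, it can be built once ahead of time and then queried by row whenever a target object is specified.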
[Figure panels: Affinity Matrix Generation; Object Detection Refinement; Semantic Occupancy Distribution; Semantic Spatial Search on Shelves (S^4) Algorithm]
System overview of Semantic Spatial Search on Shelves (S^4). The affinity matrix is computed offline. Given an RGBD image, we use object detection combined with refinement to query the affinity matrix and construct a semantic occupancy distribution. We multiply this by a spatial occupancy distribution to use in a mechanical search policy.
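The step of combining the semantic and spatial distributions can be illustrated with a small sketch. Several simplifications here are assumptions and not taken from the paper: the shelf is reduced to a 1D row of bins, each detected object spreads its affinity to the target with a Gaussian of width `sigma`, and all function names are hypothetical. Only the final elementwise multiplication mirrors the description above.

```python
import math
from typing import Dict, List, Sequence, Tuple

def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def semantic_distribution(
    n_bins: int,
    detections: Sequence[Tuple[str, int]],
    affinity_to_target: Dict[str, float],
    sigma: float = 1.5,
) -> List[float]:
    """Spread each visible object's affinity to the target around its bin.
    The Gaussian spread is an illustrative assumption."""
    weights = [0.0] * n_bins
    for name, center in detections:
        a = affinity_to_target[name]
        for x in range(n_bins):
            weights[x] += a * math.exp(-0.5 * ((x - center) / sigma) ** 2)
    return normalize(weights)

def combine(semantic: Sequence[float], spatial: Sequence[float]) -> List[float]:
    """Elementwise product of the two distributions, renormalized."""
    return normalize([s * p for s, p in zip(semantic, spatial)])

# Toy scene: the target is "ibuprofen"; two objects are visible.
detections = [("aspirin", 2), ("toothpaste", 7)]
affinity_to_target = {"aspirin": 0.9, "toothpaste": 0.1}

# Toy spatial distribution: the target can only hide in bins 1-3 or 6-8.
spatial = normalize([0, 1, 1, 1, 0, 0, 1, 1, 1, 0])

semantic = semantic_distribution(10, detections, affinity_to_target)
combined = combine(semantic, spatial)
best_bin = max(range(len(combined)), key=combined.__getitem__)
print("most likely bin:", best_bin)
```

In this toy scene the spatial distribution alone cannot distinguish the two occluded regions, while the combined distribution concentrates its mass behind the semantically related object.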
In physical experiments, we observe that S^4 significantly accelerates mechanical search, reducing the average number of actions by 47.1% relative to pure spatial search. In physical experiments the depth image is noisy, whereas in simulation we have ground-truth depth information. As a result, the spatial distribution in simulation is strictly better than the one in physical experiments. This gap in the quality of the spatial distribution makes the semantic distribution more critical for identifying where a target object may lie in physical experiments. Thus, S^4 (PaLM) outperforms the spatial distribution by a larger margin in physical experiments (47.1%) than in the simulated pharmacy domain (32.5%).
Overall, the results suggest that S^4 (PaLM) can accelerate mechanical search compared to the spatial distribution in semantically arranged environments by 32.4%, 27.1%, and 12.3% in the pharmacy, kitchen, and office domains, respectively, while improving success rates.
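As a quick arithmetic check, the three per-domain reductions quoted here appear consistent with the roughly 24% average reported in the abstract. The snippet below uses only the numbers stated in the text; the variable names are illustrative.

```python
# Per-domain reductions in search time (percent), as quoted in the text.
reductions = {"pharmacy": 32.4, "kitchen": 27.1, "office": 12.3}

# Unweighted mean across the three domains.
mean_reduction = sum(reductions.values()) / len(reductions)
print(f"mean reduction: {mean_reduction:.1f}%")
```

The unweighted mean rounds to 24%, which matches the abstract if the reported average weights the three domains equally.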