Robot Vision


Robot vision enables robots to perceive scene structure, depth, objects, and traversable space from visual input, providing essential information for decision-making and control. Our lab conducts research on robot vision with a focus on spatial perception and simulator construction for embodied and autonomous systems. In particular, SPACE-CLIP is a lightweight monocular depth estimation framework that recovers geometric cues directly from a frozen CLIP vision encoder, without requiring a separate, heavy depth-specific backbone. This makes it possible to build a modular spatial perception block that can be integrated more easily into vision-language-action (VLA) models and robotic control pipelines.
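The core idea, keeping the vision encoder frozen and training only a small depth head on its patch features, can be illustrated with a minimal sketch. This is a hypothetical illustration, not SPACE-CLIP's actual code: the encoder is replaced by a stand-in that emits random patch features, and the head is a single linear map.

```python
import numpy as np

# Hypothetical sketch (not the actual SPACE-CLIP implementation):
# a frozen vision encoder yields one feature per image patch, and
# only a lightweight depth head on top of it is trained.

rng = np.random.default_rng(0)

def frozen_clip_features(image, patch=16, dim=512):
    """Stand-in for a frozen CLIP vision encoder: one feature per patch."""
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    return rng.standard_normal((gh, gw, dim))  # placeholder features

def depth_head(features, weights, bias):
    """Trainable head: linear map from patch features to per-patch depth."""
    depth = features @ weights + bias          # (gh, gw, 1)
    return np.maximum(depth[..., 0], 0.0)      # depth is non-negative

image = np.zeros((224, 224, 3))
feats = frozen_clip_features(image)            # (14, 14, 512), frozen
w = rng.standard_normal((512, 1)) * 0.01      # only the head is trained
b = np.zeros(1)
coarse_depth = depth_head(feats, w, b)         # (14, 14) per-patch depth
print(coarse_depth.shape)
```

Because the encoder's weights never change, the trainable parameter count stays small, which is what makes such a block easy to drop into a larger VLA or control pipeline.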

We also study NVSim, a framework that automatically constructs large-scale indoor simulators and navigation graphs from ordinary traversal image sequences, without expensive 3D scanning. By combining floor-aware Gaussian Splatting with mesh-free traversability checking, NVSim produces cleaner floor representations and more reliable estimates of navigable space for real robotic navigation tasks.
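A mesh-free traversability check of the kind described above can be sketched as a clearance test performed directly on a 3D point set, with no mesh reconstruction step. This is a hypothetical toy version, not NVSim's actual algorithm; the cell size, robot height, and clearance radius are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch (not NVSim's actual algorithm): a mesh-free
# traversability test that checks a candidate floor cell for obstacle
# clearance directly against 3D points, skipping mesh reconstruction.

def traversable(cell_xy, points, radius=0.3, floor_z=0.0, robot_h=1.5):
    """A cell is traversable if no point lies inside the robot's
    clearance cylinder above the floor plane."""
    dxy = np.linalg.norm(points[:, :2] - cell_xy, axis=1)
    in_cylinder = (
        (dxy < radius)
        & (points[:, 2] > floor_z + 0.05)   # ignore the floor itself
        & (points[:, 2] < floor_z + robot_h)
    )
    return not in_cylinder.any()

# Toy scene: a single obstacle column standing at (1.0, 0.0).
obstacle = np.column_stack([
    np.full(50, 1.0), np.full(50, 0.0), np.linspace(0.1, 1.0, 50)])

free_cell = traversable(np.array([0.0, 0.0]), obstacle)     # True
blocked_cell = traversable(np.array([1.0, 0.0]), obstacle)  # False
print(free_cell, blocked_cell)
```

Running this check over a grid of candidate cells and connecting adjacent traversable cells yields a navigation graph, which is the kind of navigable-space representation a simulator can hand to a real robotic navigation stack.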

Overall, our lab develops robot vision methods that improve depth understanding, spatial reasoning, and environment generation for more capable robotic perception and navigation.