Hybrid Retrieval of Structured Trajectories and Obstacles Using Vision-Language Models and Segmentation
This project aims to design a retrieval system that, given a query scene (partial trajectory + obstacles), finds the most structurally and visually similar stored scene from a database of generated trajectories.
Rather than relying solely on image appearance (which is dominated by large white backgrounds), we combine visual-semantic retrieval (using CLIP or DINO) with precise layout matching (using segmentation and obstacle structure extraction).
Typical visual encoders (like CLIP or DINO) struggle to differentiate scenes where:
Most of the image consists of white background,
Small obstacle variations matter greatly,
Precise layout (position and number of obstacles) is critical.
Thus, segmentation is introduced to extract the obstacle structures explicitly and guide retrieval based on layout similarity as well as visual similarity.
Data Generation:
Synthetic 2D images with random trajectories and randomly placed rectangular obstacles.
White background with obstacles plotted in gray.
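A minimal sketch of the generator, assuming matplotlib for rendering; the piecewise-linear trajectory model, coordinate ranges, and output size are illustrative choices rather than the project's exact settings:

```python
import numpy as np
import matplotlib.pyplot as plt

def generate_scene(path, n_obstacles=4, n_waypoints=6, seed=None):
    """Render one scene: gray rectangles plus a random trajectory on white."""
    rng = np.random.default_rng(seed)
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis("off")

    # Randomly placed rectangular obstacles, plotted in gray.
    for _ in range(n_obstacles):
        x, y = rng.uniform(0.05, 0.8, size=2)
        w, h = rng.uniform(0.05, 0.15, size=2)
        ax.add_patch(plt.Rectangle((x, y), w, h, color="gray"))

    # A random piecewise-linear trajectory through a few waypoints.
    pts = rng.uniform(0, 1, size=(n_waypoints, 2))
    ax.plot(pts[:, 0], pts[:, 1], color="black", linewidth=2)

    fig.savefig(path, dpi=100, facecolor="white",
                bbox_inches="tight", pad_inches=0)
    plt.close(fig)

generate_scene("scene_000.png", seed=0)
```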
Obstacle Segmentation (OpenCV):
Simple thresholding (cv2.threshold) separates obstacles (gray) from background (white).
Binary masks are created for each image.
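A minimal sketch of that step; the 200-level cutoff, and the 100-180 gray band mentioned in the comment, are assumptions about the rendered intensities:

```python
import cv2

def obstacle_mask(image_path, thresh=200):
    """Binary mask: pixels darker than `thresh` become 255, background 0."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY_INV)
    # If the trajectory is drawn darker than the obstacles, a band threshold
    # such as cv2.inRange(img, 100, 180) isolates only the gray obstacles.
    return mask
```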
Obstacle Feature Extraction:
Centers of obstacles are computed from segmentation masks.
For each image, a list of obstacle centers is stored.
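One way to extract the centers is connected-component analysis over the binary mask; the 50-pixel minimum-area filter (to discard stray fragments) is an illustrative choice:

```python
import cv2

def obstacle_centers(mask, min_area=50):
    """Return (x, y) centroids of obstacle blobs in a binary mask.
    Component 0 is the background and is skipped."""
    n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    centers = []
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:  # drop tiny components
            centers.append(tuple(centroids[i]))
    return centers
```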
Visual Embedding:
Either a CLIP (ViT-L/14) or a DINO (ViT-B/16) model is used to extract semantic embeddings of the full images.
Embeddings are normalized and stored.
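A sketch of the CLIP path, using OpenAI's `clip` package; the DINO (ViT-B/16) variant would follow the same load-encode-normalize pattern through `torch.hub` instead:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def embed_image(image_path):
    """L2-normalized CLIP embedding of a full scene image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    feat = feat / feat.norm(dim=-1, keepdim=True)  # unit norm for cosine sim
    return feat.squeeze(0).cpu().numpy()
```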
Layout Similarity Calculation:
For a query image, obstacle centers are extracted.
A center-distance matching score measures how closely the query layout matches each stored scene; distances are normalized, and the score is penalized for mismatches in obstacle count, as sketched below.
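One possible matching scheme is optimal one-to-one assignment between the two center sets (the Hungarian algorithm via SciPy); the image-diagonal normalization and the per-obstacle count penalty of 0.2 are assumed parameters, not the project's exact formula:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def layout_similarity(centers_a, centers_b, img_diag=1.0, count_penalty=0.2):
    """Score in [0, 1]: 1 minus the mean matched center distance (normalized
    by the image diagonal), minus a penalty per unmatched obstacle."""
    a = np.asarray(centers_a, dtype=float)
    b = np.asarray(centers_b, dtype=float)
    if len(a) == 0 or len(b) == 0:
        return 0.0
    # Pairwise center distances, then optimal one-to-one assignment.
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) / img_diag
    rows, cols = linear_sum_assignment(dists)
    mean_dist = dists[rows, cols].mean()
    mismatch = abs(len(a) - len(b))  # obstacle-count penalty
    return max(0.0, 1.0 - mean_dist - count_penalty * mismatch)
```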
Hybrid Retrieval:
Final retrieval score = weighted sum of visual similarity (cosine similarity between embeddings) and layout similarity (center matching score).
The most similar scene is retrieved based on the final combined score.
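In symbols, score = alpha * visual_similarity + (1 - alpha) * layout_similarity. A minimal sketch of the ranking step, reusing layout_similarity from the previous snippet; the weight alpha = 0.6 and the dict-based database schema are assumptions, not the project's tuned settings:

```python
import numpy as np

def hybrid_retrieve(query_emb, query_centers, db, alpha=0.6, img_diag=1.0):
    """Rank stored scenes by alpha * visual + (1 - alpha) * layout similarity.
    `db` is a list of dicts with keys 'emb' (unit-norm numpy vector) and
    'centers' (list of obstacle centers); this schema is an assumption."""
    scores = []
    for entry in db:
        visual = float(np.dot(query_emb, entry["emb"]))  # cosine (unit vectors)
        layout = layout_similarity(query_centers, entry["centers"], img_diag)
        scores.append(alpha * visual + (1 - alpha) * layout)
    ranking = np.argsort(scores)[::-1]  # indices sorted best-first
    return ranking, scores
```

Here ranking[0] is the retrieved scene and ranking[-1] the least similar one, which feed directly into the visualization step below.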
Visualization:
The system displays the query image alongside the most similar and least similar retrieved scenes for qualitative evaluation.
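A minimal matplotlib sketch of that display; the three-panel layout and file-path inputs are illustrative:

```python
import cv2
import matplotlib.pyplot as plt

def show_results(query_path, best_path, worst_path):
    """Show the query next to the best and worst retrieved scenes."""
    panels = [("Query", query_path),
              ("Most similar", best_path),
              ("Least similar", worst_path)]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (title, path) in zip(axes, panels):
        ax.imshow(cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB))
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```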