How often do you find yourself overwhelmed while grocery shopping, navigating aisles filled with countless choices, reading tiny labels, or struggling to find the best deal?
What if machine learning could step in as your personal shopping assistant?
Imagine stepping into a world where your surroundings come to life in real-time, with your AR glasses acting as the ultimate personal assistant.
These glasses can segment the shelves, highlighting vegan products, sale items, or ingredients you're allergic to. It’s like having a shopping buddy who knows your shopping list by heart, saving you time and keeping you healthy without any effort.
Let’s dive into how this futuristic shopping experience works by exploring different segmentation models: CNNs, Mask R-CNN, Vision Transformers, MaskFormer, and SAM.
But first, what is image segmentation?
Image segmentation is the process of dividing an image into distinct regions or segments, each corresponding to a meaningful object or scene element.
There are three main types of image segmentation -
1. Semantic Segmentation -
What? Assigns a label to each pixel in an image, indicating which object or class it belongs to.
Why? To categorize the entire image into meaningful regions.
Example - Segmenting a street scene image into different objects like roads, sidewalks, buildings, and vehicles.
2. Instance Segmentation -
What? Identifies and segments each individual instance of an object within an image.
Why? To distinguish individual objects of the same class.
Example - Segmenting an image of a crowded street to count and localize individual vehicles.
3. Panoptic Segmentation -
What? Combines semantic and instance segmentation to provide both object labels and instance-level identification.
Why? To create a dense segmentation map that labels each pixel with both a semantic class and an instance ID, allowing for accurate object counting and localization.
Example - Segmenting an image with 4 people and 4 glasses individually.
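To make the three output formats concrete, here is a minimal numpy sketch of how each one represents the same toy scene. The 4x4 "image", the class ids, and the object layout are all made up for illustration; real models produce the same kinds of arrays at full image resolution.

```python
import numpy as np

# Toy 4x4 scene: one "person" (class 1) and two "glass" objects (class 2).

# 1) Semantic segmentation: one class label per pixel; both glasses share label 2.
semantic = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [2, 0, 0, 2],
    [2, 0, 0, 2],
])

# 2) Instance segmentation: one binary mask per object, so the glasses stay separate.
instances = {
    "person_0": semantic == 1,
    "glass_0":  np.zeros((4, 4), bool),
    "glass_1":  np.zeros((4, 4), bool),
}
instances["glass_0"][2:, 0] = True
instances["glass_1"][2:, 3] = True

# 3) Panoptic segmentation: every pixel carries (class_id, instance_id).
panoptic = np.stack([semantic, np.zeros_like(semantic)])
panoptic[1][instances["glass_1"]] = 1   # second glass gets instance id 1

print(len(instances))       # 3 individual objects
print(np.unique(semantic))  # [0 1 2] -> 3 semantic classes (incl. background)
```

Note how the panoptic map alone is enough to both name and count objects: the class channel says "glass", the instance channel tells the two glasses apart.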
Convolutional Neural Networks (CNNs) - The Basic Grocery Scanner
Before the emergence of Vision Transformers, CNN-based architectures such as U-Net were the go-to choice for image segmentation.
CNNs work by scanning small patches of the image, identifying patterns, and piecing them together to label items. They extract features from the region of interest (ROI) defined by a bounding box, which are then fed into a fully convolutional network (FCN) to perform instance segmentation.
- Quick Identification: CNNs are fast and efficient at recognizing common items.
- Feature Extraction: Good for identifying objects or general categories.
- Lacks Precision: CNNs can tell you what’s on the shelf but can’t precisely pinpoint details.
- Spatial Limitation: Focuses on small receptive field, so it might miss the bigger picture.
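Both the strength and the limitation above come from the convolution itself: each output value is computed from a small local patch. Here is a minimal sketch of a single 2D convolution (the edge kernel and the toy "shelf" image are illustrative, not from any real model):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: each output pixel sees only a small local patch."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge kernel: fires where intensity jumps from left to right.
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])

# Toy "shelf" image: dark on the left, bright on the right.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

response = conv2d(img, edge_kernel)
print(response[0])   # [0. 3. 3. 0.] -- strong only where the edge sits
```

The kernel fires only where its 3x3 window straddles the edge, which illustrates the "small receptive field" point: a single layer has no way to relate items on opposite ends of the shelf. Deep CNNs stack many such layers to grow the receptive field.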
Mask R-CNN - The Detailed Detective of Aisles
Mask R-CNN extends Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding-box recognition.
Faster R-CNN is a region-based convolutional neural network that returns a bounding box for each object, along with its class label and a confidence score.
Mask R-CNN is the detailed detective that not only identifies items but also draws precise boundaries around them, highlighting products down to the exact spot on the shelf.
- Precise Segmentation: Draws accurate masks around items, making it easy to locate specific products.
- Handles Overlapping Items: Perfect for busy shelves where products are stacked closely.
- Resource-Intensive: Needs more computational power, making it slower for real-time processing.
- Complex Setup: Requires a more complex setup and tuning.
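The two-branch idea can be sketched in a few lines. This is a toy stand-in, not Mask R-CNN itself: the boxes are assumed to come from the detection branch, the crop plays the role of RoIAlign, and a simple threshold stands in for the small FCN mask head. All values here are made up for illustration.

```python
import numpy as np

def roi_mask(features, box, thresh=0.5):
    """Toy mask branch: crop the ROI from the feature map (simplified
    RoIAlign), then predict a binary mask per pixel inside the ROI.
    A real mask head is a small FCN; a threshold plays that role here."""
    x0, y0, x1, y1 = box
    roi = features[y0:y1, x0:x1]
    return roi > thresh

# Fake feature map: high activations where two stacked "products" sit.
features = np.zeros((8, 8))
features[1:4, 1:4] = 0.9   # product A
features[4:7, 2:5] = 0.8   # product B, overlapping A's column range

# One box per detected instance (as if from the Faster R-CNN branch).
boxes = [(1, 1, 4, 4), (2, 4, 5, 7)]
masks = [roi_mask(features, b) for b in boxes]
print([m.sum() for m in masks])   # pixels per instance mask: [9, 9]
```

Because each mask is predicted inside its own box, overlapping products on a busy shelf still get separate, per-instance masks.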
Vision Transformers (ViT) - The Big-Picture Grocery Expert
The idea behind ViT came from transformers used for text. Instead of a sequence of word tokens, the input is a sequence of image patches; the transformer attends over all patches at once and directly predicts class labels for the image.
Vision Transformers are like the shopping expert who understands the entire layout of the store. They analyze the whole scene, identifying relationships between items across aisles.
Next time, if you need to find the perfect combination of pasta, sauce, and cheese all at once, think of ViT!
- Global Context Awareness: Considers the entire image at once. For segmentation, this global perspective can lead to a more accurate delineation of objects.
- Highly Flexible: Adapts well to various types of scenes, simplifying the segmentation process.
- High Computational Cost: Requires a lot of resources, making it challenging to run on everyday AR devices.
- Sensitive to Training Quality: Needs extensive data and proper tuning to work effectively.
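The patch-as-token idea is easy to see in code. Below is a minimal sketch of ViT-style tokenization; the patch size and embedding dimensions are arbitrary choices for illustration, and the random matrix stands in for the learned linear projection.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an image into non-overlapping patches and flatten each one,
    mirroring ViT's tokenization of an image into a 'sentence' of patches."""
    h, w = image.shape
    rows = [image[i:i+patch, j:j+patch].ravel()
            for i in range(0, h, patch)
            for j in range(0, w, patch)]
    return np.stack(rows)

image = np.arange(8 * 8, dtype=float).reshape(8, 8)
tokens = patchify(image, patch=4)
print(tokens.shape)   # (4, 16): 4 patch tokens, each a 16-dim vector

# A learned linear projection maps each token to the model dimension;
# self-attention then lets every patch attend to every other patch --
# the global context that convolutions lack.
proj = np.random.randn(16, 32)   # toy embedding matrix (hypothetical dims)
embedded = tokens @ proj
print(embedded.shape)            # (4, 32)
```

Since attention connects all patches from layer one, the pasta patch can "see" the sauce and cheese patches immediately, rather than waiting for a receptive field to grow layer by layer.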
MaskFormer - The All-in-One Grocery Artist
The authors argue that “per-pixel classification is not enough”, so they proposed a tri-module mask classification model that predicts a set of binary masks, each associated with a single global class-label prediction.
The three modules are a pixel-level module, a transformer decoder, and a segmentation module. (Masking + Transformer = MaskFormer)
MaskFormer is like the all-in-one shopping artist who paints a complete picture, segmenting products while also understanding the scene's context.
It’s like having an assistant that could suggest a meal plan based on the ingredients in front of you. MaskFormer combines object detection with segmentation, offering a unified approach that balances detail with efficiency.
- Unified Segmentation: Combines object and scene segmentation, making it highly versatile.
- Handles Complex Scenarios: Excels in environments with overlapping items and mixed contexts.
- Moderately Resource-Intensive: While not as demanding as Vision Transformers, still requires good computational power.
- May Miss Tiny Details: Can struggle with very small or intricate objects.
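The "binary masks + one class label each" idea can be sketched with toy numbers. The query count, image size, and class names below are invented for illustration; the inference rule, though, is the one MaskFormer uses to turn mask classification into a semantic map: sum each query's mask probability weighted by its class probability, then take the per-pixel argmax.

```python
import numpy as np

# Toy MaskFormer-style output: N=2 mask queries over a 4x4 image.
# Each query predicts a soft binary mask plus a distribution over classes.
mask_probs = np.zeros((2, 4, 4))
mask_probs[0, :2, :] = 0.9          # query 0 covers the top half
mask_probs[1, 2:, :] = 0.8          # query 1 covers the bottom half

class_probs = np.array([
    [0.1, 0.9, 0.0],                # query 0: mostly class 1 ("pasta")
    [0.1, 0.0, 0.9],                # query 1: mostly class 2 ("sauce")
])

# Per-pixel class scores: sum over queries of mask_prob * class_prob,
# then argmax over classes to get the semantic map.
scores = np.einsum("qhw,qc->chw", mask_probs, class_probs)
semantic_map = scores.argmax(axis=0)
print(semantic_map[0, 0], semantic_map[3, 0])   # 1 2
```

Because the same set of masks can also be kept separate per query, one model serves semantic, instance, and panoptic segmentation, which is where the "all-in-one" versatility comes from.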
SAM (Segment Anything Model) - The Chameleon That Adapts to Any Shopping Need
Finally, SAM is like your ultimate assistant. Its power comes from its architecture, which has three parts: an image encoder, a prompt encoder, and a mask decoder.
The image encoder first creates features to analyze. The prompt encoder then adds context, focusing on the provided input. Finally, the mask decoder uses this combined information to segment the image accurately.
SAM can segment anything you encounter, from common groceries to niche products, making your shopping experience highly personalized.
- Highly Adaptive: Can segment any item with minimal retraining or prompts.
- Fast and Flexible: Designed to work efficiently across various environments and tasks.
- Generalist Approach: While versatile, it might not always be the best at very specific tasks.
- Needs Clear Prompts: Requires well-defined prompts to deliver precise segmentation, which can be tricky in ambiguous cases.
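SAM's defining feature is its promptable interface: image plus prompt in, mask out. As a toy stand-in (real SAM uses the learned encoders and decoder described above, not flood fill), the sketch below uses a point prompt to pick out one connected region, which also shows why an ambiguous click can be tricky:

```python
from collections import deque
import numpy as np

def segment_from_point(image, seed, tol=0.1):
    """Toy promptable segmenter: a click (point prompt) selects the
    connected region of similar intensity around it. The interface --
    image + prompt in, mask out -- mirrors SAM's; the mechanics do not."""
    h, w = image.shape
    mask = np.zeros((h, w), bool)
    target = image[seed]
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        if mask[y, x] or abs(image[y, x] - target) > tol:
            continue
        mask[y, x] = True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                queue.append((ny, nx))
    return mask

# Two look-alike products; the prompt disambiguates which one you mean.
shelf = np.zeros((6, 6))
shelf[1:3, 1:3] = 1.0   # product A
shelf[4:6, 4:6] = 1.0   # product B
mask = segment_from_point(shelf, seed=(1, 1))
print(mask.sum())       # 4: only product A, not B
```

A click inside product A returns only A's pixels even though B looks identical, which is exactly the kind of disambiguation a prompt provides; a click on the boundary between them would be the ambiguous case noted above.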
With segmentation models, your AR glasses aren’t just a fancy accessory—they’re the ultimate shopping assistant, turning a mundane task into a streamlined, personalized experience.
So next time you're in the store, imagine your AR glasses not just showing you what’s on the shelves but actively guiding you to make the best choices, saving you time and effort.
With segmentation models at your side, grocery shopping will never be the same again!
Get in touch at jain.van@northeastern.edu