Accuracy: Identifying and classifying waste items to minimize sorting errors.
Flexibility: Ability to handle a wide range of waste types, including varying shapes, sizes, and materials.
Reliability: Consistent performance under different environmental conditions and usage scenarios.
Collision-Free Grasping: Approach items from above to minimize interference with neighboring objects.
Efficient Sorting: Quickly classify and move items to their proper bins.
Perception and Classification:
A RealSense camera captures images of the workspace.
These images are sent to a remote GPU-enabled environment (Google Colab) running an open-vocabulary object detection model, which returns both a classification (recyclable or non-recyclable) and a 2D bounding box around the detected object.
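The response handling can be sketched as follows. This is a minimal illustration, assuming the remote service returns a JSON-like payload with a text label and a pixel-space bounding box; the label set and response keys here are hypothetical, not the actual Colab service's schema.

```python
# Hypothetical label set mapped to "recyclable"; everything else is non-recyclable.
RECYCLABLE_LABELS = {"bottle", "can", "paper", "cardboard"}

def parse_detection(response: dict) -> dict:
    """Parse an assumed {'label': str, 'bbox': [x1, y1, x2, y2]} detection
    and return the sorting category plus the bbox center in pixels."""
    label = response["label"]
    x1, y1, x2, y2 = response["bbox"]
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    category = "recyclable" if label in RECYCLABLE_LABELS else "non-recyclable"
    return {"category": category, "center_px": center}

result = parse_detection({"label": "bottle", "bbox": [100, 50, 200, 150]})
print(result)  # {'category': 'recyclable', 'center_px': (150.0, 100.0)}
```

The bbox center computed here is what gets handed to the localization step below.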
Coordinate Transformation and Localization:
AR tags are placed in the scene to provide a reference for mapping 2D bounding box coordinates into the robot’s 3D workspace.
Instead of relying on a full depth measurement, we use a fixed-height approach: the bounding box center is projected into the robot's base frame at a predefined height (Z-position). This simplifies depth estimation and avoids the complex calibration required for precise depth sensing.
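A minimal sketch of the fixed-height projection is below. The intrinsics (FX, FY, CX, CY), the camera-to-table depth Z_TABLE, the grasp height GRASP_Z, and the camera-to-base transform T_BASE_CAM are all illustrative placeholders; in the real system the transform comes from the AR-tag calibration.

```python
import numpy as np

# Assumed pinhole intrinsics (focal lengths and principal point, in pixels).
FX, FY = 615.0, 615.0
CX, CY = 320.0, 240.0
Z_TABLE = 0.80   # assumed fixed depth from camera to table plane, metres
GRASP_Z = 0.05   # predefined pick height in the robot base frame, metres

# Placeholder: the AR-tag calibration supplies the real camera-to-base transform.
T_BASE_CAM = np.eye(4)

def pixel_to_base(u: float, v: float) -> np.ndarray:
    """Back-project a bbox-center pixel onto the table plane at the fixed
    depth, map it into the base frame, and clamp Z to the grasp height."""
    p_cam = np.array([(u - CX) * Z_TABLE / FX,
                      (v - CY) * Z_TABLE / FY,
                      Z_TABLE, 1.0])
    p_base = T_BASE_CAM @ p_cam
    # Z is overridden with the predefined height rather than a depth reading.
    return np.array([p_base[0], p_base[1], GRASP_Z])
```

With an identity transform, a pixel at the principal point maps to (0, 0, GRASP_Z), which is a quick sanity check on the projection.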
Motion Strategy:
To approach the object safely, the robotic arm first moves to a position directly above the item at a known safe height, ensuring it does not collide with other objects in the scene.
After positioning above the target, the arm moves straight down to grasp the item.
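The two-phase approach above can be sketched as a short helper. The `arm` object and its `move_to` / `close_gripper` methods are placeholders for the actual arm controller API, and SAFE_Z is an assumed hover height.

```python
SAFE_Z = 0.25  # assumed collision-free hover height above the workspace, metres

def approach_and_grasp(arm, target_xyz):
    """Top-down grasp in two phases: hover over the target at a safe height,
    then descend straight down and close the gripper.
    `arm` is any controller exposing move_to(x, y, z) and close_gripper()."""
    x, y, z = target_xyz
    arm.move_to(x, y, SAFE_Z)  # phase 1: lateral travel at safe height
    arm.move_to(x, y, z)       # phase 2: pure vertical descent onto the item
    arm.close_gripper()
```

Keeping all lateral motion at SAFE_Z is what gives the collision-free property with respect to neighboring objects.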
Sorting Action:
Once the item is grasped, the robot uses the model's classification (recyclable or non-recyclable) to select the destination bin.
The robot then moves the item to the corresponding bin for proper sorting.
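The bin selection step reduces to a small lookup. The bin poses below are hypothetical base-frame coordinates, not the real drop-off positions.

```python
# Hypothetical drop-off poses (x, y, z) in the robot base frame, metres.
BIN_POSES = {
    "recyclable": (0.40, 0.30, 0.20),
    "non-recyclable": (0.40, -0.30, 0.20),
}

def bin_pose_for(category: str):
    """Map the model's classification to the corresponding bin pose."""
    if category not in BIN_POSES:
        raise ValueError(f"unknown category: {category}")
    return BIN_POSES[category]
```

Raising on an unknown label is a deliberate choice: a misrouted item is worse than a paused cycle that an operator can inspect.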
Fixed Height vs. Depth Sensing:
We initially considered using Dex-Net for advanced grasp planning, but due to the camera angle and inconsistent depth information we opted for a fixed-height pick strategy. This reduced complexity and increased reliability, but limited dynamic adaptation to objects of varying heights.
Cloud Processing vs. Local Processing:
Using Google Colab’s GPU environment allowed us to run complex models without local hardware constraints. However, this introduced communication latency. We chose this approach to leverage state-of-the-art object detection models while accepting minor delays in classification.
AR Tags for Localization:
AR tags provided a straightforward method to establish a consistent reference frame. This eased the coordinate transformation process at the cost of needing additional calibration and ensuring tags remain visible and fixed in place.
Robustness: The fixed-height approach and AR tag-based localization simplify the system, reducing failure points.
Durability: Fewer complex calibration procedures mean less frequent adjustments, potentially improving long-term durability.
Efficiency: While waiting for remote classification slows down the process slightly, the simplified approach to grasping reduces mechanical and computational overhead, helping maintain consistent operation.