Computer-vision-based object manipulation:
In order for the robot to detect common kitchen utensils, and more specifically to estimate the pose the robot must assume to interact with each object, we plan to use a computer-vision-based approach for fully autonomous operation. Common vision models such as YOLO can estimate what an object is and where it is in 2D using a bounding box, but they do not provide full 3D position and rotation information. However, with prior knowledge of what the objects are and what height, width, or radius they have, it is possible to estimate their 3D position using learning-free computer vision techniques. We plan to base our method largely on a GitHub repository we found that demonstrates visual servoing toward a cube with an ArUco marker or a tennis ball. For this to work, we plan to hardcode the dimensions of some common kitchen objects, such as cups or plates of a known size.
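As a rough sketch of this learning-free idea, the function below recovers a 3D position in the camera frame from a YOLO-style bounding box using the pinhole camera model. The function name and the `KNOWN_HEIGHTS` table of hardcoded sizes are our own illustrative choices, not taken from the repository, and a calibrated camera is assumed:

```python
import numpy as np

# Hypothetical hardcoded sizes (meters) for common kitchen objects.
KNOWN_HEIGHTS = {"cup": 0.10, "plate": 0.02, "salt shaker": 0.12}

def estimate_3d_position(bbox, real_height, camera_matrix):
    """Estimate an object's 3D position in the camera frame from its 2D
    bounding box (u_min, v_min, u_max, v_max) and known physical height.

    camera_matrix is the 3x3 intrinsic matrix from camera calibration.
    """
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx, cy = camera_matrix[0, 2], camera_matrix[1, 2]
    u_min, v_min, u_max, v_max = bbox

    # Depth from similar triangles: Z = f * real height / pixel height.
    z = fy * real_height / (v_max - v_min)

    # Back-project the box center to metric X, Y at that depth.
    u_c, v_c = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0
    return np.array([(u_c - cx) * z / fx, (v_c - cy) * z / fy, z])
```

For example, a detected "cup" bounding box would be paired with `KNOWN_HEIGHTS["cup"]` before calling the function.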
Manipulation:
In order for the robot to stably pick up specific items in the kitchen (e.g. spoons, forks, salt shakers, cooking ingredients) and place them at a user-specified location, the various sizes and shapes of the items must be considered. To stably grasp thin and small items that are difficult to pick up with the robot's current gripper, a smaller gripper can be 3D printed and installed, or an adhesive material can be attached to enable lifting of small objects. When placing an object at a user-specified location, the arm's path must be planned so that the placement minimizes interference with surrounding objects, as sketched below.
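The source does not specify a motion-planning stack; assuming a ROS robot, one common option is MoveIt's Python interface, which plans around obstacles registered in its planning scene. The planning group name "arm" and the obstacle bookkeeping below are assumptions:

```python
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("place_object")
scene = moveit_commander.PlanningSceneInterface()
arm = moveit_commander.MoveGroupCommander("arm")  # group name is an assumption

def place_with_clearance(place_pose, nearby_objects):
    """Plan and execute a placement that routes around surrounding objects.

    place_pose: geometry_msgs PoseStamped for the gripper at the drop point.
    nearby_objects: {name: (PoseStamped, (x, y, z) box size)} obstacles.
    """
    for name, (pose, size) in nearby_objects.items():
        scene.add_box(name, pose, size=size)  # register each obstacle
    rospy.sleep(1.0)  # give the planning scene time to update

    arm.set_pose_target(place_pose)
    success = arm.go(wait=True)  # plan a collision-free path and execute it
    arm.stop()
    arm.clear_pose_targets()
    return success
```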
Interaction with the user:
The robot communicates with the blind user through voice. When the robot receives a voice command to find an object and return it to its original place, it identifies the target object from the recognized words (e.g. spoons, forks, salt shakers, cooking ingredients) and asks the user a confirmation question to verify that it understood the request correctly. Before starting the search, the robot announces the start of the task by voice so the user knows to stay out of the kitchen. During the task, if a person is detected in the kitchen, the robot informs them that a task is in progress so they can keep themselves safe. After the object is returned to its original place, the robot announces by voice that the task is complete.
If the robot cannot detect the object after searching the entire kitchen, or cannot return to a landmark, it informs the user of the situation by voice and requests assistance. If the robot does not understand the user's command, or the command is beyond the robot's capabilities, it notifies the user, as in the dialogue sketch below.
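The source does not name a speech stack; as one possibility, the confirmation loop could be built on the `speech_recognition` and `pyttsx3` Python packages. The `KNOWN_OBJECTS` vocabulary and all spoken phrasing below are illustrative assumptions:

```python
import speech_recognition as sr
import pyttsx3

# Illustrative vocabulary; the real list would mirror the hardcoded objects.
KNOWN_OBJECTS = ("spoon", "fork", "salt shaker")

tts = pyttsx3.init()
recognizer = sr.Recognizer()

def say(text):
    """Speak a status message aloud to the user."""
    tts.say(text)
    tts.runAndWait()

def listen():
    """Record one utterance and return it as lowercase text ('' on failure)."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio).lower()
    except (sr.UnknownValueError, sr.RequestError):
        return ""

def get_confirmed_target():
    """Ask for a command, confirm the parsed object, announce the task start."""
    say("What should I bring back?")
    command = listen()
    target = next((obj for obj in KNOWN_OBJECTS if obj in command), None)
    if target is None:
        say("Sorry, I did not understand that, or it is beyond my abilities.")
        return None
    say("Did you ask for the " + target + "? Please answer yes or no.")
    if "yes" not in listen():
        return None
    say("Starting the task now. Please stay out of the kitchen.")
    return target
```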
Environment:
In order for the robot to navigate and manipulate objects smoothly, the robot's path and landmarks must be kept clear.
The entire kitchen must have enough space for the robot to move, and any obstacles in its path must be removed before the robot can work.
The landmarks must be placed where the robot's camera can detect them, and ArUco markers are attached to the landmarks so that the robot can clearly recognize where to put the object.
There must be no liquid around the robot, so any spilled liquid in the robot's path must be removed before it begins work.
ArUco Markers:
As an alternative perception method to enhance the reliability of the system, we propose attaching ArUco markers to both the target objects and their designated placement locations in the environment. For each object the robot needs to manipulate, one marker will be affixed directly onto the object, and a second marker will be placed at the object's typical resting location, determined from common placement habits of blind users (e.g., mugs near the sink or toothbrushes by the bathroom counter). This lets the robot detect and distinguish between objects using unique marker IDs, identify which item to grasp, and recognize the correct spot for returning it. The method is feasible because it requires minimal hardware modification and can be implemented in structured home environments with moderate setup effort.
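OpenCV's contrib aruco module can implement this scheme directly. The sketch below uses the classic API (names differ slightly across OpenCV versions); the marker length, dictionary choice, and the ID-to-object assignment are our own assumptions:

```python
import cv2

# Classic contrib aruco API (opencv-contrib-python); OpenCV 4.7+ renames
# these (e.g. getPredefinedDictionary / ArucoDetector).
DICTIONARY = cv2.aruco.Dictionary_get(cv2.aruco.DICT_4X4_50)
MARKER_LENGTH = 0.04  # marker side length in meters (our assumption)

# Hypothetical ID scheme: one ID per object, a paired ID for its resting spot.
OBJECT_IDS = {0: "mug", 2: "salt shaker"}
HOME_IDS = {1: "mug home", 3: "salt shaker home"}

def detect_marker_poses(frame, camera_matrix, dist_coeffs):
    """Return {marker_id: (rvec, tvec)} poses in the camera frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _rejected = cv2.aruco.detectMarkers(gray, DICTIONARY)
    poses = {}
    if ids is not None:
        rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
            corners, MARKER_LENGTH, camera_matrix, dist_coeffs)
        for marker_id, rvec, tvec in zip(ids.flatten(), rvecs, tvecs):
            poses[int(marker_id)] = (rvec, tvec)
    return poses
```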
Human-in-the-loop/interactive perception:
- In order to perceive an object, a human can look through the robot's camera and select the part of the screen where the object is, directing the robot toward it in the process (see the selection sketch after this list). The robot would spin slowly in a circle, allowing the human to see all potential places where the object may be. After grabbing the object, the robot will notify the human and spin around again, giving them a chance to indicate where the retrieved object should be placed.
- The human would most likely be a remote caretaker (if they were physically present, they could simply grab the object themselves). This is feasible: without the robot's camera feed, the object would be hard to identify, while for the remote caretaker it takes little effort to point out where the object is and where to put it.
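A minimal sketch of the selection step, assuming the caretaker sees the camera stream in an OpenCV window (OpenCV's built-in `selectROI` is one simple option; streaming the window to a remote machine is left out):

```python
import cv2

def ask_operator_for_target(frame):
    """Show one camera frame to the remote caretaker; they drag a box around
    the object and press Enter. Returns the box center in pixels, or None."""
    x, y, w, h = cv2.selectROI("Select the object", frame, showCrosshair=True)
    cv2.destroyWindow("Select the object")
    if w == 0 or h == 0:
        return None  # the operator skipped this view; keep spinning
    return (x + w // 2, y + h // 2)
```

The robot would call this on frames captured as it spins; the returned pixel center then seeds the visual-servoing behavior described above.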