Unsupervised Object Detection exploits the fact that objects move independently of one another in order to identify objects without labels. Robots are valuable for this Computer Vision research pathway because they can interact with objects in their environment, manipulating environment variables to understand their independent movements.
To support this research pathway, we developed:
A task environment for a robot to pick up and move objects to goal locations;
A central ROS-based algorithm to coordinate an environment engagement routine, deployed on a Quadruped Unmanned Ground Vehicle (UGV);
An extended Large Language Model (LLM) integration to advance associated research in robust, assured and trustworthy autonomy.
This research has notable real-world applications in fields where autonomous interaction and object manipulation are crucial. These include warehouse automation, search and rescue operations, and assistive technology for the differently-abled.
We selected the Unitree Go1 Quadruped Robot as the UGV on which to deploy our commanding algorithm for the interactive environment. We chose a quadruped to facilitate agile movement in unstructured environments.
The UGV hardware's architecture is described in Figure 2A. In this research, we only used the UGV's front sensors for Point Cloud generation.
The UGV is velocity-controlled using custom ROS messages sent to the Raspberry Pi over UDP.
Figure 2A
The designed Task Environment (Figure 2B) is made up of 2 unidentified objects, 2 goal positions, and the interacting UGV.
The environment also includes a line-of-sight obstruction to facilitate an experiment that approximates the complexities of a real-world task setting.
In the full experiment, the UGV is tasked with locating unidentified objects, discerning whether they are movable or immovable, moving any movable objects to a goal location, and checking in with a human for feedback on whether the drop-off goal location is correct. If it is incorrect, the UGV relocates the movable object to the other goal before recommencing the search for another object.
Figure 2B
To facilitate object pickup, we outfitted the UGV with a strip of Velcro attached to its chin (Figure 2C).
This choice was made in tandem with our object of choice, purple drinking cups, and with the selected UGV's operating heights in mind.
Specifically, the UGV was able to lower itself to a height just below that of a drinking cup. Fitting corresponding Velcro on cups that could move, and excluding Velcro from cups intended to be 'immovable', yielded a functional system in which the UGV could, on its own, pick up an object, discern whether it was movable, and move the object if so.
Figure 2C
Figure 3A
Our ROS architecture has ROS running simultaneously on three devices: the UGV's Raspberry Pi (which acts as the ROS master), the UGV's main NVIDIA Jetson Nano, and an external computer. While the Nano connects to the Raspberry Pi via Ethernet, the external computer is connected over Wi-Fi. The device on which each node runs is indicated in Figure 3A.
AR Track Node: uses the AR Track Alvar ROS package to deduce the position of AR tags in 3D space from the camera's intrinsic parameters and the AR tags' characteristics. Transformations are given in the UGV's 'camera_face' frame.
Goal Detector: reads the /tf topic and looks for the goal AR tags. The node remembers the last seen coordinates of these AR tags and continuously publishes them to the /goal1_center_coords and /goal2_center_coords topics.
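A minimal sketch of how such a node might be written is shown below; the frame names ('base', 'ar_marker_1', 'ar_marker_2') and the geometry_msgs/Point message type are assumptions for illustration, not the exact implementation.
# goal_detector_sketch.py - minimal sketch (assumed frame names and message type)
import rospy
import tf
from geometry_msgs.msg import Point

rospy.init_node('goal_detector')
listener = tf.TransformListener()
pubs = {'ar_marker_1': rospy.Publisher('/goal1_center_coords', Point, queue_size=1),
        'ar_marker_2': rospy.Publisher('/goal2_center_coords', Point, queue_size=1)}
last_seen = {}

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    for tag, pub in pubs.items():
        try:
            # Look up the most recent transform from the AR tag to the 'base' frame
            trans, _ = listener.lookupTransform('base', tag, rospy.Time(0))
            last_seen[tag] = trans
        except (tf.LookupException, tf.ConnectivityException, tf.ExtrapolationException):
            pass  # tag not currently visible; keep the last known coordinates
        if tag in last_seen:
            pub.publish(Point(*last_seen[tag]))
    rate.sleep()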
Figure 3B
Point Cloud Node: automatically launched upon UGV startup. It uses the UGV's front sensors to generate and publish a sensor_msgs/PointCloud2 object as visualized in Figure 3B.
Relay Node: subscribes to the topics published by the main NVIDIA Jetson Nano and republishes them. This step is necessary for the external computer to be able to read these messages: while the external computer is connected to the Raspberry Pi via Wi-Fi, the Nano connects on a different port via Ethernet, so messages published by the Nano are not directly readable by the external computer.
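As an illustration, a relay for the point cloud could be as simple as the sketch below; the topic names are assumptions.
# relay_node_sketch.py - republish a Nano topic so the external computer can read it
import rospy
from sensor_msgs.msg import PointCloud2

rospy.init_node('relay_node')
pub = rospy.Publisher('/relayed/point_cloud', PointCloud2, queue_size=1)     # assumed output topic
rospy.Subscriber('/camera/point_cloud', PointCloud2, pub.publish)            # assumed Nano topic
rospy.spin()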
Object Detector: processes the point cloud object to extract the XYZ coordinates and RGB channels of each point. After converting the RGB array to an OpenCV image, it uses an HSV filter to isolate the purple cup, and publishes the corresponding XYZ coordinates to the /cups/purple topic.
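The core of the color filter might look like the following sketch; the HSV bounds are illustrative, not the tuned values we used.
# hsv_filter_sketch.py - isolate the purple cup in the RGB data recovered from the point cloud
import cv2
import numpy as np

def purple_mask(rgb_image):
    """Return a binary mask of purple pixels in an RGB image built from the point cloud."""
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    lower = np.array([125, 80, 50])    # illustrative lower HSV bound for purple
    upper = np.array([155, 255, 255])  # illustrative upper HSV bound
    return cv2.inRange(hsv, lower, upper)

# The XYZ coordinates of the points where the mask is non-zero are then
# published to the /cups/purple topic.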
Figure 3C
Main Node: acts as the discrete state machine depicted in Figure 3C. The node subscribes to topics /cups/purple, /goal1_center_coords and /goal2_center_coords, stores and continuously updates the last known coordinates for the cup and each of the goals, and chooses what action the UGV should perform based on their values. In this implementation, looking for a cup or a goal simply consists of scanning the surroundings by rotating in place.
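A simplified skeleton of this state machine is sketched below; the state names are illustrative and the handler bodies are stubs standing in for the real behaviors.
# main_node_sketch.py - simplified sketch of the state machine structure (illustrative names)
import rospy

def search_cup():
    # Rotate in place until a purple cup is reported on /cups/purple.
    return 'GO_TO_CUP'

def go_to_cup():
    # Drive to the last known cup coordinates using the PID controller described below.
    return 'PICK_UP'

def pick_up():
    # Bow to attach the Velcro, then step back to classify the cup as movable or immovable.
    return 'SEARCH_GOAL'

def search_goal():
    # Rotate in place until a goal AR tag is seen.
    return 'GO_TO_GOAL'

def go_to_goal():
    # Drive to the goal, drop the cup, and ask the operator for feedback.
    return 'SEARCH_CUP'

HANDLERS = {'SEARCH_CUP': search_cup, 'GO_TO_CUP': go_to_cup, 'PICK_UP': pick_up,
            'SEARCH_GOAL': search_goal, 'GO_TO_GOAL': go_to_goal}

rospy.init_node('main_node')
state = 'SEARCH_CUP'
rate = rospy.Rate(10)
while not rospy.is_shutdown():
    state = HANDLERS[state]()  # each handler returns the next state once its action completes
    rate.sleep()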
The UGV accepts velocity inputs, but our program controls it in position, so we used a PID controller to determine the desired velocity command at each iteration of the control loop. All coordinates are given in the 'base' frame, which the UGV treats as the world origin.
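A minimal sketch of turning a position error into a velocity command is shown below; the gains and time step are illustrative, not our tuned controller.
# pid_sketch.py - minimal PID controller: position error in, velocity command out
class PID(object):
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        """Return the velocity command for the current position error."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: x-velocity command from the error between target and current x in the 'base' frame
target_x, current_x = 1.0, 0.4              # example positions in meters
pid_x = PID(kp=0.8, ki=0.0, kd=0.1, dt=0.1)  # illustrative gains
vx_cmd = pid_x.step(target_x - current_x)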
High Level Command Node: receives custom high level command ROS messages from the /high_cmd_to_robot topic and embeds them in UDP packets that are then sent to the Raspberry Pi.
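The forwarding step might look like the sketch below; the message fields, packing format, and port number are assumptions rather than the actual Unitree protocol, and geometry_msgs/Twist stands in for our custom high-level command message.
# high_level_cmd_sketch.py - forward high-level commands to the Raspberry Pi over UDP
import socket
import struct
import rospy
from geometry_msgs.msg import Twist  # stand-in for the custom high-level command message

UDP_IP, UDP_PORT = '192.168.12.1', 8082   # Raspberry Pi address (port assumed)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def forward(cmd):
    # Pack the forward, lateral, and yaw velocities into a UDP packet for the Pi.
    packet = struct.pack('fff', cmd.linear.x, cmd.linear.y, cmd.angular.z)
    sock.sendto(packet, (UDP_IP, UDP_PORT))

rospy.init_node('high_level_cmd_node')
rospy.Subscriber('/high_cmd_to_robot', Twist, forward)
rospy.spin()
The snippet that follows belongs to the storytelling pipeline described further below: it calls the OpenAI API to turn the mission log into a narrated story.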
# Generate a mission narrative with GPT-4, then voice it with OpenAI text-to-speech.
# ('prompt' is the aggregated mission log combined with our story prompt; see below.)
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model='gpt-4',
    messages=[{
        'role': 'user',
        'content': prompt,
    }],
    temperature=0.1,
    top_p=0.1,
    max_tokens=1500,
)

response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input=completion.choices[0].message.content,
)

# Save the narrated story as an MP3 audio file (filename illustrative).
response.stream_to_file("story.mp3")
Figure 3D
To advance operator confidence in the safe deployment of autonomous systems, we have implemented a pipeline that enables the UGV to tell a story about its mission.
Descriptive Logging:
Each function in the main node has an English description
When a function is called, its description is appended to the log (see the sketch after this list)
GPT-4 Integration:
The aggregated log is then combined with our story prompt
The combined prompt is passed to GPT-4, which generates a story (via the API call shown above)
Text-to-Speech Integration:
Import the voice model from OpenAI and select a voice type
Turn the generated story into an MP3 audio file
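A minimal sketch of the descriptive-logging step is shown below; the decorator name, descriptions, and prompt wording are illustrative, not the exact implementation.
# descriptive_logging_sketch.py - collect English descriptions as main-node functions are called
mission_log = []

def logged(description):
    """Decorator: append the function's English description to the log on every call."""
    def wrap(fn):
        def inner(*args, **kwargs):
            mission_log.append(description)
            return fn(*args, **kwargs)
        return inner
    return wrap

@logged("The robot rotated in place to search for a purple cup.")
def search_cup():
    pass  # the actual search behavior lives in the main node

search_cup()
story_prompt = "Tell a short story about this mission:\n" + "\n".join(mission_log)
# 'story_prompt' is then sent to GPT-4 as shown in the API call above.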
1. Connect to the unitree Wi-Fi
2. Connect to the Raspberry Pi and launch the Relay Node
ssh pi@192.168.12.1 <pwd: 123>
python camera1node.py
3. On an external computer, launch the object and goal detection nodes (this assumes the camera used for AR tag tracking is already running)
Note: depending on the camera used for AR tag tracking, parameters in src/perception/launch/ar_track.launch might need to be adjusted
# Object detection
rosrun perception object_detector_pc.py
# Goal detection
roslaunch perception ar_track.launch
rosrun perception get_goal1_transform.py ar_tag_1
rosrun perception get_goal1_transform.py ar_tag_2
rosrun perception goal_detector.py
4. Still on the external computer, run the main ROS node
rosrun plannedctrl main_node.py
Figure 4A
Figure 4B
Figure 4C
To successfully deliver our desired results we had to combine actuation, object classification, and goal feedback.
The first step was to test our implementation schema on the three main actuations: navigating to a detected object (Figure 4A), picking up the object (Figure 4B), and taking the object to a detected goal (Figure 4C). The successful execution of these three actuations provides the action space necessary to navigate a room with multiple objects.
Next, we implemented object classification. An object is classified when the UGV bows to pick up a cup and then steps back to see whether the cup is still there (first half of Figure 4C). If the cup is still there, it is classified as immovable; if the cup is no longer detected, it is assumed to have been picked up and is classified as movable.
Finally, we implemented the goal feedback. Part of our design requirements was to have the UGV ask the user whether the reached goal is the correct one. This is achieved by the following prompt on the High Level Command node:
Goal reached!
Is this the right goal? [y/n]
If the current goal is not the correct goal, the UGV begins searching for a new goal. If it is the correct goal, the UGV waits for 5 seconds (allowing the user to remove the object) and then begins searching for additional objects.
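A sketch of this check is shown below; the state names are illustrative and match the state-machine sketch above.
# Sketch of the goal-feedback check inside the main control loop
import rospy

answer = input("Goal reached!\nIs this the right goal? [y/n] ").strip().lower()
if answer == 'y':
    rospy.sleep(5.0)        # give the user time to remove the object
    state = 'SEARCH_CUP'    # resume searching for additional objects
else:
    state = 'SEARCH_GOAL'   # search for the other goal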
To navigate the test room, we combined the actuations, object classification, and goal feedback. The full environment interplay is shown in the demonstration video at the top of this page.
Figure 5A
Figure 5B
Figure 5C
We successfully demonstrated a UGV's ability to detect and interact with objects, achieving our primary research goals (Figure 5B). The additional integration of an LLM for communicating the UGV's actions further enhanced human-robot interaction capabilities.
We encountered technical challenges, particularly with hard-coded networking constraints on the onboard Raspberry Pi and Jetson Nano units, as well as with receiving reliable depth information from the fish-eye cameras' raw image feeds (Figure 5A). We overcame these challenges with novel technical solutions such as applying HSV detection to point cloud data and integrating an Intel RealSense camera.
Whilst our current implementation effectively functions in the designed task environment, it exhibits limitations in adaptability to diverse settings. The environment-specific programming restricts broader applicability, underlining the need to develop a more versatile ROS-based control algorithm (Figure 5C).
Future work includes refining the software to enhance environmental adaptability and task consistency. Specifically, we aim to develop algorithms that improve our actuation accuracy and better utilize the onboard cameras for simultaneous object and goal detection.
Daniel Baron
Robin Dumas
Bear Häon
Nakul Srikanth