Team Members: Samuel Kinstlinger, Amanda Wang, Brandon Xu, Chloe Hallaert, Layla Johnson, Nevra Diker, Samuel Adenola
Faculty: Pratap Tokekar, Vishnu Sharma, Jingxi Chen
I4C Teaching Assistant: Anh Le
This project explores navigation solutions for robots that truly understand both their environment and text/speech commands.
Can we create a ground robot that can scan its surroundings, detect objects, use those detected objects to classify a room, sense the boundaries of its environment, generate and remember a map of its environment, and follow navigational orders by storing the locations of the objects, rooms, and boundaries in memory?
Our goal is to achieve Robotic Semantic Navigation (RSN). Today's robots do not truly understand the world or the directions they are given. We hope to build robots that truly understand the semantics and meaning of different aspects of the world. Below are the three main goals we are trying to achieve in Robotic Semantic Navigation.
Before we can get started with a physical robot, we must simulate the algorithms and the robot in action. This is done using the Robot Operating System (ROS) and its simulation tools.
ROS (Robot Operating System) is a collection of software libraries and tools used to build and simulate robots in a myriad of environments. We use ROS because it is open source and lets us simulate robots before we build them in real life. Below is a labeled screenshot of our robot being simulated in ROS.
Below is a ROS command we wrote in the ROS Terminal/Shell that loads a simulation of our robot in a house.
The robot we are using in simulations, and the robot we hope to implement our RSN algorithms on, is called the TurtleBot2. The TurtleBot2 is a ground robot with a Yujin Kobuki base as well as sensors and a processor. The LIDAR (Light Detection and Ranging) sensor uses lasers to detect its surroundings within a 270-degree range. The Stereo Vision sensor uses two image lenses to capture two 2D images of the same scene. The TurtleBot2 then compares the two 2D images: the shift of each object between them is used to build a 3D picture and thus estimate the depth of objects in the image. The NVIDIA Jetson processor stores and processes the data from the LIDAR and Stereo Vision sensors and drives the wheels. Below is a labeled image of the TurtleBot2.
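As a rough illustration of how depth can be recovered from a stereo pair, below is a minimal Python sketch using OpenCV's block-matching algorithm. The file names, focal length, and baseline are placeholder values for the example, not our actual camera setup.

```python
import cv2
import numpy as np

# Load the left and right images of a stereo pair as grayscale.
# (File names here are placeholders, not our real data.)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching compares small patches between the two images to find
# how far each pixel appears to shift between views (the "disparity").
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Closer objects shift more between the two views, so depth is inversely
# proportional to disparity: depth = focal_length * baseline / disparity.
# Invalid disparities are clamped so the division stays safe.
focal_length_px = 700.0   # example value, depends on the camera
baseline_m = 0.1          # example distance between the two lenses
depth = (focal_length_px * baseline_m) / np.maximum(disparity, 1e-6)
```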
Now that we know the basics of the TurtleBot2 hardware, ROS, and RSN, let's delve a little deeper into how the robot detects objects in images, classifies rooms, and remembers the location and layout of objects, rooms, and the rest of its environment.
The TurtleBot2 takes in images as input while it scans its surroundings. Before the data in these images can be used, it must be processed. Each image is made up of many tiny colored pixels, and each pixel's color is determined by its red, green, and blue components. Each RGB component is an integer between 0 and 255 indicating its magnitude. For certain feature-detection algorithms, such as finding the brightness relations between different parts of an image, color is not needed. For those algorithms we convert the color image into a grayscale image, which is easier to process: each grayscale pixel has a single brightness value from 0 (pitch black) to 255 (very bright). One problem remains: not all images have the same pixel width and height. The robot would not learn effectively from training images whose resolutions differ from the resolution of the images the TurtleBot2 itself captures. For this reason, when the training images were processed, an Image Pyramid was used to change each image's resolution (pixel width and height) to match the resolution of the images the TurtleBot2 inputs and processes.
The image below shows how images are made up of pixels.
The image below shows an Image Pyramid in action.
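To make the preprocessing steps above concrete, here is a minimal Python/OpenCV sketch of converting a color image to grayscale and stepping its resolution down with an image pyramid. The file name and target resolution are placeholders, not the exact values our pipeline uses.

```python
import cv2

# Placeholder input; in the real pipeline this comes from the robot's camera.
image = cv2.imread("camera_frame.png")

# Each pixel's R, G, and B components (0-255) are combined into a single
# 0-255 brightness value, giving a grayscale image that is easier to process.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Image pyramid: repeatedly halve the resolution until the image is no
# larger than the resolution the algorithms expect (example target below).
target_width, target_height = 416, 416
while gray.shape[1] > target_width or gray.shape[0] > target_height:
    gray = cv2.pyrDown(gray)   # each call halves the width and height
```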
After an image is processed, the TurtleBot2 needs to find/localize the objects contained in it.
To localize objects in an image, the TurtleBot2 uses an algorithm called YOLO (You Only Look Once). The YOLO algorithm divides the image into a grid of sections and, in a single pass, predicts bounding boxes and confidence scores for the objects whose centers fall in each section; this single pass is what "you only look once" refers to. Older detectors instead relied on the Sliding Window Algorithm, which incrementally scans a window across the image and looks for objects based on sudden color and brightness changes.
A visualization of the Sliding Window Algorithm searching an image for a face is shown below.
Next, the YOLO algorithm assigns each predicted box a confidence level reflecting how certain it is that an object is really there. Predictions with confidence above a chosen threshold are kept as objects, while those below the threshold are discarded. Finally, the YOLO algorithm refines each detection using Anchor Boxes, which capture the true shape of the object. YOLO thus localizes every object in the image so that each one can be classified.
An image with localized objects is shown below.
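As a simplified illustration of the confidence-threshold step, the short Python sketch below discards predicted boxes whose confidence falls under a threshold. The box values are invented for the example; this is not the actual YOLO code.

```python
# Each prediction: (x, y, width, height, confidence). Values are made up.
predictions = [
    (120, 80, 60, 90, 0.92),    # likely a real object
    (300, 40, 50, 50, 0.15),    # probably a false detection
    (205, 150, 80, 120, 0.78),
]

CONFIDENCE_THRESHOLD = 0.5

# Keep only the boxes the detector is reasonably sure about;
# everything below the threshold is discarded.
kept_boxes = [p for p in predictions if p[4] >= CONFIDENCE_THRESHOLD]

for x, y, w, h, conf in kept_boxes:
    print(f"object at ({x}, {y}), size {w}x{h}, confidence {conf:.2f}")
```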
As previously mentioned, a machine learning model (essentially a complex equation) was trained on many images along with labels for the objects they contain (Supervised Learning). The model learned to recognize the features (edges, corners, brightness relations, color relations, etc.) of different objects. When the TurtleBot2 scans its surroundings, each localized object is fed through this model, which matches the object's features to those of objects it has already learned to classify.
A visualization of the feature-matching algorithm matching two different dogs is shown below.
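The sketch below shows the general idea of feature matching using OpenCV's ORB features: distinctive points are extracted from two images and paired by similarity. The file names are placeholders, and this stands in for, rather than reproduces, our trained model.

```python
import cv2

# Placeholder images; in practice these would be a detected object and a
# reference image of a known object class.
img1 = cv2.imread("object_crop.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("known_object.png", cv2.IMREAD_GRAYSCALE)

# ORB finds distinctive features (corners, edges, brightness patterns)
# and describes each one with a compact binary vector.
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match each feature in the first image to its most similar feature in the
# second image; many strong matches suggest the same type of object.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matching features found")
```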
With these algorithms in place, our robot can determine the class/type of each localized object it scans. When the model is very confident about an object's class, the object is labeled as what the ML model predicted it to be.
A visual representation of this algorithm classifying a person and a chair in an image is shown below.
This whole process of processing an image, then localizing and classifying all of the objects within it, is known as Object Detection.
A visualization using animals to help you understand the different parts of object detection as well as object detection itself is shown below.
The TurtleBot2 uses the above processes to come full circle and achieve Robotic Semantic Navigation (RSN). The robot senses the locations of boundaries with its LIDAR sensor and remembers them, generating a map of its environment.
This is shown below by two images of the map in the TurtleBot2's memory. The first image is from before it moves around and scans the house, and the second is from afterward.
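As a simplified picture of how LIDAR readings become a map, the Python sketch below marks each range reading as a boundary cell on a small occupancy grid. The readings, grid size, and cell size are invented for illustration; the real mapping software on the TurtleBot2 is much more involved.

```python
import math
import numpy as np

# Example LIDAR readings: (angle in degrees, distance in meters).
# These values are made up for illustration.
readings = [(0, 2.0), (45, 1.5), (90, 3.2), (135, 2.7), (180, 1.1)]

CELL_SIZE = 0.1                              # each grid cell is 10 cm x 10 cm
grid = np.zeros((80, 80), dtype=np.uint8)    # 0 = unknown/free, 1 = boundary
robot_row, robot_col = 40, 40                # robot starts at the grid center

# Convert each (angle, distance) reading into a grid cell and mark it
# as a boundary (a wall or obstacle the laser bounced off).
for angle_deg, dist_m in readings:
    angle = math.radians(angle_deg)
    row = robot_row + int((dist_m * math.sin(angle)) / CELL_SIZE)
    col = robot_col + int((dist_m * math.cos(angle)) / CELL_SIZE)
    if 0 <= row < grid.shape[0] and 0 <= col < grid.shape[1]:
        grid[row, col] = 1
```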
The TurtleBot2 also remembers the locations of the objects it classifies, can identify rooms based on those objects (Ex. Sink + Toilet = Bathroom), and remembers the locations of those classified rooms. Together, these abilities allow the TurtleBot2 to achieve the above goals and succeed in performing Robotic Semantic Navigation (RSN).
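Here is a toy Python example of the "Sink + Toilet = Bathroom" idea: a room is classified from the set of object labels detected inside it. The rules are simplified examples, not our full room classifier.

```python
# Simplified example rules: which objects strongly suggest which room.
ROOM_RULES = {
    "bathroom": {"sink", "toilet"},
    "kitchen": {"refrigerator", "oven", "sink"},
    "bedroom": {"bed", "dresser"},
}

def classify_room(detected_objects):
    """Return the room whose characteristic objects best match what was seen."""
    detected = set(detected_objects)
    best_room, best_score = "unknown", 0
    for room, clues in ROOM_RULES.items():
        score = len(detected & clues)    # how many clue objects were detected
        if score > best_score:
            best_room, best_score = room, score
    return best_room

print(classify_room(["toilet", "sink", "towel"]))   # -> bathroom
```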
Our robot can...
Scan its environment for objects as it travels.
Use object labels/types to classify and label rooms.
Remember the layout and locations of rooms, objects, and landscapes.
Use all of the above to navigate to objects as commanded. (Ex. Go to the kitchen, find me a glass of water, find my keys, etc...)
https://drive.google.com/file/d/1cG49Q3WAU-d6P_bWwMQbTzbQEgfQJDZU/view?resourcekey
(Robot Demonstration in a Virtual Environment)
As you can see in the video, we simulate a real-life scenario in which someone inside their house tells a robot to go to the kitchen. We first told the robot to go to the kitchen. The robot then searched its memory and found the exact location of the kitchen. Lastly, the robot used the map of the house it had generated by scanning its environment to navigate to the kitchen.
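Conceptually, that navigation step boils down to looking the named room up in the robot's memory and handing its stored coordinates to the navigation system, as in the Python sketch below. The coordinates are invented, and send_goal is a hypothetical placeholder for the robot's real navigation interface.

```python
# The robot's "memory": rooms it has classified, mapped to coordinates on
# the map it built. These coordinates are invented for illustration.
semantic_map = {
    "kitchen": (3.2, -1.5),
    "bathroom": (-2.0, 0.8),
    "bedroom": (1.1, 4.6),
}

def handle_command(command):
    """Handle a command like 'go to the kitchen' using the stored map."""
    for room, (x, y) in semantic_map.items():
        if room in command.lower():
            print(f"Navigating to the {room} at map position ({x}, {y})")
            # send_goal(x, y) would hand these coordinates to the robot's
            # navigation software (a hypothetical helper, not a real ROS call).
            return
    print("Sorry, I don't know where that is yet.")

handle_command("Go to the kitchen")
```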
Once we fine-tune the TurtleBot2, it can serve as a basis for ground manipulator robots (robots with claws, arms, etc.) that could be used in virtually every field. Many major problems, as well as minor inconveniences, could be reduced or eliminated. Elderly people could have robots retrieve their walkers and help them navigate to where they need to go. Understaffed hospitals could have robots fetch vital supplies such as needles and defibrillators while the staff continue tending to their patients. Camp counselors would no longer have to walk all the way back to the bunkhouse because one kid forgot their sunscreen.