Autonomous driving technology has been advancing rapidly, revolutionizing various fields. From Tesla's success in enabling hands-free driving to autonomous drones excelling in search operations and SLAM, the progress is remarkable. At NYU Self-Drive, we strive to deepen our understanding of both traditional computer vision algorithms and machine learning techniques. We participate in an annual internal competition tackling diverse challenges and are proud to host the ICRA Self-Driving Competition next year.
Our current focus is on developing efficient algorithms for a TurtleBot, equipped with only a monocular camera and no additional sensors, to navigate a maze marked with distinct images. The TurtleBot must accurately interpret the unique features of each image and match them against a given test image, which is identical to one of the images in the maze but differs in orientation, lighting, or partial occlusion. The TurtleBot is then expected to navigate to that image's location precisely, using the shortest path.
Here is an overview of the maze we work in. Inside the maze there are numerous cardboard panels, each displaying an image drawn from themes we picked. The maze is a simplified version of a real-world environment, with clear borders and evident image features. However, by learning how to customize exploration and navigation schemes here, we aim to build better applications for the real-world autonomous golf cart that we built in the past. On the right is a video that explains our team's purpose.
For exploration, we choose the SIFT algorithm to extract and match features in the images. The Scale-Invariant Feature Transform (SIFT) algorithm is a powerful technique in computer vision for detecting and describing local features in images. It operates in several steps to ensure robustness against scale, rotation, and illumination changes. First, it identifies key points by constructing a scale-space representation of the image using Gaussian blurring and detecting extrema in the Difference of Gaussians (DoG) images. These key points are then refined to improve localization and eliminate unstable points. Each key point is assigned an orientation based on the gradient directions of its surrounding pixels, ensuring rotation invariance. Finally, a feature descriptor is generated for each key point by creating a histogram of gradients within a local neighborhood, producing a unique and compact representation of the image's features.
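To make the pipeline concrete, here is a minimal sketch of SIFT detection and description with OpenCV (the file name "maze_board.jpg" is just a placeholder, not our actual data path):

```python
import cv2

# Load a maze-board image in grayscale; SIFT operates on intensity values.
img = cv2.imread("maze_board.jpg", cv2.IMREAD_GRAYSCALE)

# Create the SIFT detector (available in opencv-python >= 4.4 as cv2.SIFT_create).
sift = cv2.SIFT_create()

# Detect keypoints (scale-space extrema) and compute 128-dimensional descriptors.
keypoints, descriptors = sift.detectAndCompute(img, None)

# Visualize the keypoints with their scale and orientation.
vis = cv2.drawKeypoints(
    img, keypoints, None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS
)
cv2.imwrite("keypoints.jpg", vis)
print(f"Found {len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
```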
The image on the left shows how changing sigma (the blurring factor) produces different gradient spikes for two different blob shapes. The image on the right shows the result of SIFT matching on a book viewed from different directions: because SIFT records a dominant orientation for each keypoint, it knows the orientation of an identical object. Both images are from the "First Principles of Computer Vision" YouTube channel.
Below is the result of our SIFT codebook successfully matching the maze image with the target image. We use OpenCV's SIFT implementation to achieve this. Even though the target image is completely rotated compared to the original maze image, the SIFT algorithm is robust enough to match them.
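For reference, here is a matching sketch along these lines, using OpenCV's brute-force matcher with Lowe's ratio test (the file names are placeholders, and the 0.75 ratio threshold is a common default rather than our tuned value):

```python
import cv2

# Load the maze image and the (rotated/occluded) target image in grayscale.
maze_img = cv2.imread("maze_image.jpg", cv2.IMREAD_GRAYSCALE)
target_img = cv2.imread("target_image.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(maze_img, None)
kp2, des2 = sift.detectAndCompute(target_img, None)

# Brute-force matcher with k=2 nearest neighbors for Lowe's ratio test.
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)

# Keep matches whose best distance is clearly better than the second best.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")

# Draw the surviving matches for inspection.
vis = cv2.drawMatches(maze_img, kp1, target_img, kp2, good, None)
cv2.imwrite("matches.jpg", vis)
```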
Since the TurtleBot has only a monocular camera and no other sensors, doing full SLAM is quite hard. For our baseline, we use the TurtleBot's odometry combined with a Kalman filter, which gives us a time-consuming yet working navigation scheme.
The Kalman filter is an algorithm for estimating the state of a dynamic system in the presence of noise and uncertainty. It operates in two steps: prediction, where it estimates the current state using a mathematical model, and update, where it incorporates new noisy measurements to refine the estimate. By balancing the trust in the model and the measurements using a calculated Kalman gain, it provides accurate and efficient state estimates in real-time. Widely used in applications like robotics, navigation, and sensor fusion, the Kalman filter excels in handling noisy or incomplete data.
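To make the predict/update loop concrete, here is a toy 1D Kalman filter sketch with a constant-velocity model; it is illustrative only and not the exact filter we run on the TurtleBot:

```python
import numpy as np

def kalman_1d(measurements, dt=0.1, process_var=1e-3, meas_var=0.5):
    """Toy 1D constant-velocity Kalman filter: state = [position, velocity]."""
    x = np.zeros(2)                          # initial state estimate
    P = np.eye(2)                            # initial state covariance
    F = np.array([[1.0, dt], [0.0, 1.0]])    # state transition model
    H = np.array([[1.0, 0.0]])               # we only measure position
    Q = process_var * np.eye(2)              # process noise covariance
    R = np.array([[meas_var]])               # measurement noise covariance

    estimates = []
    for z in measurements:
        # Predict: propagate the state and its uncertainty through the model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update: blend the prediction with the measurement via the Kalman gain.
        y = z - H @ x                        # innovation
        S = H @ P @ H.T + R                  # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        estimates.append(x[0])
    return estimates

# Example: noisy position measurements of a point moving at ~1 m/s.
true_pos = np.arange(0, 5, 0.1)
noisy = true_pos + np.random.normal(0, 0.7, size=true_pos.shape)
print(kalman_1d(noisy)[-5:])
```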
Below are images illustrating aspects of the Kalman filter, obtained from the internet. The left image shows the Kalman filter process, where the initial estimate is refined by combining the predicted state with a noisy measurement to produce an optimal state estimate with reduced uncertainty. The right image shows a car returning to its starting point by tracking its own state with a Kalman filter.
Here is a flowchart that shows how we complete the challenge, from controlling the TurtleBot to exploration to navigation.
We utilize several ROS packages to control the TurtleBot's driving system and camera. Here is a diagram showing how we connect our host machine to the TurtleBot over SSH. We coded a control system for the TurtleBot: during exploration, the TurtleBot's camera is turned on, and we manually drive the robot with keys on our computer, as shown below.
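As a rough sketch of what such a keyboard control node can look like (the /cmd_vel topic is the usual TurtleBot default, and the key bindings and speeds here are illustrative, not our exact configuration):

```python
#!/usr/bin/env python
import sys
import termios
import tty

import rospy
from geometry_msgs.msg import Twist

# Map keys to (linear, angular) velocities; the values here are illustrative.
KEY_BINDINGS = {
    "w": (0.2, 0.0),   # forward
    "s": (-0.2, 0.0),  # backward
    "a": (0.0, 0.5),   # turn left
    "d": (0.0, -0.5),  # turn right
    " ": (0.0, 0.0),   # stop
}

def get_key():
    """Read a single keypress from the terminal (blocking)."""
    fd = sys.stdin.fileno()
    old = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        return sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, old)

def main():
    rospy.init_node("turtlebot_keyboard_teleop")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
    while not rospy.is_shutdown():
        key = get_key()
        if key == "q":          # quit
            break
        if key in KEY_BINDINGS:
            linear, angular = KEY_BINDINGS[key]
            msg = Twist()
            msg.linear.x = linear
            msg.angular.z = angular
            pub.publish(msg)

if __name__ == "__main__":
    main()
```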
As we go, we record two things: first, the camera data, for later feature matching; second, the velocity at each time stamp, saved to a JSON file. Once exploration finishes, we start navigation. If we successfully find the image using SIFT, we know two things about it: its time stamp, since the JSON file has an entry for every image the camera takes, and all the velocities that led the TurtleBot to that image. So, to navigate the TurtleBot to the image's location, we simply retrace every recorded velocity at every time stamp up to that image.
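Here is a minimal sketch of this record-and-retrace idea; the JSON field names and file paths are illustrative assumptions rather than our exact format:

```python
import json
import time

import rospy
from geometry_msgs.msg import Twist

def record_entry(log, linear, angular):
    """Append one (timestamp, velocity) record while exploring."""
    log.append({"timestamp": time.time(), "linear": linear, "angular": angular})

def save_annotations(log, path="annotations.json"):
    with open(path, "w") as f:
        json.dump(log, f, indent=2)

def retrace(path, match_timestamp):
    """Replay every recorded velocity up to the matched image's timestamp."""
    with open(path) as f:
        log = json.load(f)
    rospy.init_node("retrace_navigation")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
    rospy.sleep(1.0)  # give the publisher a moment to connect
    for i, entry in enumerate(log):
        if entry["timestamp"] > match_timestamp:
            break
        msg = Twist()
        msg.linear.x = entry["linear"]
        msg.angular.z = entry["angular"]
        pub.publish(msg)
        # Hold this velocity until the next recorded timestamp.
        if i + 1 < len(log):
            rospy.sleep(log[i + 1]["timestamp"] - entry["timestamp"])
```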
Here is the structure of the code that reflects our flowchart.
Inside dataset, we have the images taken by the camera at a fixed frame rate, stored as demo_1 and demo_2. We also have an annotation JSON file, which stores the TurtleBot's velocity for each image, so len(annotation) == len(images). The image 0.jpg under test is the reference image we want to match: a successful match means finding the image in the 'images' folder that is identical to 0.jpg under test.
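A small loading sketch under this layout (the exact paths, such as dataset/annotations.json and dataset/demo_1, are assumptions for illustration):

```python
import glob
import json
import os

DATASET_DIR = "dataset"

# Frames recorded during exploration (one run per demo folder).
images = sorted(glob.glob(os.path.join(DATASET_DIR, "demo_1", "*.jpg")))

# One velocity record per frame, so both lists must line up.
with open(os.path.join(DATASET_DIR, "annotations.json")) as f:
    annotations = json.load(f)
assert len(annotations) == len(images), "every frame needs a velocity record"

# The reference image we try to locate in the maze.
test_image_path = os.path.join(DATASET_DIR, "test", "0.jpg")
print(f"{len(images)} frames, matching against {test_image_path}")
```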
Below is a visualization of the traces our TurtleBot took, computed from the Kalman filter and the recorded velocity information.
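For reference, here is a rough dead-reckoning sketch of how such a trace can be drawn from the recorded velocities alone; the Kalman correction step is omitted for brevity, and the JSON format is the same assumed one as above:

```python
import json
import math

import matplotlib.pyplot as plt

def integrate_trace(annotation_path):
    """Integrate (linear, angular) velocities into a 2D trajectory."""
    with open(annotation_path) as f:
        log = json.load(f)
    x, y, theta = 0.0, 0.0, 0.0
    xs, ys = [x], [y]
    for i in range(len(log) - 1):
        dt = log[i + 1]["timestamp"] - log[i]["timestamp"]
        v, w = log[i]["linear"], log[i]["angular"]
        # Unicycle model: advance heading first, then position.
        theta += w * dt
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = integrate_trace("dataset/annotations.json")
plt.plot(xs, ys)
plt.axis("equal")
plt.title("TurtleBot exploration trace (dead reckoning)")
plt.show()
```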
Here is a successful navigation video.
Here are new updates:
A big limitation of our exploration scheme is the SIFT algorithm itself: if the images on the boards become extremely complex, the SIFT cluster size has to be enlarged and the computation can take a tremendous amount of time. For navigation, we are not really finding the shortest path, but rather retracing all the time stamps before the match. Therefore, for exploration, we hope to try different machine learning approaches, such as YOLO, and attempt more complex feature matching. For navigation, I plan to use interactive border detection and visual odometry, so that images alone can be used to deduce motion and position alongside the Kalman filter. This way, we can do SLAM with a simple map that covers the ground and the locations of the features.