Retrieving three-dimensional information about the environment is important for object detection in advanced robotic systems such as self-driving vehicles, virtual reality, and robotic grasping. Current methods rely on advanced sensors such as LIDAR, require multiple cameras for stereo triangulation, or use sophisticated machine learning algorithms. While these methods are powerful, they are expensive (both computationally and monetarily) or less compact because they require multiple camera angles. For light applications, such as a mobile robot that plays catch, a more cost-effective option is desirable. While a two-dimensional image may be useful for tracking a ball or a human in the horizontal and vertical directions, playing catch also requires depth information: how far an object is from the camera. The question this research explores is as follows: is it possible to effectively extract depth information from a single low-cost web camera image? While this topic has been explored, existing results are generally inaccurate and thus unreliable for object depth detection. In relation to the scope of this course, the goal of this research is to use a single camera image to obtain accurate depth information about a tossed beach ball or a person in order to successfully play a game of catch.
Various components of object tracking have been extensively studied. Most relevant to this project is the circular (or adaptive) Hough transform [1], which is frequently used in digital image processing to detect circular objects. Typically, a raw digital image is converted to a grayscale or binary image and then processed through an edge detector. The Hough transform is then applied to the processed image to detect circles. Essentially, the Hough transform maps points in an image into a parameter space and accumulates votes (depending on the desired shape, such as a line or circle) to determine where in the image the desired shapes appear.
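As an illustration, a minimal sketch of this pipeline using OpenCV is shown below; the parameter values are placeholders rather than tuned settings from this project.

    import cv2

    frame = cv2.imread("frame.png")                   # raw digital image
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # grayscale conversion
    gray = cv2.medianBlur(gray, 5)                    # smooth to suppress false edges

    # cv2.HoughCircles runs an internal Canny edge detector, then accumulates
    # votes in (x, y, r) parameter space; accumulator peaks are detected circles.
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=50,
                               param1=100, param2=30, minRadius=10, maxRadius=200)
    if circles is not None:
        for x, y, r in circles[0]:
            cv2.circle(frame, (int(x), int(y)), int(r), (0, 255, 0), 2)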
To obtain depth information, stereo techniques are relatively simple to implement and have been extensively researched [2, 3]. The essence of stereo imaging is that depth can be extracted from two (or more) images of the same scene by determining the relative shift of points of interest between the images. This passive technique relies only on camera images, not on infrared or laser signals.
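For reference, the underlying relation is simple similar-triangle geometry; a minimal sketch, assuming rectified images with a known baseline and focal length, is:

    def stereo_depth(focal_px, baseline_m, disparity_px):
        # A point's horizontal shift (disparity) between the two images is
        # inversely proportional to its depth: Z = f * B / d.
        return focal_px * baseline_m / disparity_px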
A relatively new technique uses only one image to gain depth information. One group trained a convolutional neural network (CNN) to perform single-image depth estimation [4]. This technique does not require ground-truth data and performs well in comparison to other machine learning algorithms.
While the adaptive Hough transform could prove useful for ball tracking, it provides no depth information from a two-dimensional image. Stereo imaging yields depth information, but it requires two separate cameras, and we wish to restrict our research to a single camera. Finally, as powerful as machine learning algorithms are for depth estimation, the long training time and the need for large amounts of training data are unreasonable for the small-scale project we are developing. Instead, we wish to implement a simple, non-machine-learning technique that provides depth information from a single camera image.
The solution we explore for a simple single-camera depth estimator is to capture a baseline image of an object of interest (i.e., a beach ball) at a known distance (depth) from the camera. In this baseline image, we use the adaptive Hough transform along with color masking to locate the circular object, recording the known depth of the ball and its apparent diameter. From the baseline image we compute the camera's focal length; the depth in a new image is then recovered from the apparent size of the beach ball in that image and the computed focal length, via triangle similarity. This requires a single camera and one initial data point.
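A minimal sketch of this calibrate-then-measure step follows; the variable names and numeric values are illustrative, not the project's actual code.

    KNOWN_DEPTH_FT = 3.0    # depth of the ball in the baseline image
    KNOWN_WIDTH_FT = 1.0    # true diameter of the beach ball

    def focal_length(baseline_pixel_width):
        # Triangle similarity: F = (P * D) / W, where P is the apparent width
        # in pixels, D the known depth, and W the true width of the object.
        return (baseline_pixel_width * KNOWN_DEPTH_FT) / KNOWN_WIDTH_FT

    def depth_from_width(F, pixel_width):
        # Invert the same relation for a new frame: D' = (W * F) / P'.
        return (KNOWN_WIDTH_FT * F) / pixel_width

    F = focal_length(210.0)               # one-time calibration (example pixel width)
    depth_ft = depth_from_width(F, 84.0)  # depth of the ball in a new frame

With the example numbers above, depth_ft evaluates to 7.5 feet: the ball appears 2.5 times smaller than in the 3-foot baseline image, so it is 2.5 times farther away.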
The experimental setup is shown in Figure 1, where blue markers are placed at one-foot increments from the laptop camera. The baseline object was then held at the various markers; Figure 2 shows an example output from the algorithm. Evaluation of the developed algorithm is straightforward: for testing purposes, we placed a single-colored object of roughly the same size as the beach ball at a known distance from the camera and compared that distance to the reading from our algorithm.
Figure 1. Preliminary experimental setup for gaining depth information from a single camera image. The blue markers are placed at one-foot increments from the camera. A single-colored object roughly the same size as the fully inflated beach ball was used for initial data collection.
Figure 2. Example data collection for gaining depth information from a single camera image. The tan object is the object of interest. In this example, the object is being held at three, six, nine, and twelve feet from the camera. The measured distance from the algorithm is displayed in the lower right corner.
Preliminary data were collected regarding depth information from a single camera image. Three data points were collected at each known distance, at one-foot increments from 0 to 15 feet, as this is approximately the farthest the ball will be from the robot. The average result and standard deviation are displayed in Table 1 and Figure 3, along with the expected output from the algorithm. Notice that the preliminary results do not match the ideal linear behavior: the measured output falls increasingly short the farther the object is from the camera.
The evaluation criteria and data discussed above illustrate that gaining depth information from a single camera image is possible. However, accuracy is clearly an issue: at 15 feet, the measured output is more than a foot off, a significant error. Ideally, over the range of 0 to 15 feet, the measured output from the algorithm should match the actual distance from the camera.
New data were collected regarding depth information from a single camera image. To improve the accuracy of the distance readings, we applied a linear scaling factor of 0.9293 determined from the preliminary data. The same experimental setup was used (see Figure 1), and three data points were collected at one-foot increments from 0 to 15 feet. The average result (with the standard deviation as error bars) is displayed in Table 1 above and Figure 4 below, along with the expected output. Notice that the current state of the algorithm closely matches the ideal linear behavior.
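As a hedged sketch, assuming 0.9293 is the slope of the origin-crossing best-fit line in Figure 3 (observed distance roughly 0.9293 times actual distance), the correction amounts to dividing each raw reading by that slope:

    SCALE = 0.9293   # slope of the best-fit line from the preliminary data

    def corrected_depth(raw_depth_ft):
        # Assumed correction: undo the systematic underestimate by dividing
        # the raw algorithm output by the fitted slope.
        return raw_depth_ft / SCALE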
Table 1. Data for obtaining depth information from a single camera image.
Figure 3. Preliminary data collected for gaining depth information from a single camera image. The dashed line is the expected output distance. The solid line shows the observed experimental output. Notice that the observed output is lower than expected at larger distances. A best-fit line constrained to pass through the origin is also displayed.
Figure 4. Data collected for gaining depth information from a single camera image. The dashed line is the expected output distance. The solid dots are the measured output from the depth algorithm. Notice that the observed output closely matches the expected output. Error bars shown are the standard deviation of three separate measurements.
In this section we discuss differences between the preliminary and final data and describe our contributions to the research question. There are clear differences between the preliminary data (Figure 3) and the final data (Figure 4): the preliminary data were inaccurate at larger distances. By extracting a scaling factor from the preliminary data and using it to adjust the depth algorithm, we produced new data that more closely match the actual distances. However, both data sets show larger variance at larger distances from the camera, meaning the algorithm degrades for objects far from the camera. This is because, at larger distances, each pixel spans a larger physical area, so small errors in the color mask translate into larger errors in the estimated depth.
The evaluation criterion for this project is straightforward: comparing measured distance to actual distance. We were able to extract depth from a single camera image without extensive machine learning algorithms or expensive equipment such as LIDAR or multiple cameras. In this respect, we contributed to the research question by finding a low-cost method for gaining depth information from a single image: tracking an object by color and using the triangle similarity between a baseline image and a new image of interest. However, tracking multi-colored objects with the developed algorithm is more difficult and requires more tuning. In terms of implementing this algorithm on CatchBot, we used the method to obtain depth information for humans wearing AR tags, and we were able to successfully differentiate between the different zones and throw the ball at the appropriate speed. In summary, we produced promising data and contributed to the research question developed in this report.
Moving forward, the next logical step for this research will be to track the multi-colored beach ball itself, as this is the object of interest while playing catch. This will involve modifying the algorithm to create a binary mask from multiple color ranges and to compute the largest enclosing circle from the resulting contours (see the sketch after this paragraph). After the algorithm can successfully detect the beach ball, a new baseline image will need to be captured, and a new experimental setup will need to be constructed to collect data based on the beach ball. Work has already been done on creating an appropriate mask for the beach ball, as seen in Figure 5. To produce this result, I tested various color ranges for pink, orange, and blue in Kelley and created a unified binary mask from these ranges. However, a challenge remains in successfully tracking the beach ball: one issue I have encountered is determining the minimum enclosing circle of multiple contours in a binary image.
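A sketch of the planned multi-color masking step is shown below, assuming OpenCV (with the version 4 findContours signature); the HSV bounds are placeholders, not the tuned ranges used for Figure 5. One workaround for the multiple-contour issue is to stack the points of all contours and fit a single minimum enclosing circle to the combined set.

    import cv2
    import numpy as np

    frame = cv2.imread("frame.png")
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Placeholder HSV ranges for pink, orange, and blue.
    ranges = [((140,  80,  80), (170, 255, 255)),   # pink
              ((  5,  80,  80), ( 20, 255, 255)),   # orange
              ((100,  80,  80), (130, 255, 255))]   # blue

    mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
    for lo, hi in ranges:
        mask |= cv2.inRange(hsv, np.array(lo), np.array(hi))  # unified binary mask

    # Merge all contour points and fit one minimum enclosing circle,
    # rather than fitting a circle to each color patch separately.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        points = np.vstack(contours)
        (x, y), radius = cv2.minEnclosingCircle(points)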
There are also aspects of this research that can be pursued outside the scope of this course. The major topic to explore is object detection techniques other than color; for example, edge detection combined with a circular Hough transform may be a viable option for tracking the beach ball. In short, there are other ways of detecting objects in a camera image that can be explored and incorporated into the developed algorithm.
Figure 5. Ball detection based on three colors (pink, orange, and blue). The binary mask is shown after background removal. Some background blobs remain and may require additional opening and closing operations.
[1] John Illingworth and Josef Kittler. A survey of the Hough transform. Computer Vision, Graphics, and Image Processing, 44(1):87–116, 1988.
[2] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE, 2003.
[3] A. N. Rajagopalan, Subhasis Chaudhuri, and Uma Mudenagudi. Depth estimation and image restoration using defocused stereo pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1521–1525, 2004.
[4] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.