Abstract
Depth sensors are nowadays crucial to many applications, and not only in the robotics field. These sensors are based on two distinct technologies – Structured Light and Time-of-Flight – and, since the first device aimed at the general public was released in 2010, the available solutions have evolved continuously.
This paper provides a quantitative benchmark for the devices tested, comparing the Microsoft Kinect v1 and the Intel RealSense L515 depth sensors according to the closest there is to an internationally recognized standard, the VDI/VDE 2634 – Part 2 norm, and assessing different errors in the 500-1000 mm range.
Introduction
Over a ten-year period, three-dimensional optical measuring systems have increased in popularity and broadened their span of applications across several industries, ranging from automotive and aerospace to biomedical, naturally including robotics. Typically, depth data were obtained using stereoscopic systems with complex software implementations to combine the different image feeds and extract the depth information. With the advent and continuous development of structured light and time-of-flight (ToF) technologies, powerful infra-red (IR) laser sensors have become available, with resolution (and accuracy) increasing and cost and size diminishing over the years.
Microsoft Kinect version 1 was presented in 2009 and released to the public in 2010 as Kinect for Xbox, with Kinect for PC following a year later. It marked a new generation of affordable depth sensors and enabled numerous developments, as the possibilities it opened for enthusiasts were immense: the camera could be interfaced with Microsoft’s Visual Studio software development kit (SDK) to suit a vast range of applications. Consequently, although primarily aimed at gaming, several areas experimented with this product – an interesting example is a pilot study in Apnoea Monitoring of Children (1). Robotics also benefited greatly from this technological development, enhancing several control-related applications such as 3D mapping and obstacle avoidance (2, 3), which are key in self-driving vehicle implementations.
This first Kinect used structured light technology, projecting an IR dot pattern onto a scene and assessing the depth information with an IR camera, discretizing it into a 2048-level, 320x240-pixel depth map. Later, in 2013, came Microsoft Kinect version 2, which uses ToF technology, and the two have already been extensively compared to one another (4-6).
In April 2020, Intel released the latest of its depth-sensing cameras, the RealSense L515 (“L” stands for LiDAR, in turn the acronym for “light detection and ranging”), strengthening a line-up whose first model was launched in 2015 and which has since been among the favourites for interactive artificial intelligence, security and human interaction design. This popularity is mainly due to its open-source supporting SDK, librealsense, optimized for software developers who only need to deal with their own implementation and can thus apply the camera directly to real-life problems. The librealsense library, currently at version 2.0, provides cross-platform support for the multitude of RealSense depth cameras and trackers, offering high- and low-level APIs and integration with MATLAB, OpenCV, ROS and Unity. Debug tools and viewers for both the depth and RGB streams are available as well.
Boasting a ToF depth sensor with resolutions of up to 1024x768 px and an RGB sensor with a FullHD stream (1920x1080 px), the Intel L515 camera is the state of the art for low-cost consumer- and professional-grade depth cameras (7). It represents a ten-year evolution of depth-sensing technology.
The aim of this paper is to compare the two sensors introduced above, assessing the development of optical depth-measuring sensors over the ten-year period mentioned. To that end, it follows the German VDI/VDE 2634 Part 2 standard, as there is no internationally recognized standard for comprehensive testing of depth sensors. Although additional tests may be needed to completely characterize a camera, the ones conducted provide a good basis for comparison of the Kinect v1 and the Intel L515.
The paper is organized as follows. Section 2 covers the existing technologies – structured light vs ToF – and the different point cloud processing algorithms. Section 3 reviews related work found in the literature. Section 4 presents a more in-depth view and comparison of both cameras. Section 5 introduces the experiments that were carried out, describing the test setup and the procedure followed. In Section 6, the results are presented, comparing both sensors; finally, in Section 7, the conclusions are summarized.
Depth-Measuring Technologies
There are two main technologies adopted for depth measuring; an overview of both is given here.
A- Structured Light Technology
Structured-light-based sensors use triangulation methods to determine the depth information of a scene. By analysing the scene from different perspectives – using one or more cameras, alone or coupled with a projector – it is possible to triangulate the range coordinate of each point in the scene (8-10).
Contrary to stereoscopy methods, which rely on two cameras comparing data on the region of interest, structured light sensors use a single camera and a laser projector which emits a codified pattern of light onto the surface or object to be measured. As the projected pattern is structured so as to best represent different topological changes and is known beforehand by the system, trigonometry can be used – in place of a second camera – to calculate the depth of the scene by assessing the deformations suffered by the structured pattern of light (8, 11).
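For a camera-projector pair, this geometry reduces to the familiar triangulation relation; the following is a generic sketch (a standard formulation, not taken from the cited sources), where $f$ is the camera focal length, $b$ the camera-projector baseline and $d$ the disparity between the observed and expected image position of a pattern feature:

\[ Z = \frac{f\,b}{d}, \qquad \Delta Z \approx \frac{Z^{2}}{f\,b}\,\Delta d \]

The second expression shows that a fixed disparity uncertainty translates into a depth error that grows quadratically with distance, which is why triangulation-based devices lose accuracy at longer ranges.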
Typically, powerful IR laser projectors are used, and it is possible to reach sub-millimetre accuracy.
There are several structured light implementations, namely Time-multiplexing – the scene is split into several areas of interest by the projected pattern; it is not suitable for dynamic shape reconstruction –; Direct coding – using a colour-coded/grey-shaded pattern, it allows depth information to be obtained from just one frame, but can be very sensitive to ambient lighting conditions –; and, finally, Spatial neighbourhood – a spatially structured pattern is projected onto the scene, creating uniqueness in the neighbourhood of each pixel. The last one is by far the most common among current depth sensors (8).
B- Time-of-Flight Technology
Time-of-Flight technology uses a different approach: directly estimating the distance to the target. Well-known examples of this principle are SONAR and RADAR, but there is also LiDAR, which uses light to calculate distance (depending on the power and wavelength of the light, time-of-flight sensors can measure depth at significant distances – over tens of meters) (8, 10).
Hence, these systems consist of a light transmitter (illumination block) and a receiver: the transmitter emits a modulated signal that bounces off objects in the region of interest (ROI) and is reflected back to the receiver (a sensor consisting of a matrix of pixels that collects light from the ROI). To reconstruct 3D environments, laser scanners are typically used, orienting a laser beam around two angles. If the signal is periodic, the distance to the scene can be accurately calculated by evaluating the phase shift between emission and reception (the speed of light being a known constant). Two modulations can be used to apply ToF principles: pulsed modulation and continuous-wave modulation.
The first is direct and easy to implement. It requires very short light pulses with fast rise and fall times and high optical power, providing great implementation efficiency. The second yields better results: it uses a cross-correlation operation between the emitted and received signals to derive an estimate of the phase shift, and, because it considers data from multiple samples, the precision of the estimate is improved. Both modulations are subject to phase-estimation uncertainty, so advanced systems use multi-frequency scanning, increasing the precision of the depth data (8).
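For reference, under the usual simplifying assumptions (a standard formulation, not taken from the cited sources), the two modulations recover distance as

\[ d = \frac{c\,\Delta t}{2} \quad \text{(pulsed)}, \qquad d = \frac{c\,\Delta\varphi}{4\pi f_{mod}} \quad \text{(continuous wave)}, \]

where $c$ is the speed of light, $\Delta t$ the measured round-trip time, $\Delta\varphi$ the estimated phase shift and $f_{mod}$ the modulation frequency. Since the phase wraps every $2\pi$, a single continuous-wave frequency yields an unambiguous range of only $c/(2 f_{mod})$, which is one further reason why multi-frequency scanning is employed.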
The main disadvantage of this technology is its susceptibility to interference from other light sources, particularly those emitting at the same wavelength, which impairs performance when several sensors are mounted in arrays (8).
The later Kinect version 2 was developed using a ToF sensor, as was the Intel RealSense L515, a compact, high-resolution depth camera. ToF-based devices generally give more precise and accurate depth information than their structured-light counterparts (4, 6).
Related Work
The authors of (6) extensively characterized the Microsoft Kinect v2, the Orbbec Astra S and the Intel D400 series sensors. Both pixel-wise – accuracy of the sensor at different distances – and sensor-wise – quality of the reconstruction of a known 3D object – characterization experiments were conducted. The latter, in particular, was based on simple geometric solids, namely a cylinder, a sphere and a plane, and the distances between the acquired points and characteristic dimensions were compared to the real ones.
The tests showed that the best results for all cameras – more accurate reconstruction of the 3D object and better pixel-wise accuracy – were obtained at the closest working range, as anticipated.
Carfagni et al. (4) characterized the Intel RealSense D415, further comparing its performance to that of the RealSense SR300, the PrimeSense Carmine 1.09 and the Kinect v2. A thorough test was conducted, assessing the response of the D415 sensor in different ranges: first, in the 150-500 mm range, by means of a calibrated-diameter sphere set at increasing distances from the sensor, which in turn was tested at various laser power levels; then, in the 500-1000 mm range, following the VDI/VDE 2634 Part 2 standard, which is also followed in this paper and is explained in the next sections; and, finally, reconstructing a complex 3D object – a statue. The D415 proved very accurate, both in the close-range and in the long-range tests, and, as suggested by the authors, although aimed at gaming and gesture recognition, it could also be used as a 3D scanner.
Pinto et al. (5) also compared several sensors, namely both versions of the Kinect (v1: structured light; v2: ToF), the Structure Sensor (structured light) and the SwissRanger SR4000 (ToF). A Random Sample Consensus (RANSAC) algorithm was implemented to best fit different-coloured planes (green and red) to a series of depth data in the form of point clouds. The results showed that the error in acquiring the distance to a plane increased with the distance to the latter for the Kinect v1 and the Structure Sensor, while remaining approximately constant for the sensors based on ToF technology.
Devices Comparison
A. Microsoft Kinect v1
Featuring structured light technology, the Kinect version 1 (Figure 1) – in this case, the Kinect for Windows variant – was released in February 2011 and is, nowadays, naturally out of date: in fact, Microsoft has ceased releasing new drivers and SDK updates for this sensor. It also offers motion-tracking features, being able to recognize up to six standing or sitting people within its field of view (FOV) by discretizing and tracking 20 skeletal joints (12). It possesses an RGB camera as well (Table 1).
B. Intel RealSense L515
The Intel RealSense L515 (Figure 2) is a state-of-the-art camera, featuring a FullHD-capable RGB camera and an HD depth sensor, all encapsulated in a 61 mm-diameter disc with a height of 26 mm, and is thus significantly more compact than the Kinect sensor (Table 1).
Not only is its range larger – beginning at 0.25 m and extending up to 9 m (for conditions characterized by 95% reflectivity) or up to 3.9 m (15% reflectivity) – but its FOV is also greater than that of the Kinect – 70°x55° versus 57°x43°, respectively – which inevitably leads to a larger working volume.
Test Setup and Procedure
The VDI/VDE 2634 Part 2 standard (4) was used to characterize both sensors in the 500-1000 mm range. It cannot be applied at closer distances because the minimum working volume required by the standard would no longer be available, as this volume decreases when the target object is set closer to the sensor.
The norm defines fundamental parameters for the different tests, namely: the probing error, P, which characterizes the error of the system over a small part of the working volume; the sphere spacing error, SS, the ability of the system to accurately determine lengths; and the flatness measurement error, F, the range of the deviations of the measured points from the best-fit plane. Accordingly, to assess these errors and carry out the measurements, a sphere, a ball bar and a flat object are used. The dimensions of each of these objects are calibrated beforehand, guaranteeing an accuracy of within 0.1 mm.
The standard introduces a fundamental parameter to define the characteristic dimensions of these objects: the length of the diagonal, L0, of the sensor’s field-of-view volume – a truncated pyramid with its vertex at the sensor origin, extending from 500 to 1000 mm away from it (4).
The results are considered for the characterization only if the error lies inside the tolerances defined by the manufacturers.
MATLAB was used to process all the point cloud information retrieved from both devices; it was also used to read data directly from the Kinect device, for which suitable drivers exist. However, due to the novelty of the L515 sensor (released in late April 2020), at the time of writing there were no drivers available to interface with the Intel sensor from MATLAB, hence the need to obtain the depth information using other software, in this case RTAB-Map for Windows.
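For reference, reading a point cloud from the Kinect v1 in MATLAB can be done through the Image Acquisition Toolbox support package for the Kinect sensor; the snippet below is only a minimal sketch of that route (the device index and toolbox availability are assumptions, not details reported above).

% Acquire one depth frame from the Kinect v1 and convert it to a point cloud
% (requires the Image Acquisition Toolbox support package for Kinect).
depthDevice = imaq.VideoDevice('kinect', 2);          % device 2 is the depth stream
depthImage  = step(depthDevice);                      % grab a single depth frame
ptCloud     = pcfromkinect(depthDevice, depthImage);  % organized point cloud (metres)
release(depthDevice);
pcshow(ptCloud);                                      % quick visual check of the acquisition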
A RANdom SAmple Consensus (RANSAC) algorithm was used to determine the best-fitting sphere/plane for the collected point clouds. This type of algorithm detects outliers in the data and gives them no influence on the overall fit, contrary to what happens with a least-squares fit (14).
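A minimal sketch of this fitting step, assuming MATLAB's Computer Vision Toolbox is used (its pcfitsphere/pcfitplane routines implement the MSAC variant of RANSAC; file names and the inlier threshold below are illustrative, and the actual implementation of (14) may differ):

% Fit a sphere and a plane to captured point clouds, ignoring outliers.
maxDist  = 2;                                   % inlier threshold, in the cloud's units (illustrative)
ptSphere = pcread('sphere_scan.ply');           % hypothetical exported scan of the calibrated sphere
[sphModel, sphInliers] = pcfitsphere(ptSphere, maxDist);
fittedRadius = sphModel.Radius;                 % compare against the calibrated radius

ptPlane = pcread('plane_scan.ply');             % hypothetical exported scan of the flat object
[plnModel, plnInliers] = pcfitplane(ptPlane, maxDist);
% Flatness error: spread of the signed inlier distances to the best-fit plane.
inlierPts = select(ptPlane, plnInliers);
d = inlierPts.Location * plnModel.Normal' + plnModel.Parameters(4);
flatnessError = max(d) - min(d);

Unlike a plain least-squares fit, points flagged as outliers by the sample-consensus step have no influence on the final sphere or plane parameters, matching the behaviour described above.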
First, the characteristic dimension of the working volume of each sensor was determined. The Kinect v1 has a FOV of 57°x43° and the Intel L515 of 70°x55°, which translate into L0 values of 1202.5 mm and 1504 mm, respectively.
The sphere – used for the probing error characterization – must have a diameter between 0.1 and 0.2 times L0. The distance between the centres of the two spheres of the ball bar (sphere spacing error) must be no less than 0.3 times L0. Finally, the plane (flatness error) must not be shorter than 0.5 times L0. These calculations and the dimensions of the chosen objects (Figure 3) are presented in the following table.
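Independently of the table, these limits follow directly from the L0 values determined above; a short sketch of the calculation (a convenience check, not part of the original procedure):

% Artefact dimension limits prescribed by VDI/VDE 2634 Part 2, as a function of L0.
L0 = [1202.5, 1504];                 % mm, Kinect v1 and Intel L515 respectively
sphereDiameter = [0.1; 0.2] .* L0;   % probing error: sphere diameter between 0.1*L0 and 0.2*L0
minBarLength   = 0.3 .* L0;          % sphere spacing error: centre-to-centre distance >= 0.3*L0
minPlaneLength = 0.5 .* L0;          % flatness error: plane length >= 0.5*L0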
As stated in the previous section, the following errors (4) need to be evaluated:
1. The form probing error (PF) evaluates the best-fit sphere approximation (obtained with the least-squares method), comparing the radial distances of the measured surface points of the ith sphere from the centre of the compensating element, i.e. $P_{F,i} = R_{i,max} - R_{i,min}$, where $R_{i,max}$ and $R_{i,min}$ are the maximum and minimum distances of the measured points from the centre of the best-fit sphere.