Perception
Deep learning approach - SSD
Single Shot Detector (SSD) is a single-stage object detector which achieves real-time performance. It is based on a convolutional neural network (CNN) architecture.
As opposed to region-based methods, which first generate region proposals and then classify the object in each proposal, SSD detects multiple objects in a single forward pass. The higher speed does not come at the cost of reduced accuracy: SSD, with 76.9% mAP at 22 FPS, outperforms Faster R-CNN (73.2% mAP at 7 FPS) and YOLOv1 (63.4% mAP at 45 FPS).
MobileNet V1 is used as the base network of the SSD. Additional convolutional layers which progressively decrease in size are added to the end of the base network.
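Below is a minimal sketch of this idea, not the exact graph built by the TF Object Detection API; the filter sizes and the truncation point of the base network are illustrative assumptions.

```python
# Minimal sketch (assumed layer sizes): a MobileNet V1 base followed by extra
# convolutional layers whose spatial resolution shrinks progressively.
# Each returned feature map would feed the SSD box/class prediction heads.
import tensorflow as tf

def build_ssd_feature_maps(input_shape=(300, 300, 3)):
    base = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=None)

    x = base.output                      # ~10x10 feature map for a 300x300 input
    feature_maps = [x]

    # Extra feature layers: each block halves the spatial resolution.
    for filters in (512, 256, 256, 128):
        x = tf.keras.layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
        feature_maps.append(x)

    return tf.keras.Model(inputs=base.input, outputs=feature_maps)
```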
Training details
TensorFlow Object Detection API + NVIDIA CUDA Toolkit
Model: ssd_mobilenet_v1_coco
Transfer learning: the model is pre-trained on the COCO dataset
the assumption is that the lower layers (the ones closer to the input) have learned general features (lines, edges, ...) which are not specific to the COCO dataset (a small illustration follows this list)
Dataset:
synthetic dataset procedurally generated in Blender (540 images)
manually labeled dataset of real images (43 images + augmentation)
mainly trained on red bell peppers
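The snippet below only illustrates the transfer-learning assumption from the list above (keep the lower, generic layers fixed and adapt the rest); the actual training was done with the TF Object Detection API starting from the pre-trained ssd_mobilenet_v1_coco checkpoint, and the single-class head here is purely hypothetical.

```python
# Illustration of the transfer-learning assumption, not the actual training run.
# Lower layers (assumed to hold generic features: lines, edges, ...) are frozen;
# only the layers closer to the output are updated on the pepper data.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Freeze roughly the first half of the base network.
for layer in base.layers[: len(base.layers) // 2]:
    layer.trainable = False

# Hypothetical single-class head standing in for the SSD prediction heads.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # pepper / not pepper
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```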
good detection
bad detection
3D detection pipeline
Detection
inference on RGB image from RealSense D435 as an input (rate: ~15Hz)
extract bboxes of the peppers
filter the organized point cloud with the bbox to get the centroid of each pepper (some centroids are NaN! - they are published anyway, just so we know that the object is present)
match peppers detected in multiple frames using the distances between their centroids
choose the closest one among those visible in at least 2 frames
estimate the orientation and size of the pepper so we can position for grasping (a sketch of these steps follows this list)
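A rough sketch of the centroid extraction and frame-to-frame matching is given below; the bbox format, the layout of the organized point cloud (HxWx3, aligned with the RGB image, NaN where depth is missing) and the distance threshold are assumptions, not the actual node code.

```python
import numpy as np

def pepper_centroid(cloud_xyz, bbox):
    """Mean 3D point inside a bbox of an organized point cloud (NaN-aware)."""
    x_min, y_min, x_max, y_max = bbox
    roi = cloud_xyz[y_min:y_max, x_min:x_max, :].reshape(-1, 3)
    roi = roi[~np.isnan(roi).any(axis=1)]
    if roi.size == 0:
        # Publish a NaN centroid anyway, just so we know the pepper is present.
        return np.full(3, np.nan)
    return roi.mean(axis=0)

def match_detections(prev_centroids, new_centroids, max_dist=0.05):
    """Greedy matching of peppers between frames by centroid distance (metres).

    max_dist is an assumed threshold; returned pairs are (prev_index, new_index).
    """
    matches = []
    for i, c_new in enumerate(new_centroids):
        dists = [np.linalg.norm(c_new - c_prev) for c_prev in prev_centroids]
        if dists and min(dists) < max_dist:
            matches.append((int(np.argmin(dists)), i))
    return matches
```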
Estimation
v1: single point cloud - failed because the surface of the pepper is reflective, so there was no data in the pc for some parts of the pepper
v2: estimation in 2D using the RGB image - failed because the surface of the pepper is reflective and the servo does not reach the expected perpendicular orientation
v3: estimation in 3D using multiple point clouds
Servo
keep track of the closest pepper through the frames
if the distance between the centroids of the closest pepper in two consecutive frames is above a threshold, the servo is aborted
if the object is visible but its centroid is NaN (which happens when the distance between the camera and the object is under 10-15 cm), the object position is set to the mean of all previous measurements [TODO: analysis] (a sketch of this logic follows this list)
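Below is a rough sketch of this servo-side bookkeeping; the class interface and the threshold value are assumptions, only the abort and fallback logic follows the description above.

```python
import numpy as np

class ServoTracker:
    """Tracks the closest pepper across servo frames (assumed interface)."""

    def __init__(self, jump_threshold=0.05):
        self.jump_threshold = jump_threshold   # metres, assumed value
        self.history = []                      # valid centroids seen so far

    def update(self, centroid):
        """Return the position to servo to, or None if the servo should abort."""
        centroid = np.asarray(centroid, dtype=float)
        if np.isnan(centroid).any():
            # Object visible but centroid NaN (camera closer than ~10-15 cm):
            # fall back to the mean of all previous measurements.
            return np.mean(self.history, axis=0) if self.history else None
        if self.history and np.linalg.norm(centroid - self.history[-1]) > self.jump_threshold:
            return None                        # abort: target jumped between frames
        self.history.append(centroid)
        return centroid
```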
6DOF PEPPER POSE ESTIMATION
input:
bboxes
point cloud (PC)
camera-world transform
in several frames
the organized PC is in the camera frame, so the following transformations can be applied:
a subcloud which contains the pepper is extracted from the PC based on the bbox of the pepper
points are filtered along the local z axis --> only points within 5 cm of the pepper centroid along the z axis are kept
the filtered PC is transformed to the global frame
the iterative closest point (ICP) algorithm is used to incrementally register the series of point clouds two by two (see the sketch after this list)
a cylinder is segmented from the final model (the length of such a cylinder is infinite; pepper rotation = cylinder rotation, pepper radius = cylinder radius)
remaining points (the points inside the cylinder) are projected onto a line (the axis of the cylinder) in order to calculate the length of the pepper
pepper rotation, length and radius are returned to the control node
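A sketch of the registration and length computation is given below, assuming Open3D for ICP; the cylinder segmentation itself (for example a RANSAC cylinder model as in PCL) is assumed to have already produced the axis point, axis direction and radius, and the voxel size and correspondence distance are placeholder values.

```python
import numpy as np
import open3d as o3d

def register_clouds(clouds, voxel=0.005, max_corr_dist=0.02):
    """Incrementally register a list of Open3D point clouds two by two with ICP."""
    model = clouds[0].voxel_down_sample(voxel)
    for cloud in clouds[1:]:
        cloud = cloud.voxel_down_sample(voxel)
        reg = o3d.pipelines.registration.registration_icp(
            cloud, model, max_corr_dist, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        cloud.transform(reg.transformation)    # align the new cloud to the model
        model += cloud
    return model

def pepper_length(points, axis_point, axis_dir):
    """Project the cylinder inlier points onto the (infinite) cylinder axis."""
    axis_dir = axis_dir / np.linalg.norm(axis_dir)
    t = (points - axis_point) @ axis_dir       # signed position along the axis
    return float(t.max() - t.min())            # pepper length estimate
```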
TODO: check how this works from greater distance
TODO: check how it works with occlusions
TODO: also use cylinder width as grasp input information
ready to grasp! (based on the 3D estimation)
Figures: ICP registration, fitting a cylinder, segmentation and projection, final visualization
Analysis
Basic cases - without occlusions
Detected object position - search
max distance between the points: 1.98 cm
max distance between the points: 3.08 cm
max distance between the points: 2.9 cm
first 3 detections are from 'search', the others are from 'servo'
in general, the position detected in 'servo' is the same as the position detected in the 'search' state from which the trajectory is planned
Detected object position over time - servo
only detections are plotted (steps in which we used estimation based on previous measurements are ignored)
the number of detections differs between trials because:
we varied the goal distance in the trajectory planning step
the number of steps in which we estimated the object position (because there was a NaN in the pc for the pepper centroid) was different for different objects
when there are no occlusions, difference in detected object position during whole servo is under 1cm
in this case the output of the detection was almost binary
result of visualizing bbox of goal object for both cases can be seen in figures below
leaf was detected as pepper ---> TODO: consider lowering threshold distance between the objects in consecutive frames after analysing the cases with occlusions
130 detections because we opted for the leaf when the pepper centroid was NaN, instead of estimating based on previous detections
Pointcloud - reflections and NaNs
1) Object distance > RealSense threshold*
*RealSense threshold is around 15cm, there is no data in pc for objects with z < 15cm
pointcloud from 'search'
more representative than the ones from 'servo'
TODO: use only pc from 'search' for pepper pose estimation?
the closer we are, the less of the pepper is visible on pc
reflective surface of the pepper might be causing problems (for example, on the first set of images below, camera-leaf distance is approx. the same as camera-pepper distance, but there are no missing points for leaf during the whole servo)
2) Object distance < RealSense threshold
only NaNs in pointcloud
probably hardware issue
Estimation - 2D
use bbox of the object to crop RGB image
cv2 transformations to extract the pepper
find contours on the resulting image
fit ellipse to the biggest contour
angle and length of the pepper = angle and length of the major axis of the ellipse (a rough sketch of these steps follows this list)
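A rough cv2 sketch of these steps is shown below; the HSV thresholds used to extract the red pepper and the bbox format are assumed values, not the ones used in the experiments.

```python
import cv2
import numpy as np

def estimate_2d(image_bgr, bbox):
    """Crop the bbox, extract the pepper, fit an ellipse to the biggest contour."""
    x_min, y_min, x_max, y_max = bbox
    crop = image_bgr[y_min:y_max, x_min:x_max]

    # Extract the (red) pepper with a simple HSV threshold - assumed ranges.
    hsv = cv2.cvtColor(crop, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 80, 50), (12, 255, 255)) | \
           cv2.inRange(hsv, (170, 80, 50), (180, 255, 255))

    # OpenCV 4 return signature (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    biggest = max(contours, key=cv2.contourArea)
    if len(biggest) < 5:                       # fitEllipse needs at least 5 points
        return None

    (cx, cy), axes, angle = cv2.fitEllipse(biggest)
    length = max(axes)                         # major axis length in pixels
    return angle, length
```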
NOT GOOD ENOUGH!
the assumption that the pitch would gradually converge during the servo failed :(