Perception
Deep learning approach - SSD
Single Shot Detector (SSD) is a single-stage object detector which achieves real-time performance. It is based on a convolutional neural network (CNN) architecture.
As opposed to region-based methods, which first generate region proposals and then classify the object in each proposal, SSD detects multiple objects in a single forward pass. The higher speed does not come at the cost of reduced accuracy: SSD, with 76.9% mAP at 22 FPS, outperforms Faster R-CNN (73.2% mAP at 7 FPS) and YOLOv1 (63.4% mAP at 45 FPS).
MobileNet V1 is used as the base network of the SSD. Additional convolutional layers which progressively decrease in size are added to the end of the base network.
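Below is a minimal sketch of this idea, not the exact graph built by the TF Object Detection API; the filter sizes and the truncation point of the base network are illustrative assumptions.

```python
# Minimal sketch (assumed layer sizes): a MobileNet V1 base followed by extra
# convolutional layers whose spatial resolution shrinks progressively.
# Each returned feature map would feed the SSD box/class prediction heads.
import tensorflow as tf

def build_ssd_feature_maps(input_shape=(300, 300, 3)):
    base = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights=None)

    x = base.output                      # ~10x10 feature map for a 300x300 input
    feature_maps = [x]

    # Extra feature layers: each block halves the spatial resolution.
    for filters in (512, 256, 256, 128):
        x = tf.keras.layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
        feature_maps.append(x)

    return tf.keras.Model(inputs=base.input, outputs=feature_maps)
```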
Training details
TensorFlow Object Detection API + NVIDIA CUDA Toolkit
Model: ssd_mobilenet_v1_coco
Transfer learning: the model is pre-trained on the COCO dataset
the assumption is that the lower layers (the ones closer to the input) have learned general features (lines, edges, ...) which are not specific to the COCO dataset (a small illustration follows this list)
Dataset:
synthetic dataset procedurally generated in Blender (540 images)
manually labeled dataset of real images (43 images + augmentation)
mainly trained on red bell peppers
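The snippet below only illustrates the transfer-learning assumption from the list above (keep the lower, generic layers fixed and adapt the rest); the actual training was done with the TF Object Detection API starting from the pre-trained ssd_mobilenet_v1_coco checkpoint, and the single-class head here is purely hypothetical.

```python
# Illustration of the transfer-learning assumption, not the actual training run.
# Lower layers (assumed to hold generic features: lines, edges, ...) are frozen;
# only the layers closer to the output are updated on the pepper data.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Freeze roughly the first half of the base network.
for layer in base.layers[: len(base.layers) // 2]:
    layer.trainable = False

# Hypothetical single-class head standing in for the SSD prediction heads.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # pepper / not pepper
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```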
good detection
bad detection
3D detection pipeline
Detection
inference on RGB image from RealSense D435 as an input (rate: ~15Hz)
extract bboxes of the peppers
filter the organized point cloud with the bbox to get the centroid of each pepper (some centroids are NaN! - they are published anyway, just so we know that the object is present)
match peppers detected in multiple frames using the distances between their centroids
choose the closest one among those visible in at least 2 frames
estimate the orientation and size of the pepper so we can position for grasping (a sketch of these steps follows this list)
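A rough sketch of the centroid extraction and frame-to-frame matching is given below; the bbox format, the layout of the organized point cloud (HxWx3, aligned with the RGB image, NaN where depth is missing) and the distance threshold are assumptions, not the actual node code.

```python
import numpy as np

def pepper_centroid(cloud_xyz, bbox):
    """Mean 3D point inside a bbox of an organized point cloud (NaN-aware)."""
    x_min, y_min, x_max, y_max = bbox
    roi = cloud_xyz[y_min:y_max, x_min:x_max, :].reshape(-1, 3)
    roi = roi[~np.isnan(roi).any(axis=1)]
    if roi.size == 0:
        # Publish a NaN centroid anyway, just so we know the pepper is present.
        return np.full(3, np.nan)
    return roi.mean(axis=0)

def match_detections(prev_centroids, new_centroids, max_dist=0.05):
    """Greedy matching of peppers between frames by centroid distance (metres).

    max_dist is an assumed threshold; returned pairs are (prev_index, new_index).
    """
    matches = []
    for i, c_new in enumerate(new_centroids):
        dists = [np.linalg.norm(c_new - c_prev) for c_prev in prev_centroids]
        if dists and min(dists) < max_dist:
            matches.append((int(np.argmin(dists)), i))
    return matches
```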
Estimation
v1: single point cloud - failed because the surface of the pepper is reflective, so there was no data in the pc for some parts of the pepper
v2: estimation in 2D using the RGB image - failed because the surface of the pepper is reflective and the servo does not reach the expected perpendicular orientation
v3: estimation in 3D using multiple point clouds
Servo
keep track of the closest pepper through the frames
if the distance between the centroids of the closest pepper in two consecutive frames is above a threshold, the servo is aborted
if the object is visible but its centroid is NaN (which happens when the distance between the camera and the object is under 10-15 cm), the object position is set to the mean of all previous measurements [TODO: analysis] (a sketch of this logic follows this list)
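Below is a rough sketch of this servo-side bookkeeping; the class interface and the threshold value are assumptions, only the abort and fallback logic follows the description above.

```python
import numpy as np

class ServoTracker:
    """Tracks the closest pepper across servo frames (assumed interface)."""

    def __init__(self, jump_threshold=0.05):
        self.jump_threshold = jump_threshold   # metres, assumed value
        self.history = []                      # valid centroids seen so far

    def update(self, centroid):
        """Return the position to servo to, or None if the servo should abort."""
        centroid = np.asarray(centroid, dtype=float)
        if np.isnan(centroid).any():
            # Object visible but centroid NaN (camera closer than ~10-15 cm):
            # fall back to the mean of all previous measurements.
            return np.mean(self.history, axis=0) if self.history else None
        if self.history and np.linalg.norm(centroid - self.history[-1]) > self.jump_threshold:
            return None                        # abort: target jumped between frames
        self.history.append(centroid)
        return centroid
```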
6DOF PEPPER POSE ESTIMATION
input:
bboxes
point cloud (PC)
camera-world transform
in several frames
the organized PC is in the camera frame, so the following transformations can be applied:
a subcloud which contains the pepper is extracted from the PC based on the bbox of the pepper
points are filtered along the local z axis --> only points within 5 cm of the pepper centroid along the z axis are kept
the filtered PC is transformed to the global frame
the iterative closest point (ICP) algorithm is used to incrementally register the series of point clouds two by two (see the sketch after this list)
a cylinder is segmented from the final model (the length of such a cylinder is infinite; pepper rotation = cylinder rotation, pepper radius = cylinder radius)
remaining points (the points inside the cylinder) are projected onto a line (the axis of the cylinder) in order to calculate the length of the pepper
pepper rotation, length and radius are returned to the control node
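A sketch of the registration and length computation is given below, assuming Open3D for ICP; the cylinder segmentation itself (for example a RANSAC cylinder model as in PCL) is assumed to have already produced the axis point, axis direction and radius, and the voxel size and correspondence distance are placeholder values.

```python
import numpy as np
import open3d as o3d

def register_clouds(clouds, voxel=0.005, max_corr_dist=0.02):
    """Incrementally register a list of Open3D point clouds two by two with ICP."""
    model = clouds[0].voxel_down_sample(voxel)
    for cloud in clouds[1:]:
        cloud = cloud.voxel_down_sample(voxel)
        reg = o3d.pipelines.registration.registration_icp(
            cloud, model, max_corr_dist, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        cloud.transform(reg.transformation)    # align the new cloud to the model
        model += cloud
    return model

def pepper_length(points, axis_point, axis_dir):
    """Project the cylinder inlier points onto the (infinite) cylinder axis."""
    axis_dir = axis_dir / np.linalg.norm(axis_dir)
    t = (points - axis_point) @ axis_dir       # signed position along the axis
    return float(t.max() - t.min())            # pepper length estimate
```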
TODO: check how this works from greater distance
TODO: check how it works with occlusions
TODO: also use cylinder width as grasp input information
ready to grasp! (based on the 3D estimation)
Figures: ICP registration, fitting a cylinder, segmentation and projection, final visualization
Analysis
Basic cases - without occlusions
Detected object position - search
max distance between the points: 1.98 cm
max distance between the points: 3.08 cm
max distance between the points: 2.9 cm
first 3 detections are from 'search', the others are from 'servo'
in general, the position detected in 'servo' is the same as the position detected in the 'search' state from which the trajectory is planned
Detected object position over time - servo
only detections are plotted (steps in which we used estimation based on previous measurements are ignored)
the number of detections differs between trials because:
we varied the goal distance in the trajectory planning step
the number of steps in which we estimated the object position (because there was a NaN in the pc for the pepper centroid) was different for different objects
when there are no occlusions, difference in detected object position during whole servo is under 1cm
in this case the output of the detection was almost binary
result of visualizing bbox of goal object for both cases can be seen in figures below
leaf was detected as pepper ---> TODO: consider lowering threshold distance between the objects in consecutive frames after analysing the cases with occlusions
130 detections because we opted for the leaf when the pepper centroid was NaN, instead of estimating based on previous detections
Pointcloud - reflections and NaNs
1) Object distance > RealSense threshold*
*RealSense threshold is around 15cm, there is no data in pc for objects with z < 15cm
pointcloud from 'search'
more representative than the ones from 'servo'
TODO: use only pc from 'search' for pepper pose estimation?
the closer we are, the less of the pepper is visible on pc
reflective surface of the pepper might be causing problems (for example, on the first set of images below, camera-leaf distance is approx. the same as camera-pepper distance, but there are no missing points for leaf during the whole servo)
2) Object distance < RealSense threshold
only NaNs in pointcloud
probably hardware issue
Estimation - 2D
use bbox of the object to crop RGB image
cv2 transformations to extract the pepper
find contours on the resulting image
fit ellipse to the biggest contour
angle and length of the pepper = angle and length of the major axis of the ellipse (a rough sketch of these steps follows this list)
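A rough cv2 sketch of these steps is shown below; the HSV thresholds used to extract the red pepper and the bbox format are assumed values, not the ones used in the experiments.

```python
import cv2
import numpy as np

def estimate_2d(image_bgr, bbox):
    """Crop the bbox, extract the pepper, fit an ellipse to the biggest contour."""
    x_min, y_min, x_max, y_max = bbox
    crop = image_bgr[y_min:y_max, x_min:x_max]

    # Extract the (red) pepper with a simple HSV threshold - assumed ranges.
    hsv = cv2.cvtColor(crop, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 80, 50), (12, 255, 255)) | \
           cv2.inRange(hsv, (170, 80, 50), (180, 255, 255))

    # OpenCV 4 return signature (contours, hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    biggest = max(contours, key=cv2.contourArea)
    if len(biggest) < 5:                       # fitEllipse needs at least 5 points
        return None

    (cx, cy), axes, angle = cv2.fitEllipse(biggest)
    length = max(axes)                         # major axis length in pixels
    return angle, length
```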
NOT GOOD ENOUGH!
the assumption that the pitch would gradually converge during the servo failed :(