Appendix

Simulated Training Data Generation

Environment objects: At the beginning of each 15-second episode, 8 objects are spawned into the environment, each with box or cylinder geometry chosen with equal probability. We exclude spherical geometries because they are rare in household and kitchen settings, although we do include them among the possible grasped object geometries.

The spawn position of each object is sampled uniformly from a box above the table (x: [0.2 m, 0.8 m], y: [-0.38 m, 0.38 m], z: [0.1 m, 0.4 m]), and its Euler angles are sampled uniformly between ±π. The dimensions of each primitive geometry (height, width, and length for a box; diameter and length for a cylinder; diameter for a sphere) are sampled uniformly between 0.02 m and 0.3 m, mass is sampled uniformly between 0.05 kg and 2 kg, and friction is sampled uniformly between 0 and 1.
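The environment-object sampling above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the helper names (`sample_object_params`, `sample_environment_objects`) are hypothetical, and it assumes each size parameter and each Euler angle is drawn independently.

```python
import numpy as np

# Ranges stated in the text (meters, kilograms, unitless friction).
SPAWN_LOW = np.array([0.2, -0.38, 0.1])
SPAWN_HIGH = np.array([0.8, 0.38, 0.4])
DIM_RANGE = (0.02, 0.3)
MASS_RANGE = (0.05, 2.0)
FRICTION_RANGE = (0.0, 1.0)

# Size parameters per primitive: box (h, w, l), cylinder (d, l), sphere (d).
NUM_DIMS = {"box": 3, "cylinder": 2, "sphere": 1}


def sample_object_params(rng, geometry):
    """Hypothetical helper: dimensions, mass, and friction for one primitive."""
    return {
        "geometry": geometry,
        # Assumption: every size parameter is drawn independently.
        "dims": rng.uniform(*DIM_RANGE, size=NUM_DIMS[geometry]),
        "mass": rng.uniform(*MASS_RANGE),
        "friction": rng.uniform(*FRICTION_RANGE),
    }


def sample_environment_objects(rng, num_objects=8):
    """Hypothetical helper: spawn pose plus physical parameters for each object."""
    objects = []
    for _ in range(num_objects):
        # Environment objects are boxes or cylinders with equal probability (no spheres).
        geometry = str(rng.choice(["box", "cylinder"]))
        obj = sample_object_params(rng, geometry)
        obj["position"] = rng.uniform(SPAWN_LOW, SPAWN_HIGH)  # uniform in the spawn box
        obj["euler"] = rng.uniform(-np.pi, np.pi, size=3)     # each angle in [-pi, pi)
        objects.append(obj)
    return objects
```

Usage: `sample_environment_objects(np.random.default_rng(0))` returns eight dictionaries, one per object, ready to be handed to whatever simulator API spawns them.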

Grasped object: The geometry, dimensions, mass, and friction of the grasped object are sampled with the same procedure as above, but its position is sampled uniformly within a cylinder defined in the end-effector frame, with radius 0.001 m and end-points at -0.007 m and 0.001 m along the z-axis. Its orientation is sampled uniformly within a cone about the z-axis of the end-effector frame with an aperture angle of 0.7π.
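A possible sketch of the grasped-object pose sampling, under stated assumptions: "aperture angle" is read as the full apex angle of the cone (half-angle 0.35π), "uniformly in a cone" is read as the object's reference axis being uniform over solid angle within that cone, and the roll about that axis, which the text does not specify, is left out. Function names are hypothetical.

```python
import numpy as np

RADIUS = 0.001                # cylinder radius in the end-effector frame (m)
Z_MIN, Z_MAX = -0.007, 0.001  # cylinder end-points along the end-effector z-axis (m)
APERTURE = 0.7 * np.pi        # assumed full apex angle of the cone (rad)


def sample_grasp_position(rng):
    """Uniform sample inside the cylinder, expressed in the end-effector frame."""
    r = RADIUS * np.sqrt(rng.uniform())  # sqrt keeps the density uniform over the disk area
    phi = rng.uniform(0.0, 2.0 * np.pi)
    z = rng.uniform(Z_MIN, Z_MAX)
    return np.array([r * np.cos(phi), r * np.sin(phi), z])


def sample_grasp_axis(rng):
    """Unit vector uniform (by solid angle) within the cone about the end-effector z-axis."""
    half_angle = APERTURE / 2.0
    # A uniform cos(polar angle) on [cos(half_angle), 1] yields a uniform spherical cap.
    cos_t = rng.uniform(np.cos(half_angle), 1.0)
    sin_t = np.sqrt(1.0 - cos_t**2)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t])
```

Sampling the cosine of the polar angle rather than the angle itself avoids clustering samples near the cone's axis.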

Robot policy: Desired delta end-effector positions are sampled at 50 Hz from an Ornstein-Uhlenbeck (OU) process (general form below) and tracked by an impedance controller. The desired orientation is held constant and aligned with the world frame, but we use very low orientation stiffness in the impedance controller to encourage diverse orientations during contact.

The OU process has the general form

$$dx_t = \theta(\mu - x_t)\,dt + \sigma\,dW_t$$

The first term is deterministic and pulls the process back toward a constant μ (referred to as the "drift") with linear gain θ, while the second term is a stochastic Wiener process W_t scaled by σ, so the increment variance over a step dt is σ²dt.
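To make the general form concrete, here is a minimal Euler-Maruyama discretization at the 50 Hz rate from the text. It is an illustrative sketch, not the authors' controller code: `ou_step` and `simulate_ou` are hypothetical names, and the OU state is assumed to be the desired end-effector position, whose per-step increments are the desired deltas.

```python
import numpy as np

DT = 1.0 / 50.0  # 50 Hz sampling period (s)


def ou_step(x, mu, theta, sigma, dt, rng):
    """One Euler-Maruyama step of dx = theta * (mu - x) dt + sigma dW."""
    x = np.asarray(x, dtype=float)
    drift = theta * (mu - x) * dt                                 # deterministic pull toward mu
    diffusion = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)  # Wiener increment
    return x + drift + diffusion


def simulate_ou(x0, mu, theta, sigma, duration, dt=DT, seed=0):
    """Roll out an OU trajectory; returns an array of shape (steps + 1, dim)."""
    rng = np.random.default_rng(seed)
    steps = int(round(duration / dt))
    xs = [np.atleast_1d(np.asarray(x0, dtype=float))]
    for _ in range(steps):
        xs.append(ou_step(xs[-1], mu, theta, sigma, dt, rng))
    return np.stack(xs)
```

With σ = 0 the update reduces to x ← x + θ(μ − x)dt, a geometric decay toward μ, which is a convenient sanity check for the drift term.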

The desired xy trajectory is sampled independently of the desired z trajectory and with different parameters: we want motion in the xy-plane to be as diverse as possible, while keeping the end-effector close enough to the table to make frequent contact with the environment objects and the table surface.

For the xy process, we define four halfspaces in polar coordinates that contribute to the drift term of the OU process and keep the desired trajectory within a polar rectangle around the robot's workspace. Each term becomes active only when the end-effector leaves the corresponding halfspace; while the end-effector is inside all four halfspaces, the xy process reduces to a pure Wiener process. The polar rectangle spans radii between 0.35 m and 0.7 m and angles between ±2.2 rad. We set θ = 20 for both x and y, and use a diagonal variance matrix with both diagonal entries equal to 0.2².
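One way to realize the halfspace drift is sketched below. The exact form is not given in the text, so this is a hypothetical formulation: each violated bound contributes θ times the displacement back to that bound, along the radial direction for the radius limits and along the tangential direction (as an arc length) for the angle limits. The polar origin is assumed to be the robot base, angles follow `arctan2(y, x)`, and the drift is evaluated on the desired xy position.

```python
import numpy as np

R_MIN, R_MAX = 0.35, 0.7      # radial bounds of the polar rectangle (m)
ANG_MIN, ANG_MAX = -2.2, 2.2  # angular bounds (rad)
THETA_XY = 20.0               # same linear gain for x and y
SIGMA_XY = 0.2                # noise scale, giving a diagonal variance of 0.2**2


def halfspace_drift(xy):
    """Zero inside the polar rectangle; linear pull back toward each violated bound."""
    x, y = xy
    r = np.hypot(x, y)
    ang = np.arctan2(y, x)
    e_r = np.array([np.cos(ang), np.sin(ang)])   # radial unit vector
    e_t = np.array([-np.sin(ang), np.cos(ang)])  # tangential unit vector
    disp = np.zeros(2)
    if r < R_MIN:
        disp += (R_MIN - r) * e_r
    elif r > R_MAX:
        disp += (R_MAX - r) * e_r
    if ang < ANG_MIN:
        disp += r * (ANG_MIN - ang) * e_t  # arc length back to the angular bound
    elif ang > ANG_MAX:
        disp += r * (ANG_MAX - ang) * e_t
    return THETA_XY * disp


def xy_step(xy, dt, rng):
    """Euler-Maruyama step; a pure Wiener step whenever the drift is zero."""
    noise = SIGMA_XY * np.sqrt(dt) * rng.standard_normal(2)
    return np.asarray(xy, dtype=float) + halfspace_drift(xy) * dt + noise
```

Because the drift is identically zero inside the rectangle, the trajectory explores freely there and is only corrected near the workspace limits, matching the behavior described above.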

For the z process, we set the drift target μ to 0.2 m above the table surface, θ = 1, and a variance of 0.05².
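As a rough check on how far the desired height wanders, assume "variance 0.05²" means σ_z² for the diffusion term. The standard stationary statistics of an OU process then give:

```latex
\mathrm{Var}_\infty[z] = \frac{\sigma_z^2}{2\theta_z} = \frac{0.05^2}{2 \cdot 1} = 1.25\times 10^{-3}\ \mathrm{m}^2,
\qquad
\mathrm{std}_\infty[z] \approx 0.035\ \mathrm{m},
\qquad
\tau_z = \frac{1}{\theta_z} = 1\ \mathrm{s}.
```

Under this reading, the desired height typically stays within a few centimeters of the 0.2 m target and relaxes back to it on a time scale of about one second, short compared with a 15-second episode.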

Probability Map to Contact Location Algorithm 

Tabular Results

Performance metrics with standard error in tabular form for all ablations on both sim and real data. 

Average true positive distance is averaged over all true positive contact estimates and reported as Euclidean pixel distance. As a rough conversion, one pixel corresponds to about 3 mm at a nominal depth of 1.2 m for a 480x640 depth map.
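The metric and the rough unit conversion can be sketched as below. Under a pinhole camera model, one pixel spans about depth / focal length, so the stated 3 mm per pixel at 1.2 m implies an effective focal length of roughly 400 px. That focal length is inferred here, not stated in the text, and the function names are hypothetical.

```python
import numpy as np


def mean_tp_distance_px(tp_distances_px):
    """Average Euclidean pixel distance over all true positive contact estimates."""
    d = np.asarray(tp_distances_px, dtype=float)
    return float(d.mean()) if d.size else float("nan")


def px_to_m(dist_px, depth_m=1.2, focal_px=400.0):
    """Pinhole approximation: one pixel spans about depth_m / focal_px meters.

    focal_px=400 is an inferred assumption chosen so that 1 px ~ 3 mm at 1.2 m.
    """
    return dist_px * depth_m / focal_px
```

For example, an average true positive distance of 3 px converts to roughly 9 mm at the nominal depth.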

Average false positive count is the total number of false positive contact estimates divided by the total number of sampled image frames.
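A small sketch of how this per-frame average and its standard error could be computed. It assumes the standard error is taken across image frames using the sample standard deviation (ddof=1); the text does not say over which units the standard error is computed, and `avg_false_positive_count` is a hypothetical name.

```python
import numpy as np


def avg_false_positive_count(fp_per_frame):
    """Mean false positives per frame and its standard error across frames.

    Assumption: standard error = sample std (ddof=1) / sqrt(number of frames).
    """
    counts = np.asarray(fp_per_frame, dtype=float)
    mean = counts.sum() / counts.size  # total false positives / number of frames
    sem = counts.std(ddof=1) / np.sqrt(counts.size)
    return float(mean), float(sem)
```

For per-frame counts `[0, 1, 2, 1]`, this gives a mean of 1.0 and a standard error of about 0.41.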

Darker shades indicate better performance in that column. 

Tabular Results on Sim Data

Tabular Results on Real Data