Fundamental Challenges in Deep Learning for Stiff Contact Dynamics
Mihir Parmar*, Mathew Halm*, Michael Posa
Abstract
Frictional contact has been extensively studied as the core underlying behavior of legged locomotion and manipulation, and its nearly-discontinuous nature makes planning and control difficult even when an accurate model of the robot is available. Here, we present empirical evidence that learning an accurate model in the first place can be confounded by contact, as modern deep learning approaches are not designed to capture this non-smoothness. We isolate the effects of contact's non-smoothness by varying the mechanical stiffness of a compliant contact simulator. Even for a simple system, we find that stiffness alone dramatically degrades training processes, generalization, and data-efficiency. Our results raise serious questions about simulated testing environments which do not accurately reflect the stiffness of rigid robotic hardware. Significant additional investigation will be necessary to fully understand and mitigate these effects, and we suggest several avenues for future study.
Video
Illustrative Examples
When robots and their environment collide--such as when a running robot's foot hits the ground--a violent and nearly-instantaneous physical process occurs, resulting in large changes in velocity. The underlying material property driving this rapidity, mechanical stiffness, causes multiple forms of numerical stiffness in the equations of motion of these systems, which in turn makes them difficult to accurately model from noisy data.
Rolling Block
To illustrate this issue, we first consider the case of perfectly rigid bodies, which can be considered the limit as stiffness approaches infinity. Sensitivity to initial conditions and near-instantaneous impact of a 2D block on flat ground are shown above. Two trajectories begin from nearly identical initial conditions (left), where the block (blue) contacts the ground (yellow) at one corner; the center of mass lies to the left of the contact point in the upper trajectory and to the right in the lower one.
The first form of numerical stiffness induced by contact is sensitivity to initial conditions. We can see that after some time has elapsed (center), the state of the block differs drastically between the two trajectories, as one tips to the right while the other tips to the left.
The second form is near-discontinuity in time. In both trajectories, when the block impacts the ground (right), the angular velocity suddenly jumps from a large value to zero. Discontinuity in time is particularly challenging for learning from noisy data, as measurements of the velocities become extremely sensitive to the time that they are recorded (i.e. sensitivity to jitter).
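This sensitivity to jitter can be sketched numerically. Below is a minimal stand-in of our own (the gravity constant, impact time, and 1 ms noise scale are illustrative choices, not values from our experiments): the same timestamp noise that barely perturbs a velocity reading in free fall produces enormous spread when the reading straddles a rigid impact.

```python
import numpy as np

g = 9.81       # gravitational acceleration (assumed)
t_impact = 0.5 # time at which the falling mass hits rigid ground (s)

def velocity(t):
    """Velocity of the mass: -g*t in free fall, zero after the rigid impact."""
    return -g * t if t < t_impact else 0.0

rng = np.random.default_rng(0)
jitter = rng.normal(0.0, 1e-3, size=10_000)  # 1 ms timestamp noise

# Sample the velocity with jittered timestamps, far from and at the impact.
far_samples = np.array([velocity(0.25 + dt) for dt in jitter])
impact_samples = np.array([velocity(t_impact + dt) for dt in jitter])

# Far from the impact the jitter barely perturbs the reading; at the impact
# the reading flips between roughly -g*t_impact and 0, so its spread explodes.
print("std far from impact:", far_samples.std())
print("std at impact:", impact_samples.std())
```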
Bouncing Ball
We next consider a system where contact has finite stiffness k, and examine how the numerical issues induced by increasing k toward infinity (i.e. perfect rigidity) hamper the process of learning a model.
Above on the left (Fig. a), a point mass (blue) falls from an initial height z = 1 toward compliant ground (yellow), modeled as a spring-damper system. Given this initial position, the plots on the right (Fig. b) show how the initial velocity and contact stiffness determine the final velocity ("Ground Truth"). As we saw in the rolling block example above, very high stiffness can create sensitivity to initial conditions, as the k = 2500 plot has a near-discontinuity near an initial velocity of 3.9.
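A spring-damper contact model of this kind can be simulated in a few lines. The sketch below is our own illustration (the damping coefficient, timestep, and integration scheme are assumed, and may differ from the figure's exact parameters): a unit point mass drops onto ground that pushes back only while penetrated, and we record its rebound velocity for two stiffnesses.

```python
def final_velocity(v0, k, c=5.0, z0=1.0, g=9.81, dt=1e-4, t_max=2.0):
    """Drop a unit point mass onto spring-damper ground of stiffness k and
    return its velocity when it rebounds back to the surface."""
    z, v = z0, v0
    touched = False
    for _ in range(int(t_max / dt)):
        # Gravity always acts; the spring-damper pushes back only while z < 0.
        f = -g + ((-k * z - c * v) if z < 0.0 else 0.0)
        v += f * dt  # semi-implicit Euler integration
        z += v * dt
        if z < 0.0:
            touched = True
        elif touched:  # back above the surface: contact is over
            return v
    return v

# With fixed damping c, a stiffer spring yields a shorter, less dissipative
# impact, so the rebound velocity grows with k.
print("k = 100:", final_velocity(0.0, 100.0))
print("k = 2500:", final_velocity(0.0, 2500.0))
```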
For each of two stiffnesses k, we train 100 models on different sets of 20 training and 20 validation data points (grey dots), each chosen uniformly at random with additive Gaussian noise (variance 0.01). Each model is an MLP with two hidden layers of width 128, trained with a brute-force-optimized learning rate and weight decay. The average prediction of these models, with a one-standard-deviation window, is plotted in Fig. b (red). Learning performance is heavily degraded on the stiffer k = 2500 system: training loss, ground-truth mean squared error, and inter-model variance are 197%, 413%, and 309% higher than for k = 100.
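A stripped-down stand-in for this protocol is sketched below. The kinked target function, initialization, and plain full-batch gradient descent are our simplifications (the models above used tuned optimizers), but the ingredients match the text: 20 noisy samples, and an MLP with two hidden layers of width 128.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(v):
    # Stand-in stiff map with a sharp kink (not the true bouncing-ball map).
    return np.where(v < 3.9, -v, 0.8 * v)

# 20 noisy training points, Gaussian noise of variance 0.01 as in the text.
x = rng.uniform(3.0, 5.0, size=(20, 1))
y = target(x) + rng.normal(0.0, np.sqrt(0.01), size=x.shape)

# MLP with two hidden layers of width 128 and tanh activations.
sizes = [1, 128, 128, 1]
W = [rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)) for m, n in zip(sizes, sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h1 = np.tanh(x @ W[0] + b[0])
    h2 = np.tanh(h1 @ W[1] + b[1])
    return h1, h2, h2 @ W[2] + b[2]

def mse():
    return float(((forward(x)[2] - y) ** 2).mean())

loss0, lr = mse(), 1e-3
for _ in range(5000):  # plain full-batch gradient descent on the MSE
    h1, h2, pred = forward(x)
    e = (pred - y) / len(x)                # output-layer error signal
    d2 = (e @ W[2].T) * (1.0 - h2 ** 2)    # backprop through the tanh layers
    d1 = (d2 @ W[1].T) * (1.0 - h1 ** 2)
    for Wi, bi, a, d in zip(W, b, (x, h1, h2), (d1, d2, e)):
        Wi -= lr * (a.T @ d)
        bi -= lr * d.sum(axis=0)

print("MSE before:", loss0, "after:", mse())
```

Repeating this fit over many resampled datasets, as above, is what produces the mean-and-spread bands in Fig. b.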
IROS 2021 Experiments
Hypotheses
Given our observations in the examples above, we hypothesize how and why mechanical stiffness can affect the learning process. Broadly, the goal of methods that learn a robot's dynamics is to achieve low prediction error at test time. We conjecture that three distinct phenomena together contribute to higher test-set loss in systems with higher stiffness:
Ground-truth models make worse predictions on noisy data. This is likely because stiffer systems are more sensitive to initial conditions.
Network training converges to worse local minima at training time. Stiffness in the dynamics in turn generates numerical stiffness and local minima in the optimization landscape; therefore, it is likely more difficult for common deep learning optimizers like Adam to converge to a good minimum.
Learned models generalize poorly to test data. Common deep learning architectures are fundamentally biased towards smoothness; because the underlying true system is non-smooth, this bias in fact decreases the likelihood that the training process recovers a system close to the ground truth.

Soft Contact

Medium Contact

Hard Contact
Methodology
To isolate these potential effects of stiffness on learning, we compare learning performance on 3 simulated environments (Soft, Medium, and Hard), which differ only in the stiffness of the simulated contact. Each environment contains the same simple example system: a cube impacting flat ground. For each environment, we collect noisy datasets of varying sizes, and add a small amount of uniform random noise. We then compare the performance of learned models that have undergone meticulous hyperparameter optimization to reduce test-set loss (L2 single-step prediction error). To capture the three effects listed above, we decompose test-set L2 error into three corresponding quantities for each environment: L2 error of the ground-truth models on training data ("Oracle error"); gap between learned and ground-truth models on training data ("Train error - Oracle error"); and the gap between training-set and test-set L2 error of learned models ("Generalization error").
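This decomposition is exact by construction, as a quick sanity check shows (the numbers below are hypothetical placeholders, not measured values):

```python
def decompose_test_error(oracle_err, train_err, test_err):
    """Split test-set L2 error into the three quantities described above."""
    gap_to_oracle = train_err - oracle_err   # optimization quality
    generalization = test_err - train_err    # train-to-test gap
    return oracle_err, gap_to_oracle, generalization

# Hypothetical placeholder values, not measured results:
parts = decompose_test_error(oracle_err=0.01, train_err=0.03, test_err=0.05)
print(parts)  # the three terms sum back to the test error
```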
In addition to training and testing on single-step predictions, we also evaluate the learned models' long-term prediction error, and provide the ground-truth model's long-term error for reference.
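Long-term error of this kind is typically measured by applying the single-step model recursively from a true initial state. The linear rotation "dynamics" below are stand-ins of our own (not the cube system), chosen to show how a small single-step discrepancy compounds over a rollout:

```python
import numpy as np

def rollout(step, x0, horizon):
    """Apply a single-step model recursively from an initial state."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return np.stack(xs)

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

true_step = lambda x: rot(0.10) @ x     # stand-in ground-truth dynamics
learned_step = lambda x: rot(0.11) @ x  # learned model, slightly off per step

x0 = np.array([1.0, 0.0])
errs = np.linalg.norm(
    rollout(learned_step, x0, 100) - rollout(true_step, x0, 100), axis=1)
print("error after 1 step:", errs[1], "after 100 steps:", errs[100])
```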
Oracle error
Hard: 0.0836 +/- 0.0018
Medium: 0.011 +/- 0.00018
Soft: 0.0032 +/- 0.000019
Results
Single-step Prediction
For single-step predictions, we find supporting evidence for all three hypotheses. First, ground-truth model accuracy degrades monotonically with contact stiffness. More surprisingly, the training error of learned models degrades with stiffness nearly twice as fast as that of the ground-truth model, even for single-step predictions. Finally, while generalization error trends to zero for the lower stiffness settings, test error stays significantly and consistently higher than training error for stiff systems across a wide range of dataset sizes. In all, our stiffest model performs worse on the test set than the softest model, even with 100x the training data (5000 vs. 50 training trajectories).
Ground-Truth Pos. Error
Hard: 4.31 +/- 0.13
Medium: 3.57 +/- 0.08
Soft: 2.92 +/- 0.04
Ground-Truth Rot. Error
Hard: 3.98 +/- 0.03
Medium: 3.26 +/- 0.03
Soft: 2.77 +/- 0.03
Long-Term Prediction
We see that the long-term prediction error of the learned models is around an order of magnitude worse than that of the ground-truth model; furthermore, at almost every dataset size, stiffer learned models perform worse than their softer counterparts. Visualizations comparing the learned models' rollouts to the ground-truth trajectories are shown below.

Soft Rollout

Medium Rollout
Blue: Ground-truth
Red: Learned model prediction

Hard Rollout
Related Work
Multi-Modality
One approach to handling near-discontinuity is to learn a "multi-modal" model--essentially, learning a piecewise-smooth function by learning all of the individual pieces and their domains ([1], [2]). This approach is certainly mathematically sound and practically applicable to several robotic tasks; the limiting factor for more general application is that for more complex tasks (e.g. in-hand manipulation with a human-like hand), contact can generate thousands of different modes.
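Schematically, such a model pairs a hard mode classifier with one smooth model per mode. The two modes and their dynamics below are illustrative stand-ins of our own, not a method from [1] or [2]:

```python
import numpy as np

def predict(x, classify, experts):
    """Route the state to the smooth model for its active mode."""
    return experts[classify(x)](x)

# State is (height z, vertical velocity v); two illustrative modes.
classify = lambda x: "contact" if x[0] < 0.0 else "flight"
experts = {
    "flight": lambda x: x + np.array([x[1], -9.81]) * 0.01,  # ballistic step
    "contact": lambda x: np.array([x[0], -0.8 * x[1]]),      # restitution law
}

print(predict(np.array([0.5, -1.0]), classify, experts))    # free flight
print(predict(np.array([-0.01, -1.0]), classify, experts))  # impact
```

With only two modes this routing is trivial; the difficulty noted above is that the number of such pieces grows combinatorially with the number of contacts.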
Embedding Physics into Learned Models
There are many recent works that explicitly bake the structure of physics and contact into learned models. A consistent result is that these methods have better inductive biases than basic DNNs, and thus achieve better generalization. A popular group of these methods is differentiable physics, in which a physics-based model is trained via gradient descent. This, however, does not remedy the challenging optimization landscape, limiting results to tuning a handful of parameters and often requiring prohibitively expensive second-order or global optimization techniques ([3], [4], [5]). Our recent work, ContactNets, attempts to further mitigate these training difficulties with a novel, physics-inspired loss, rather than directly minimizing prediction error [6].
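In the few-parameter regime, the differentiable-physics recipe can work well. Below is a toy sketch of our own (not drawn from [3]-[5]): a single unknown stiffness in the model f = -k*z is fit by gradient descent on synthetic, noisy force data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: contact forces from a stiff spring f = -k*z, plus noise.
z = -rng.uniform(0.0, 0.1, size=50)            # penetration depths (z < 0)
f_obs = -2500.0 * z + rng.normal(0.0, 1.0, size=50)

k = 100.0   # poor initial stiffness guess
lr = 100.0  # step size chosen for this problem's curvature
for _ in range(200):
    residual = -k * z - f_obs                  # model prediction minus data
    grad = np.mean(2.0 * residual * (-z))      # d/dk of the mean squared error
    k -= lr * grad

print("recovered stiffness:", k)  # approaches the true value of 2500
```

With one parameter the loss is convex and descent succeeds; the landscape issues described above arise when many parameters interact through stiff, nearly-discontinuous dynamics.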