CLEVER: Stream-based Active Learning for Robust Semantic Perception from Human Instructions
Jongseok Lee, Timo Birr, Rudolph Triebel and Tamim Asfour
DLR & KIT
TL;DR: To our knowledge, we are the first to demonstrate stream-based active learning with deep neural networks for robust semantic perception with robots.
We propose CLEVER, a stream-based active learner for robust semantic perception with Deep Neural Networks (DNNs). Our system seeks human support when it encounters failures and adapts its model online using human instructions. In this way, CLEVER can eventually accomplish the given semantic perception tasks. Our main contribution is the design of a single unified system that meets several desiderata of a stream-based active learner. The highlights of our experiments are two-fold. First, in a user validation study with 13 participants, we show that our system is capable of performing semantic scene analysis on arbitrary objects. Second, we demonstrate the capabilities of CLEVER on a humanoid robot for the perception of deformable objects. To the best of our knowledge, this work is the first to realize a stream-based active learner on a real robot, providing evidence that the robustness of DNN-based semantic perception can be improved in practice.
CLEVER is a stream-based active learning system with deep neural networks. Active learning is a paradigm in which a learning algorithm identifies the most useful unlabeled instances to learn from. The literature mostly assumes a pool of unlabeled instances, resulting in so-called pool-based active learning. In contrast, we focus on a setting in which data arrive as a stream, hence the name stream-based active learning [1]. More precisely, our work addresses online active learning with batches from a data stream. So, how does it work?
Let's imagine a standard training procedure. Given a set of training data, say images of t-shirts, we obtain a model based on a deep neural network. Then, once we are given another t-shirt, our model predicts its semantic information. But what if the predictions are wrong? In our stream-based active learning system, the algorithm identifies wrong predictions using uncertainty estimates and asks humans for support. When a human provides the annotation that the given object is a t-shirt, we update our training data. As our system can learn the model online with the updated training data, it is now able to make correct predictions. We show that such capabilities lead to a more robust semantic perception system, compared to deploying a fixed neural network.
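The loop just described can be sketched in a few lines. This is a toy illustration, not the actual CLEVER implementation: a nearest-centroid classifier stands in for the DNN, the distance to the closest centroid stands in for the uncertainty estimate, and `ask_human` is a hypothetical stand-in for the human annotation interface.

```python
import math

class StreamActiveLearner:
    """Toy stream-based active learner: query a human when uncertain,
    then adapt the model online with the new annotation."""

    def __init__(self, threshold):
        self.threshold = threshold  # uncertainty level that triggers a query
        self.centroids = {}         # label -> (sum_vector, count)

    def _predict(self, x):
        # Predict the nearest class centroid; the distance to it
        # plays the role of the model's predictive uncertainty.
        best, best_d = None, float("inf")
        for label, (s, n) in self.centroids.items():
            c = [si / n for si in s]
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = label, d
        return best, best_d

    def _update(self, x, label):
        # Online model update: fold the annotated sample into its centroid.
        s, n = self.centroids.get(label, ([0.0] * len(x), 0))
        self.centroids[label] = ([si + xi for si, xi in zip(s, x)], n + 1)

    def observe(self, x, ask_human):
        # One step of the stream: predict, and if too uncertain,
        # query the human and learn from the answer.
        label, uncertainty = self._predict(x)
        if label is None or uncertainty > self.threshold:
            label = ask_human(x)
            self._update(x, label)
        return label
```

After the first human annotation, nearby samples are classified confidently without further queries; a sufficiently novel sample (e.g. a deformed object far from known centroids) triggers a new query.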
Our main contribution is addressing several desiderata in developing such a system. These desiderata are the system's ability to:
reason about generalization and uncertainty in a small data regime,
address catastrophic forgetting,
efficiently adapt the model online,
generate suitable queries and select the most informative samples.
To the best of our knowledge, CLEVER is the first deep-neural-network-based system that meets all these requirements, along with demonstrations on a real robotic system.
How do we achieve this? Amongst many elements of the system design, we rely on a technique called Laplace Approximation [2-5], which can perform probabilistic inference on neural networks.
When we say Bayesian Neural Networks, or BNNs in short, we mean the application of probabilistic inference to neural networks. The outcome is a distribution over the model parameters, which is visualized on top. This is different from standard deep learning, where we have deterministic model weights. BNNs have many advantages, but here we are interested in well-calibrated uncertainty estimates. The Laplace Approximation is one such probabilistic inference technique, in which we impose a Gaussian distribution over the model parameters.
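To illustrate what a distribution over model parameters buys us, here is a minimal toy sketch (not the paper's model): a 1-D linear model y = w·x with a Gaussian over its single weight. Sampling weights turns a point prediction into a predictive mean with an error bar.

```python
import math, random

def predict(x, w_mean, w_var, n_samples=10000, seed=0):
    """Monte Carlo prediction for y = w * x, where w ~ N(w_mean, w_var).
    Returns the predictive mean and standard deviation."""
    rng = random.Random(seed)
    ys = [rng.gauss(w_mean, math.sqrt(w_var)) * x for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, math.sqrt(var)

# The weight uncertainty is scaled by the input, so predictions farther
# from the origin come with larger error bars.
mean1, std1 = predict(1.0, w_mean=2.0, w_var=0.1)
mean2, std2 = predict(5.0, w_mean=2.0, w_var=0.1)
```

In a real BNN the same idea holds per-layer: each weight sample yields a different network, and the spread of their outputs is the predictive uncertainty used to decide when to query a human.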
The Laplace Approximation mainly involves three steps. First, we infer the MAP (maximum a-posteriori) estimate of the model parameters. This is the same as standard deep learning, where we optimize for the model parameters that best fit our training data with stochastic gradient descent or some variant. Second, we compute the Hessian of the loss with respect to the parameters. Such Hessian matrices are often used for second-order optimization, and there are many good ways to obtain them. Third, we invert the Hessian matrix. In the Laplace Approximation, one can show that the inverse of the Hessian forms a covariance matrix over the model parameters. After these three steps, we obtain a Gaussian distribution with a mean and a covariance.
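The three steps can be followed end-to-end on a toy 1-D model y = w·x with a Gaussian prior on w. This is a sketch under simplifying assumptions, not the paper's implementation, which applies the idea to deep networks with structured Hessian approximations.

```python
# Toy data for the 1-D model y = w * x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
prior_var = 10.0  # variance of the Gaussian prior over w

# Step 1: MAP estimate via gradient descent on the negative log posterior
# (squared-error data fit plus the Gaussian prior term).
w = 0.0
for _ in range(500):
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) + w / prior_var
    w -= 0.01 * grad

# Step 2: Hessian of the negative log posterior w.r.t. w (a scalar here).
hessian = sum(x * x for x in xs) + 1.0 / prior_var

# Step 3: invert the Hessian to obtain the posterior covariance.
w_var = 1.0 / hessian

# The Laplace posterior over the weight is N(w, w_var).
```

In this scalar case the inversion is trivial; for deep networks the Hessian is huge, which is why sparse or structured approximations such as [3-5] are needed.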
We go one step further. The reason is to generalize well and provide good uncertainty estimates in the small data regime, which is simply the nature of our problem in stream-based active learning. That is, we cannot assume big data when performing probabilistic inference. In such settings, the prior becomes very important. In our work, we learn that prior from relevant previous tasks and datasets. In other words, we use the posterior distribution from the previous task as an informative prior for the current task, resulting in a Bayesian continual learning framework. In the visualization above, we alluded to the use of simulation, but the framework also applies without it.
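This recycling of posteriors can be sketched on the same toy 1-D model: the previous task's posterior precision simply adds to the new task's Hessian, which is what makes the prior informative. The function name `laplace_update` is ours, for illustration only.

```python
def laplace_update(xs, ys, prior_mean, prior_precision, lr=0.01, steps=500):
    """One Laplace step for the toy model y = w * x with a Gaussian prior
    N(prior_mean, 1/prior_precision). Returns posterior mean and precision."""
    w = prior_mean
    for _ in range(steps):
        # Negative log posterior gradient: data fit plus prior pull.
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) \
               + prior_precision * (w - prior_mean)
        w -= lr * grad
    # Posterior precision = data Hessian + prior precision.
    precision = sum(x * x for x in xs) + prior_precision
    return w, precision

# Task 1 starts from a vague prior; its Laplace posterior then serves
# as the informative prior for task 2 (Bayesian continual learning).
w1, p1 = laplace_update([1.0, 2.0], [2.0, 4.1], prior_mean=0.0, prior_precision=0.1)
w2, p2 = laplace_update([3.0], [6.3], prior_mean=w1, prior_precision=p1)
```

Because task 2 starts from a precise prior, a single new observation is enough to update the model without overwriting what task 1 established, which is the mechanism that counters catastrophic forgetting in this framework.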
Feel free to read the paper for more details. A reading list is also provided at the bottom of this page.
Below, we present videos from the additional real world experiments.
This video shows the CLEVER system deployed on the humanoid robot ARMAR-6. It is a single-take sequence without any cuts. ARMAR learns semantic perception of a deformable object with language.
Learning an hourglass.
Learning a kendo head cover.
Learning a robot doll.
Learning a rock.
Learning an actuator.
Closing the sim2real gap.
Here, we learn and evaluate on an apple, a name card and a robot doll in sequence.
This video shows ARMAR closing the sim2real gap for two objects, namely an apple and a banana. Then, ARMAR learns a T-shirt both in a rigid shape and with deformation. The created deformation makes the robot uncertain, generating new queries.
This video shows CLEVER learning multiple objects. We trained on five arbitrarily chosen objects within a reasonable time frame. Then, we evaluated the system performance on this set of objects.
Coming soon....
The authors would like to acknowledge the many DLR researchers who participated in the user validation study. Thank you Rudolph, Ribin, Riccardo, Uli, Joao, Simran, Hojune, Korbi, Thomas, Leonhard, Arjun, Julius, and Antonin! Special thanks to Max Durner for the laptop, and to David for the initial investigations of the continual learning component on TORO. We also acknowledge funding from Helmholtz AI for a research visit to KIT and from the EU Inverse project.
[1] Cacciarelli, D., & Kulahci, M. (2024). Active learning for data streams: a survey. Machine Learning, 113(1), 185-239.
[2] Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., ... & Zhu, X. X. (2023). A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1), 1513-1589.
[3] Lee, J., Humt, M., Feng, J., & Triebel, R. (2020, November). Estimating model uncertainty of neural networks in sparse information form. In International Conference on Machine Learning (pp. 5702-5713). PMLR.
[4] Lee, J., Feng, J., Humt, M., Müller, M. G., & Triebel, R. (2022, January). Trust your robots! predictive uncertainty estimation of neural networks with sparse gaussian processes. In Conference on Robot Learning (pp. 1168-1179). PMLR.
[5] Schnaus, D., Lee, J., Cremers, D., & Triebel, R. (2023, July). Learning expressive priors for generalization and uncertainty estimation in neural networks. In International Conference on Machine Learning (pp. 30252-30284). PMLR.
[6] Narr, A., Triebel, R., & Cremers, D. (2016, May). Stream-based active learning for efficient and adaptive classification of 3d objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA) (pp. 227-233). IEEE.
[7] Triebel, R., Grimmett, H., Paul, R., & Posner, I. (2016). Driven learning for driving: How introspection improves semantic mapping. In Robotics Research: The 16th International Symposium ISRR (pp. 449-465). Springer International Publishing.
[8] Mund, D., Triebel, R., & Cremers, D. (2015, May). Active online confidence boosting for efficient object classification. In 2015 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1367-1373). IEEE.
[9] Tao, Y., Triebel, R., & Cremers, D. (2015, September). Semi-supervised online learning for efficient classification of objects in 3d data streams. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 2904-2910). IEEE.
[10] Denninger, M., & Triebel, R. (2018, October). Persistent anytime learning of objects from unseen classes. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 4075-4082). IEEE.
Please feel free to contact me (jongseok.lee@dlr.de) if you have any recommended paper on this topic.