Heecheol Kim

I am a Ph.D. student at the University of Tokyo researching deep learning for robotics under the supervision of Prof. Yasuo Kuniyoshi. My research focuses on deep imitation learning-based robot manipulation to build an intelligent robot that can acquire human-like dexterous manipulation skills, such as needle threading and banana peeling.

Google Scholar LinkedIn GitHub

Email: h-kim@isi.imi.i.u-tokyo.ac.jp

Heecheol Kim is a Ph.D. student at the Intelligent Systems and Informatics Laboratory, the University of Tokyo. He received his bachelor's and master's degrees in mechano-informatics from the University of Tokyo in 2017 and 2021, respectively. He served in the Eighth U.S. Army as a Korean Augmentee to the U.S. Army (KATUSA) from 2017 to 2019 and received the Army Commendation Medal during his service. His work has been featured by many popular press outlets, including Reuters, NBC News, and New Scientist. His research has been recognized with the Dean's Award at the University of Tokyo (2021), the IEEE Robotics and Automation Society Japan Joint Chapter Young Award (2021), a JSPS Research Fellowship for Young Scientists (DC2), and an NEC C&C Foundation Grant for Non-Japanese Researchers.

Research

Robot peels banana with goal-conditioned dual-action deep imitation learning

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

arXiv 2022 (submitted to IEEE Transactions on Robotics (T-RO))

Long-horizon dexterous manipulation of deformable objects, such as banana peeling, is a challenging task because of the difficulty of modeling the object and the lack of knowledge about stable, dexterous manipulation skills. This paper presents a goal-conditioned dual-action deep imitation learning (DIL) approach that can learn dexterous manipulation skills from human demonstration data. Previous DIL methods map the current sensory input to a reactive action, which often fails because of compounding errors in imitation learning caused by the recurrent computation of actions. The proposed method instead predicts a reactive action when precise manipulation of the target object is required (local action) and generates an entire trajectory when precise manipulation is not required (global action). This dual-action formulation effectively prevents compounding errors through the trajectory-based global action, while the reactive local action responds to unexpected changes in the target object. Furthermore, both global and local actions are conditioned on a goal state, defined as the last step of each subtask, which makes policy prediction robust. The proposed method was tested on a real dual-arm robot and successfully accomplished the banana-peeling task.
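The switching logic between the two action types can be sketched as follows. This is a minimal illustration, not the paper's networks: `global_policy`, `local_policy`, and the distance threshold are hypothetical stand-ins, and the global trajectory is re-planned each step here for brevity.

```python
import numpy as np

THRESHOLD = 0.05  # illustrative: switch to reactive local action near the target

def global_policy(state, goal):
    """Stand-in: predict an entire trajectory of waypoints toward the goal."""
    return np.linspace(state, goal, num=10)  # (10, dim) waypoints

def local_policy(state, goal):
    """Stand-in: predict a single reactive action toward the goal."""
    return 0.5 * (goal - state)

def dual_action_step(state, goal):
    if np.linalg.norm(goal - state) > THRESHOLD:
        # Far from the object: follow the trajectory predicted in one shot,
        # which avoids compounding errors from step-by-step action prediction.
        return global_policy(state, goal)[1]
    # Near the object: react to its current state at every step.
    return state + local_policy(state, goal)

state, goal = np.zeros(3), np.array([0.2, 0.1, 0.0])
for _ in range(50):
    state = dual_action_step(state, goal)
```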

Training Robots without Robots: Deep Imitation Learning for Master-to-Robot Policy Transfer

Heecheol Kim, Yoshiyuki Ohmura, Akihiko Nagakubo, Yasuo Kuniyoshi

arXiv 2022 (submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))

Deep imitation learning is a promising method for dexterous robot manipulation because it only requires demonstration samples to learn manipulation skills. In this paper, deep imitation learning is applied to tasks that require force feedback, such as bottle opening. Simple visual teleoperation systems cannot be used for such demonstrations because they do not provide force feedback to the operator. Bilateral teleoperation provides force feedback during demonstration, but it requires an expensive and complex bilateral robot system. This paper presents a new master-to-robot (M2R) transfer learning system that requires no robot during demonstration yet can still teach dexterous, force-feedback-based manipulation tasks to a robot. The human directly demonstrates a task using a low-cost controller that matches the kinematic parameters of the robot arm; with this controller, the operator naturally feels the force feedback without any expensive bilateral system. Furthermore, the M2R transfer system overcomes the domain gap between the master and the robot using a gaze-based imitation learning framework and a simple calibration method. The proposed system was evaluated on a bottle-cap-opening task in which force feedback is available only during the master demonstration.
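One common way to calibrate between a master controller's frame and a robot's frame is to fit a rigid transform from paired reference points via the Kabsch algorithm; the sketch below shows that generic technique, not necessarily the paper's actual calibration procedure.

```python
import numpy as np

def fit_rigid_transform(master_pts, robot_pts):
    """Least-squares rigid transform with robot_pts ~= master_pts @ R.T + t."""
    cm, cr = master_pts.mean(axis=0), robot_pts.mean(axis=0)
    H = (master_pts - cm).T @ (robot_pts - cr)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cr - R @ cm

# Synthetic check: recover a known rotation about z plus a translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.05])
master = np.random.default_rng(0).normal(size=(10, 3))
robot = master @ R_true.T + t_true
R_est, t_est = fit_rigid_transform(master, robot)
```

With noiseless correspondences, the estimate recovers the true transform exactly up to numerical precision.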

Transformer-based deep imitation learning for dual-arm robot manipulation

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021

Deep imitation learning is promising for solving dexterous manipulation tasks because it requires neither an environment model nor pre-programmed robot behavior. However, its application to dual-arm manipulation remains challenging: the additional robot manipulator increases the number of state dimensions, which distracts the neural network and degrades its performance. We address this issue with a self-attention mechanism that computes dependencies between elements of a sequential input and focuses on the important ones. A Transformer, a variant of the self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world. The proposed method was tested on dual-arm manipulation tasks using a real robot. The experimental results demonstrate that the Transformer-based deep imitation learning architecture can attend to the important features among the sensory inputs, thereby reducing distractions and improving manipulation performance compared with a baseline architecture without self-attention.
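The core operation, scaled dot-product self-attention over sensory-input elements, can be sketched in a few lines of numpy. The token layout and dimensions below are illustrative, not the paper's exact architecture.

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: (n_tokens, d) -- e.g. left-arm, right-arm, and gaze features."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise dependencies
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # attended features

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(3, d))  # [left-arm state, right-arm state, gaze]
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
```

Because the softmax weights depend on the input, the network can down-weight the arm that is irrelevant to the current subtask, which is the distraction-suppression effect described above.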

Gaze-based dual resolution deep imitation learning for high-precision dexterous robot manipulation

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

IEEE Robotics and Automation Letters (RA-L), 2021

High-precision manipulation tasks, such as needle threading, are challenging. Physiological studies suggest that humans combine low-resolution peripheral vision with fast movement to transport the hand into the vicinity of an object, and then use high-resolution foveated vision for accurate homing of the hand onto the object. This study demonstrates that a deep imitation learning-based method, inspired by this gaze-based dual-resolution visuomotor control system in humans, can solve the needle-threading task. First, we recorded the gaze movements of a human operator teleoperating a robot. Then, we used a low-resolution peripheral image to reach the vicinity of the target, and only a high-resolution image around the gaze to precisely control the thread position once it was close to the target. The experimental results demonstrate that the proposed method enables precise manipulation with a general-purpose robot manipulator while improving computational efficiency.
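The dual-resolution input pipeline amounts to two views of the same camera frame: a downsampled peripheral image for coarse reaching, and a full-resolution crop around the predicted gaze for precise control. The image sizes and crop size below are illustrative assumptions.

```python
import numpy as np

def peripheral_image(image, factor=4):
    """Downsample by striding: a cheap stand-in for low-resolution vision."""
    return image[::factor, ::factor]

def foveated_crop(image, gaze_xy, size=64):
    """Full-resolution crop centred on the gaze position (pixel coords)."""
    h, w = image.shape[:2]
    x = int(np.clip(gaze_xy[0] - size // 2, 0, w - size))
    y = int(np.clip(gaze_xy[1] - size // 2, 0, h - size))
    return image[y:y + size, x:x + size]

frame = np.zeros((256, 256, 3), dtype=np.uint8)   # illustrative camera frame
low = peripheral_image(frame)                     # coarse peripheral view
fovea = foveated_crop(frame, (200, 120))          # detailed foveal view
```

Both views end up the same small size, so the downstream network processes far fewer pixels than the full frame, which is where the computational-efficiency gain comes from.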

Memory-based gaze prediction in deep imitation learning for robot manipulation

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

In IEEE International Conference on Robotics and Automation (ICRA), 2022

Deep imitation learning is a promising approach for autonomous robot manipulation because it does not require hard-coded control rules. Current applications of deep imitation learning to robot manipulation have been limited to reactive control based on states at the current time step. However, future robots will also be required to solve tasks using memory obtained through experience in complicated environments (e.g., when a robot is asked to find a previously used object on a shelf). In such situations, simple deep imitation learning may fail because of distractions caused by the complicated environment. We propose that gaze prediction from sequential visual input enables the robot to perform a manipulation task that requires memory. The proposed algorithm uses a Transformer-based self-attention architecture for gaze estimation from sequential data to implement memory. The proposed method was evaluated on a real-robot multi-object manipulation task that requires memory of previous states.

Using human gaze to improve robustness to irrelevant objects in robot manipulation tasks

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

IEEE Robotics and Automation Letters (RA-L), 2020

Deep imitation learning enables the learning of complex visuomotor skills from raw pixel inputs. However, this approach suffers from overfitting to the training images: the neural network is easily distracted by task-irrelevant objects. In this letter, we use the human gaze, measured by a head-mounted eye-tracking device, to discard task-irrelevant visual distractions. We propose a mixture density network-based behavior cloning method that learns to imitate the human gaze. The model predicts gaze positions from raw pixel images and crops the images around the predicted gaze; only these cropped images are used to compute the output action. This cropping procedure removes visual distractions because the gaze rarely fixates on task-irrelevant objects, and the resulting robustness improves the manipulation performance of robots in scenarios where task-irrelevant objects are present. We evaluated our model on four manipulation tasks designed to test robustness to irrelevant objects. The results indicate that the proposed model can predict the locations of task-relevant objects from gaze positions, is robust to task-irrelevant objects, and achieves strong manipulation performance, especially in multi-object handling.
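At inference time, a mixture density network outputs mixture weights, means, and scales over 2-D gaze positions, and one simple way to pick a gaze is the mean of the most probable component. The sketch below shows only that selection step; the network outputs here are made-up numbers, not trained values.

```python
import numpy as np

def mdn_predict_gaze(logits, means):
    """Pick the mean of the most probable mixture component as the gaze."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()          # softmax over mixture weights
    return means[np.argmax(weights)]

# Illustrative 3-component mixture over normalized (x, y) gaze coordinates.
logits = np.array([0.2, 1.5, -0.3])
means = np.array([[0.1, 0.8], [0.6, 0.4], [0.9, 0.2]])
gaze = mdn_predict_gaze(logits, means)   # -> [0.6, 0.4]
```

The image would then be cropped around `gaze`, so everything outside the fixated region never reaches the action network.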

Disentangled Representations for Sequence Data using Information Bottleneck Principle

Masanori Yamada*, Heecheol Kim*, Kosuke Miyoshi, Tomoharu Iwata, Hiroshi Yamakawa (* equal contribution)

In Asian Conference on Machine Learning (ACML), 2020

We propose the factorizing variational autoencoder (FAVAE), a generative model for learning disentangled representations from sequential data via the information bottleneck principle without supervision. Real-world data are often generated by a few explanatory factors of variation, and disentangled representation learning obtains these factors from the data. We focus on the disentangled representation of sequential data which can be useful in a wide range of applications, such as video, speech, and stock markets. Factors in sequential data are categorized into dynamic and static ones: dynamic factors are time dependent, and static factors are time independent. Previous models disentangle between static and dynamic factors and between dynamic factors with different time dependencies by explicitly modeling the priors of latent variables. However, these models cannot disentangle representations between dynamic factors with the same time dependency, such as disentangling “picking up” and “throwing” in robotic tasks. On the other hand, FAVAE can disentangle multiple dynamic factors via the information bottleneck principle where it does not require modeling priors. We conducted experiments to show that FAVAE can extract disentangled dynamic factors on synthetic, video, and speech datasets.
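As a rough sketch (a simplified single-latent form, not the paper's exact multi-ladder objective), an information-bottleneck VAE of this kind maximizes reconstruction of the sequence while softly constraining the latent information to a capacity:

```latex
\max_{\phi,\theta} \;
  \mathbb{E}_{q_\phi(z \mid x_{1:T})}\!\left[ \log p_\theta(x_{1:T} \mid z) \right]
  \;-\; \beta \,\Bigl|\, \mathrm{KL}\!\left( q_\phi(z \mid x_{1:T}) \,\|\, p(z) \right) - C \,\Bigr|
```

Here $x_{1:T}$ is the observed sequence, $z$ the latent factors, $\beta$ the bottleneck weight, and $C$ an information capacity; the symbols and the single-latent form are illustrative.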

Reinforcement Learning in Latent Action Sequence Space

Heecheol Kim*, Masanori Yamada*, Kosuke Miyoshi, Tomoharu Iwata, Hiroshi Yamakawa (* equal contribution)

In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

One problem in real-world applications of reinforcement learning is the high dimensionality of the action search spaces, which comes from the combination of actions over time. To reduce the dimensionality of action sequence search spaces, macro actions have been studied, which are sequences of primitive actions to solve tasks. However, previous studies relied on humans to define macro actions or assumed macro actions to be repetitions of the same primitive actions. We propose encoded action sequence reinforcement learning (EASRL), a reinforcement learning method that learns flexible sequences of actions in a latent space for a high-dimensional action sequence search space. With EASRL, encoder and decoder networks are trained with demonstration data by using variational autoencoders for mapping macro actions into the latent space. Then, we learn a policy network in the latent space, which is a distribution over encoded macro actions given a state. By learning in the latent space, we can reduce the dimensionality of the action sequence search space and handle various patterns of action sequences. We experimentally demonstrate that the proposed method outperforms other reinforcement learning methods on tasks that require an extensive amount of search.
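The execution side of this idea can be sketched as follows: the policy picks a low-dimensional latent code, a decoder expands it into a macro action (a short sequence of primitive actions), and that sequence is executed open-loop. The linear decoder and additive dynamics below are illustrative stand-ins, not the trained VAE or a real environment.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, SEQ_LEN, ACTION_DIM = 2, 5, 3          # illustrative sizes
W = rng.normal(size=(LATENT_DIM, SEQ_LEN * ACTION_DIM))

def decode_macro_action(z):
    """Stand-in decoder: map a latent code to a sequence of primitive actions."""
    return (z @ W).reshape(SEQ_LEN, ACTION_DIM)

def rollout(state, z):
    """Execute the decoded macro action open-loop in stand-in dynamics."""
    for action in decode_macro_action(z):
        state = state + action
    return state

z = rng.normal(size=LATENT_DIM)     # in EASRL this would come from the policy
final = rollout(np.zeros(ACTION_DIM), z)
```

The policy thus searches over a 2-dimensional latent instead of a 15-dimensional raw action sequence, which is the dimensionality reduction the abstract describes.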