Research

Broad Research Area: Robotics and Artificial Intelligence


In the area of Generative AI, Intelligent Robot Grasping and Mobile Manipulation

Recent advancements in Generative Artificial Intelligence, particularly Large Language Models (LLMs) and Large Vision Language Models (LVLMs), have opened up the prospect of using cognitive planners within robotic systems. This work focuses on solving the object goal navigation problem by mimicking human cognition: attending to, perceiving and storing task-specific information, and generating plans from it. We introduce a comprehensive framework capable of exploring an unfamiliar environment in search of an object by leveraging the ability of LLMs and LVLMs to understand the underlying semantics of our world. A key challenge in using LLMs to generate high-level subgoals is representing the environment around the robot efficiently. We propose a modular 3D scene representation, with semantically rich descriptions of each object, to provide the LLM with task-relevant information. However, supplying the LLM with this large amount of contextual information (the rich 3D semantic scene representation) can lead to redundant and inefficient plans. We therefore propose an LLM-based pruner that leverages in-context learning to discard information irrelevant to the goal. (View more)
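The sketch below illustrates, in Python, how such an LLM-based pruner could be wired up: the semantically rich object descriptions are formatted into an in-context query and only the indices the LLM returns are kept. The object fields, prompt wording, and the generic `llm` callable are illustrative assumptions, not the framework's actual interface.

```python
# Illustrative sketch (not the paper's code): pruning a semantic 3D scene
# representation with an LLM before plan generation. Object fields and the
# `llm` callable (any text-in/text-out model wrapper) are assumptions.

def build_pruning_prompt(goal, scene_objects):
    """Format the goal and per-object semantic descriptions as an
    in-context pruning query for the LLM."""
    lines = [f"Target object to find: {goal}",
             "Keep only scene entries that help locate the target.",
             "Scene entries:"]
    for i, obj in enumerate(scene_objects):
        lines.append(f"{i}: {obj['label']} at {obj['position']} -- {obj['description']}")
    lines.append("Answer with the indices to keep, comma separated.")
    return "\n".join(lines)


def prune_scene(goal, scene_objects, llm):
    """Ask the LLM which entries are goal-relevant and drop the rest."""
    reply = llm(build_pruning_prompt(goal, scene_objects))
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [obj for i, obj in enumerate(scene_objects) if i in keep]


if __name__ == "__main__":
    scene = [
        {"label": "sofa", "position": (1.0, 2.0, 0.0), "description": "grey fabric sofa in living room"},
        {"label": "mug", "position": (4.2, 0.5, 0.9), "description": "ceramic mug on kitchen counter"},
    ]
    fake_llm = lambda prompt: "1"          # stand-in for a real LLM client
    print(prune_scene("coffee mug", scene, fake_llm))
```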

Reinforcement Learning and Imitation Learning approaches rely on policy learning strategies that generalize poorly from only a few examples of a task. In this work, we propose a language-conditioned, semantic search-based method that produces an online search-based policy from an available demonstration dataset of state-action trajectories. Actions are acquired directly from the most similar manipulation trajectories found in the dataset. Our approach surpasses the performance of the baselines on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities. This holds great potential for extending our online search-based policy approach to tasks typically addressed by Imitation Learning or Reinforcement Learning-based policies. (View more)
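A minimal sketch of the retrieval step, assuming a generic embedding of the language instruction and current state: the query is compared against demonstration embeddings by cosine similarity and the actions of the most similar trajectory are replayed. The embedding dimension and dataset layout are assumptions, not the CALVIN interface.

```python
# Minimal sketch of an online search-based policy: at run time, embed the
# language instruction plus current state, retrieve the most similar
# demonstration, and replay its actions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_actions(query_embedding, demos):
    """demos: list of (embedding, action_sequence) pairs taken from the
    demonstration dataset of state-action trajectories."""
    best = max(demos, key=lambda d: cosine(query_embedding, d[0]))
    return best[1]   # actions of the most similar manipulation trajectory

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = [(rng.normal(size=32), [f"action_{i}"]) for i in range(5)]
    query = demos[3][0] + 0.01 * rng.normal(size=32)   # query close to demo 3
    print(retrieve_actions(query, demos))
```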

In the realm of computer vision and robotics, the pursuit of intelligent robotic grasping and accurate 6D object pose estimation has been a focal point of research. Many modern-world applications, such as robot grasping, manipulation, and palletizing, require the correct pose of objects present in a scene to perform their specific tasks. Estimating a 6D object pose becomes even more challenging when dealing with objects positioned within cluttered scenes and subjected to high levels of occlusion. While prior efforts have made strides in addressing this issue, their accuracy falls short of the reliability demanded by real-world applications. In this research, we present an architecture that, unlike prior works, incorporates contextual awareness and capitalizes on the contextual information available about the objects in question. The proposed framework discerns objects by an intrinsic characteristic, namely whether they are symmetric or non-symmetric, and employs a deeper estimator and refiner network pair for non-symmetric objects than for symmetric ones. This distinction acknowledges the inherent dissimilarities between the two object types, thereby enhancing performance. Through experiments conducted on the LineMOD dataset, widely regarded as a benchmark for pose estimation in occluded and cluttered scenes, we demonstrate an improvement in accuracy of approximately 3.2% over the previous state-of-the-art method, DenseFusion. Moreover, our results indicate that the achieved inference time is sufficient for real-time use. Overall, our proposed architecture leverages contextual information and tailors the pose estimation process to object type, leading to enhanced accuracy and real-time performance in challenging scenarios. (View more)
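A hedged sketch of the routing idea only: symmetric objects pass through a lighter estimator/refiner head while non-symmetric objects pass through a deeper one. The MLP stand-ins, layer sizes, and feature dimension are illustrative assumptions, not the paper's actual networks.

```python
# Sketch of context-aware routing: a deeper head for non-symmetric objects,
# a lighter head for symmetric ones. Sizes and MLP stand-ins are assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, hidden, depth, out_dim):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class ContextAwarePoseEstimator(nn.Module):
    def __init__(self, feat_dim=256, pose_dim=7):   # pose: quaternion + translation
        super().__init__()
        self.symmetric_head = mlp(feat_dim, 128, depth=2, out_dim=pose_dim)
        self.nonsymmetric_head = mlp(feat_dim, 256, depth=4, out_dim=pose_dim)  # deeper

    def forward(self, features, is_symmetric):
        head = self.symmetric_head if is_symmetric else self.nonsymmetric_head
        return head(features)

if __name__ == "__main__":
    model = ContextAwarePoseEstimator()
    feats = torch.randn(1, 256)                      # stand-in for fused image features
    print(model(feats, is_symmetric=False).shape)    # torch.Size([1, 7])
```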

Here we propose a deep neural network-based architecture, which we name Generative Inception Neural Network (GI-NNet), capable of intelligently predicting antipodal robotic grasps on seen as well as unseen objects. It is trained on the Cornell Grasping Dataset (CGD) and attains a 98.87% grasp pose accuracy for detecting both regular and irregular shaped objects from RGB-Depth images, while requiring only one-third of the trainable network parameters of existing approaches. However, to attain this level of performance the model requires 90% of the available labelled CGD data for training, keeping only 10% for testing, which makes it vulnerable to poor generalization. Furthermore, obtaining a sufficient, high-quality labelled dataset for robot grasping is extremely difficult. To address these issues, we subsequently propose another architecture in which the GI-NNet model is attached as the decoder of a Vector Quantized Variational Auto-Encoder (VQ-VAE), which works more efficiently when trained on both the available labelled and unlabelled data. The proposed model, which we name Representation-based GI-NNet (RGI-NNet), has been trained on various splits of the CGD, from only 10% labelled data with the latent embedding of the VQ-VAE up to 90% labelled data with the latent embedding, to test the learning ability of the architecture. Remarkably, the architecture produces its best results when trained with only 50% of the labelled CGD data together with the latent embedding. The reasoning behind this, along with other relevant technical details, is elaborated in the paper. The grasp pose accuracy of RGI-NNet varies between 92.1348% and 97.7528%, which is far better than several existing models trained only on labelled data. For performance verification of both proposed models, GI-NNet and RGI-NNet, we have performed rigorous experiments on the Anukul (Baxter) hardware cobot. (View more)
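The following sketch shows the vector-quantization step that produces the latent embedding a decoder such as RGI-NNet could consume: each encoder output is snapped to its nearest codebook vector, with a straight-through estimator so gradients still reach the encoder. Codebook size, embedding dimension, and the encoder/decoder stand-ins are assumptions.

```python
# Illustrative sketch of the VQ-VAE quantisation step feeding a grasp decoder.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):                       # z_e: (batch, code_dim) encoder output
        dists = torch.cdist(z_e, self.codebook.weight)   # distance to every code
        idx = dists.argmin(dim=1)                 # nearest codebook index
        z_q = self.codebook(idx)                  # quantized latent embedding
        z_q = z_e + (z_q - z_e).detach()          # straight-through estimator
        return z_q, idx

if __name__ == "__main__":
    vq = VectorQuantizer()
    z_e = torch.randn(8, 64)                      # stand-in for the encoder output
    z_q, idx = vq(z_e)
    print(z_q.shape, idx.shape)                   # the grasp decoder consumes z_q
```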

Current warehouses deploy robot fleets to carry products along predefined paths to a human co-worker. Packing these items can be strenuous and repetitive, making the human's job prone to long-term injuries. There is a need to improve both the intelligence and the capabilities of autonomous delivery and packing to facilitate smooth collaboration while increasing safety and efficiency. In this project, we look at:

1. modelling human behaviour;

2. socially-aware navigation among human co-workers; and 

3. intelligent manipulation for packing. 

We address the challenges of working safely with humans in the loop, especially considering occlusions and semi-structured environments (e.g., crowded intersections). The key contributions of the project lie in developing safe, human-aware robot crowd navigation in warehouse scenarios using accelerated Reinforcement Learning and human motion prediction. We are also developing optimization and initialization methods for slim/sparse neural network architectures for intelligent robot grasping. We deploy and evaluate our methods in simulation and in re-created warehouse environments. (View more here)

(Published in the IEEE 18th International Conference on Automation Science and Engineering (CASE), August 20-24, 2022, Mexico City.)

There is a sudden surge in the number of people ordering items online, and these orders are fulfilled through warehouses. Modern-day warehouses use robots for most tasks, including picking and sorting. Efficiency is of prime concern in any warehouse, yet current efforts in the literature are restricted to optimizing warehouse processes as an operations research problem with fixed travel costs. As the number of orders and robots increases, congestion arises in the warehouse, invalidating the fixed travel cost assumption. To facilitate research in warehousing, we first propose a modular warehouse simulator that simulates the business operations of order generation, order fulfillment scheduling, item picking, and sorting. The simulator also models the robots' travel within the warehouse network, charging scheduling, intersection management, and congestion management. With increasing demand, warehouses add more robots, pushing the transportation network beyond capacity. In this direction, we analyze the performance of the warehouse from a transportation perspective using fundamental diagrams. Because the warehouse contains only controllable entities (robots), congestion levels can be predicted and used in solving the planning problem. The results show improvements of around 5% in order fulfillment time and in the number of picks, which can significantly increase the profitability of the warehouse.

(View more here)
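As a toy illustration of replacing fixed travel costs with congestion-aware ones, the sketch below uses a Greenshields-style fundamental diagram in which aisle speed drops as robot density grows. The speed-density relation and all numbers are illustrative assumptions, not the simulator's calibrated model.

```python
# Toy sketch: congestion-aware travel times from a fundamental diagram,
# instead of the fixed travel costs assumed in prior work.
def edge_speed(density, free_speed=1.5, jam_density=0.5):
    """Speed (m/s) on a warehouse aisle as robot density (robots/m) grows."""
    return max(free_speed * (1.0 - density / jam_density), 0.05)

def travel_time(length_m, robots_on_edge):
    density = robots_on_edge / length_m
    return length_m / edge_speed(density)

if __name__ == "__main__":
    for robots in (0, 2, 4):
        print(robots, "robots ->", round(travel_time(10.0, robots), 2), "s on a 10 m aisle")
```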

For a robot to perform complex manipulation tasks, it needs good grasping ability. However, vision-based robotic grasp detection is hindered by the unavailability of sufficient labelled data, and the application of semi-supervised learning techniques to grasp detection remains underexplored. In this paper, a semi-supervised learning based grasp detection approach is presented, which models a discrete latent space using a Vector Quantized Variational AutoEncoder (VQ-VAE). To the best of our knowledge, this is the first time a Variational AutoEncoder (VAE) has been applied in the domain of robotic grasp detection. The VAE helps the model generalize beyond the Cornell Grasping Dataset (CGD) despite the limited amount of labelled data, by also utilizing the unlabelled data. This claim has been validated by testing the model on images that are not available in the CGD. In addition, we augment the Generative Grasping Convolutional Neural Network (GGCNN) architecture with the decoder structure used in the VQ-VAE model, with the intuition that it should help regression in the vector-quantized latent space. As a result, the model performs significantly better than existing approaches that do not make use of unlabelled images to improve grasp detection. (View more).

Solving the intelligent object grasping problem in an unstructured environment with a robot manipulator is a challenging task. To grasp an object, the robot must know the position of the object in the environment, decide how and where the gripper should be moved, and finally determine how the object is to be held. We propose a hybrid architecture for detecting optimal robotic grasps, applied to RGB images for both training and testing. The architecture works as follows. First, a convolutional neural network (ResNet-50), pre-trained via transfer learning, regresses grasping rectangles, generating multiple candidate rectangles for a single image. Second, an auto-encoder predicts a quality score for every rectangle regressed by the convolutional neural network and chooses the optimal rectangle among them. The Cornell Grasping Dataset has been used for training and testing; since the dataset is small, image augmentation has been performed to generate more images and improve generalization. The hybrid architecture achieves an accuracy of 75.34% on the object-wise split and 75.81% on the image-wise split. (View more).
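A minimal sketch of the selection stage: the backbone regresses several candidate grasp rectangles per image, a second network scores each one, and the highest-scoring rectangle is chosen. The rectangle encoding (x, y, w, h, theta) and the stub quality function standing in for the auto-encoder are assumptions.

```python
# Sketch of choosing the optimal grasp rectangle from regressed candidates.
import numpy as np

def select_best_rectangle(candidates, quality_fn):
    """candidates: (N, 5) array of grasp rectangles; quality_fn maps one
    rectangle to a scalar quality score (the auto-encoder's role)."""
    scores = np.array([quality_fn(r) for r in candidates])
    return candidates[int(scores.argmax())], float(scores.max())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    rects = rng.uniform(size=(8, 5))               # stand-in for ResNet-50 regressions
    stub_quality = lambda r: -abs(r[4] - 0.5)      # stand-in for the auto-encoder score
    best, score = select_best_rectangle(rects, stub_quality)
    print(best, score)
```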

Intelligent robot grasping is a very challenging task due to its inherent complexity and the non-availability of sufficient labelled data. Since suitable labelled data are crucial for effectively training any deep learning based model, including deep reinforcement learning, in this research we propose to generate grasping poses/rectangles using a Pix2Pix Generative Adversarial Network (Pix2Pix GAN), which takes an image of an object as input and produces the grasping rectangle tagged to the object as output. We propose an end-to-end methodology for generating a grasping rectangle and embedding it at an appropriate place on the object to be grasped, using two modules to obtain an optimal grasping rectangle. In the first module, the pose (position and orientation) of the generated grasping rectangle is extracted from the output of the Pix2Pix GAN, and the extracted grasp pose is then translated to the centroid of the object, since we hypothesize that, as in the human way of grasping regular-shaped objects, the centre of mass/centroid is the best place for stable grasping. For irregular-shaped objects, the generated grasping rectangles are fed to the robot as they are for grasp execution. With the limited Cornell Grasping Dataset augmented by our proposed approach, the accuracy of generating the grasping rectangle improves significantly, reaching 87.79%. Rigorous experiments have been performed with the Anukul/Baxter robot, whose 7 degrees of freedom (DOF) introduce redundancy. At the grasp execution level, we solve the inverse kinematics problem for such robots using a numerical inverse-pose solution together with resolved-rate control, which proves more computationally efficient because the Jacobian matrix is shared between the two. Experiments show that our proposed generative model based approach gives promising results in executing successful grasps for seen as well as unseen objects. (View more).
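The resolved-rate control step used at grasp execution can be sketched as below: joint velocities are obtained from the pseudoinverse of the Jacobian times the desired end-effector velocity, and the same Jacobian can be shared with the numerical inverse-pose iteration. The 2-link planar Jacobian is only a stand-in for Anukul/Baxter's 7-DOF kinematics.

```python
# Sketch of one resolved-rate control step with a pseudoinverse Jacobian.
import numpy as np

def jacobian_2link(q, l1=0.4, l2=0.3):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def resolved_rate_step(q, ee_vel_desired, dt=0.01):
    J = jacobian_2link(q)
    q_dot = np.linalg.pinv(J) @ ee_vel_desired    # pseudoinverse handles redundancy
    return q + q_dot * dt

if __name__ == "__main__":
    q = np.array([0.3, 0.6])
    for _ in range(100):                          # drive the EE slowly along +x
        q = resolved_rate_step(q, np.array([0.05, 0.0]))
    print(np.round(q, 3))
```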

Intelligent object manipulation for grasping is a challenging problem for robots. Unlike robots, humans know almost immediately how to manipulate objects for grasping, thanks to years of learning. In this paper, we develop learning-based pose estimation by decomposing the problem into position learning and orientation learning. For grasp position estimation, we explore three methods: a genetic algorithm (GA)-based optimization method that minimizes the error between calculated image points and the predicted end-effector (EE) position; a regression-based method (RM) in which collected robot EE positions and image points are fitted with a linear model; and a pseudoinverse (PI) model formulated as a mapping matrix between robot EE positions and image points over several observations. For grasp orientation learning, we develop a deep reinforcement learning (DRL) model, which we name grasp deep Q-network (GDQN), and benchmark our results against a Modified VGG16 (MVGG16). Rigorous experimentation shows that, owing to its inherent capability of producing very high-quality solutions for optimization and search problems, the GA-based predictor performs much better than the other two models for position estimation. For orientation learning, the results indicate that off-policy learning through GDQN outperforms MVGG16, since the GDQN architecture is specifically designed for reinforcement learning. Experiments based on our proposed architectures and algorithms show that the robot is capable of grasping nearly all rigid objects with regular shapes. (View more).
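A toy sketch of the GA-based position estimator: a population of candidate EE positions is evolved so that their projection matches the observed image point. The linear camera mapping and all GA hyper-parameters (population size, mutation scale, number of generations) are illustrative assumptions.

```python
# Sketch of a genetic algorithm minimising image-point/EE-position error.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[120.0, 0.0], [0.0, 115.0]])        # assumed pixel/metre mapping
b = np.array([320.0, 240.0])
project = lambda p: A @ p + b                     # EE position -> image point

def ga_estimate(target_pixel, pop_size=60, generations=80, sigma=0.02):
    pop = rng.uniform(-0.5, 0.5, size=(pop_size, 2))         # candidate EE positions
    for _ in range(generations):
        errors = np.linalg.norm(np.array([project(p) for p in pop]) - target_pixel, axis=1)
        parents = pop[np.argsort(errors)[: pop_size // 2]]    # selection
        children = parents + rng.normal(0.0, sigma, size=parents.shape)  # mutation
        pop = np.vstack([parents, children])
    errors = np.linalg.norm(np.array([project(p) for p in pop]) - target_pixel, axis=1)
    return pop[errors.argmin()]

if __name__ == "__main__":
    true_pos = np.array([0.12, -0.08])
    print(ga_estimate(project(true_pos)))          # should approach true_pos
```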

In the area of Robot Manipulation

Learning the inverse kinematics of humanoid and collaborative robots, which have inherent kinematic redundancy, is a challenging problem due to its multivalued nature. Since these robots hardly obey Pieper's recommendation (Pieper and Roth 1969), solutions to the inverse kinematics problem cannot always be obtained analytically. Recently, Invertible Neural Networks (INNs) have found success in solving such ill-posed inverse problems. In this work, we empirically show that density constraints on the latent variables during INN training can be replaced by an ex-post density estimation step. The advantage is twofold: the latent variables can have an arbitrarily complex distribution, and posterior mismatch is no longer an issue. Through experiments on learning the inverse kinematics of planar redundant serial robotic manipulators, we validate the efficacy of our approach. (View more)
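The ex-post density estimation idea can be sketched as follows: after training, the training data are pushed through the (already trained) invertible map to collect latents, a density is fitted to those latents, and at inference latent samples are drawn from that density and inverted. The affine stand-in for the INN keeps the example runnable and is purely an assumption.

```python
# Sketch of ex-post density estimation on INN latents (toy invertible map).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.4], [0.0, 1.0]])             # toy invertible map, stands in for a trained INN
forward = lambda x: x @ W.T                         # x -> z
inverse = lambda z: z @ np.linalg.inv(W).T          # z -> x

# 1) push the training data through the trained model to collect latents
x_train = rng.normal(size=(500, 2))
z_train = forward(x_train)

# 2) ex-post density estimation on the latents (arbitrarily complex is fine)
latent_density = gaussian_kde(z_train.T)

# 3) at inference, sample latents from the fitted density and invert them
z_samples = latent_density.resample(5).T
print(inverse(z_samples))                           # diverse candidate solutions
```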

Developing behavior-based robotic manipulation is a very challenging but necessary task, especially for humanoid and social robots. Fundamental robotic tasks such as grasping, pick and place, and trajectory following are at present solved using conventional forward and inverse kinematics (IK), dynamics, and trajectory planning, whereas humans learn such complex tasks from past experience. In this paper, we explore developing behavior-based robotic manipulation using reinforcement learning, more specifically learning directly from experiences through interactions with the real world and without knowing the transition model of the environment. We propose a multi-agent paradigm that gathers experience from multiple environments in parallel, along with a model for populating new generations of agents using an Evolutionary Actor-Critic Algorithm (EACA). Each agent has an actor-critic architecture, with both the actor and the critic implemented as general-purpose neural networks. The actor-critic architecture enables the model to perform well in both high-dimensional state spaces and high-dimensional action spaces, which is crucial for robotic applications. The proposed algorithm is benchmarked against different multi-agent paradigms while keeping the agent architecture the same. Reinforcement learning is highly data intensive, so CPU and GPU cores must be used judiciously both for sampling the environment and for training; these details are described in the paper. We have run rigorous experiments for learning joint trajectories on the OpenAI Gym-based KUKA arm manipulator, where our proposed method achieves learning stability within 300 episodes, compared to the state-of-the-art actor-critic and Asynchronous Advantage Actor-Critic (A3C) algorithms, both of which take more than 1000 episodes to learn the same task, showing the effectiveness of our proposed model. (View more).
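A hedged sketch of the evolutionary outer loop only: a population of agents is evaluated (possibly across parallel environments), the best performers are kept, and the next generation is populated by perturbing their weights. Network sizes, the elite fraction, the mutation scale, and the reward stub are assumptions, not the EACA hyper-parameters.

```python
# Sketch of populating a new generation of agents from the best performers.
import copy
import torch
import torch.nn as nn

def make_actor(obs_dim=8, act_dim=4):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

def mutate(actor, sigma=0.02):
    child = copy.deepcopy(actor)
    with torch.no_grad():
        for p in child.parameters():
            p.add_(sigma * torch.randn_like(p))   # Gaussian weight perturbation
    return child

def next_generation(population, episode_returns, elite_frac=0.25):
    order = sorted(range(len(population)), key=lambda i: episode_returns[i], reverse=True)
    elites = [population[i] for i in order[: max(1, int(elite_frac * len(population)))]]
    children = [mutate(elites[i % len(elites)]) for i in range(len(population) - len(elites))]
    return elites + children

if __name__ == "__main__":
    pop = [make_actor() for _ in range(8)]
    fake_returns = torch.randn(8).tolist()         # stand-in for rollout returns
    pop = next_generation(pop, fake_returns)
    print(len(pop), "agents in the new generation")
```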

Bipedal walking in an unknown environment is an extremely challenging problem due to the robot's inherently unstable structure. So far, this problem has been tackled using analytical methods with predefined sets of parameters, with limited success: such robots could walk only in a structured environment with a flat floor and other restrictions. This paper proposes models that impart learning ability to biped robots based on walking sequences. Two sequential network models have been configured, a Gated Recurrent Unit (GRU) and a Long Short-Term Memory (LSTM), to separately learn the joint trajectories of the hip, knee, and ankle joints, and their performance has been studied. Using these two models, learning has been imparted to a robot both in the sagittal plane and in the frontal plane. We use the PyBullet physics engine with a biped robot for simulation. Data have been collected by running several simulations, and the sequential network models have been trained on these data. We observe that the present walking patterns can be learned by both GRU and LSTM within ten episodes with an L2 regularizer. For the present study, the GRU model has several thousand fewer network parameters to learn than the LSTM model. We recommend the GRU as a first choice for controlling biped robots walking in simple modes. For more complex modes, such as walking with long steps, brisk walking, or walking with push recovery capability, larger data sets may be needed for training and testing before recommending which model is more suitable. Our recommendations are backed by rigorous experimentation. (View more).
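A minimal sketch of the sequence model used for gait learning, assuming a 6-joint layout (hip, knee, and ankle for both legs): a GRU maps a window of past joint angles to the next joint configuration, trained with mean-squared error and an L2 regularizer via weight decay. Layer sizes and the data layout are illustrative assumptions.

```python
# Sketch of a GRU predicting the next joint configuration from a gait window.
import torch
import torch.nn as nn

class GaitGRU(nn.Module):
    def __init__(self, num_joints=6, hidden=64):
        super().__init__()
        self.gru = nn.GRU(num_joints, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_joints)

    def forward(self, joint_window):               # (batch, time, num_joints)
        out, _ = self.gru(joint_window)
        return self.head(out[:, -1])               # next joint angles

if __name__ == "__main__":
    model = GaitGRU()
    optimiser = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # L2 regularizer
    window = torch.randn(4, 20, 6)                 # stand-in for simulator gait data
    target = torch.randn(4, 6)                     # stand-in for the next joint angles
    loss = nn.functional.mse_loss(model(window), target)
    loss.backward()                                # one supervised training step
    optimiser.step()
    print(model(window).shape)
```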

Inverse Kinematics (IK) is a mapping from a manipulating robot's end-effector (EE) space to its actuator space. Calculating the joint angles from given EE coordinates is a difficult task, and various geometrical and analytical methods have been proposed for it. However, such solutions are specific to the robot's kinematic structure, and analytical solutions sometimes require imposing Pieper's constraint, which places severe kinematic restrictions that may not be acceptable for humanoid social robots. To help solve this problem, the current research proposes a methodology for calculating the joint angles from given EE coordinates using Reinforcement Learning (RL) for an n-link planar manipulator. First, the workspace of the robot is evaluated using Monte Carlo methods, and the states and actions are converted from the continuous domain to the discrete domain. Subsequently, the Q-table is updated using the State-Action-Reward-State-Action (SARSA) algorithm. Conventional RL works well when the number of links is small, such as two or three; as the number of links increases, the computational cost grows exponentially and the conventional RL algorithm takes a long time to learn. In our proposed modified RL algorithm, the size of the Q-table is significantly smaller than in the conventional RL algorithm, and hence the computational cost is reduced. The proposed algorithm also provides the robot with obstacle avoidance capability for static obstacles present in its workspace. The results show an encouraging trend towards substituting IK with learning-based models to design and develop social robots of various kinematic shapes, free from Pieper's and other constraints.
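The tabular SARSA update at the heart of the method can be sketched on a discretized 2-link planar arm: states are binned joint angles, actions increment or decrement one joint, and the reward is the negative end-effector distance to the target. Discretization, link lengths, and hyper-parameters are assumptions.

```python
# Toy sketch of tabular SARSA on a discretised 2-link planar arm.
import numpy as np

rng = np.random.default_rng(0)
N_BINS, ACTIONS = 24, [(0, 1), (0, -1), (1, 1), (1, -1)]   # (joint index, +/- one bin)
L1 = L2 = 0.5
TARGET = np.array([0.6, 0.4])
Q = np.zeros((N_BINS, N_BINS, len(ACTIONS)))

def ee(state):
    q = 2 * np.pi * np.array(state) / N_BINS
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def step(state, a):
    joint, delta = ACTIONS[a]
    new = list(state)
    new[joint] = (new[joint] + delta) % N_BINS
    return tuple(new), -np.linalg.norm(ee(new) - TARGET)     # reward: -distance to target

def policy(state, eps=0.1):                                  # epsilon-greedy
    return rng.integers(len(ACTIONS)) if rng.random() < eps else int(Q[state].argmax())

alpha, gamma = 0.5, 0.95
for _ in range(2000):                                        # episodes
    s, a = (0, 0), policy((0, 0))
    for _ in range(50):
        s2, r = step(s, a)
        a2 = policy(s2)
        Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])  # SARSA update
        s, a = s2, a2
print("best first action from the start state:", ACTIONS[int(Q[(0, 0)].argmax())])
```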

In the area of Real-time Emotion Recognition

As human-robot interaction gains attention day by day with the increasing need for automation in every field, personal robots are being used in more and more areas, such as attending to the needs of elderly people, supporting therapy for autistic patients and children, and even babysitting. Since robots assist humans in all these cases, they need to understand human emotion in order to respond in a more personalized manner. Predicting human emotion is a difficult problem that has been worked on for over a decade. In this paper, we build a model that can predict human emotion from an image in real time. The network is based on a convolutional neural network and has 90× fewer parameters than a vanilla CNN and 50× fewer than the latest state-of-the-art research, to the best of our knowledge. The network is tested robustly on 8 different datasets, namely Fer2013, CK and CK+, the Chicago Face Database, the JAFFE dataset, the FEI face dataset, IMFDB, TFEID, and a custom dataset built in our laboratory with different angles, faces, backgrounds, and age groups. The network achieves 74% accuracy, improving on the state of the art with reduced computational complexity. (View more)

For effective communication between two humans, one needs to understand the emotional state of one's fellow beings. Predicting human emotion involves several factors, which include, but are not limited to, facial expression. Much research in this direction is based on features extracted from facial expressions. In this paper we propose a model that predicts human emotion by considering facial as well as contextual information. Our model not only extracts features from facial expressions but is also aware of the background context in the image. We show the relevance of contextual information to facial expression and its impact on the predicted results. Our model captures facial expressions and contextual information, boosting the most relevant parts to extract features and using them to predict human emotion. We use an attention model in our architecture to boost the relevant parts and to learn what to boost in order to make a correct prediction. We have performed several experiments comparing facial expressions in context-aware and context-free settings. Our proposed model is robust and capable of predicting emotions in real time, with an accuracy improvement of 8% over the state of the art, to the best of our knowledge. In addition, it is evaluated on a dataset that contains mostly spontaneous rather than posed images, leading to improved results.


List of Sponsored Projects Undertaken/Performed:

In India  

Recent projects 

Project title: "Turning 'Tragedy of the Commons (ToC)' into 'Emergent Cooperative Behavior (ECB)' for Automated Vehicles at Intersections with Meta-Learning,"(DST # TPN/97724, NSF tracking ID# 2343167).  

Past projects



Funding brought from external agencies for conducting other outreach projects:

Organized PAC Meeting of the Expert Committee-USERS, March 20, 2012, with DST funding of Rs 10 lakhs.

Organized DST-PAC Meeting in the area of Robotics, Mechanical and Manufacturing Engineering at IIIT-A during 4-5 October 2013. DST funding Rs 12 lakhs.

Organized a workshop on Soft Computing from 15-19 May 2007; DST funding received Rs 10 lakhs. Other sponsors: Microsoft, DRDO.

Organized a Summer School on Robotics from June 7-13, 2014; DST funding Rs 10 lakhs. Other sponsors: Microsoft, Nugenix, National Instruments.

Organized (jointly) the Inspire Program and Nobel Laureate Science Conclave, December 2014, sponsored by DST, Rs 1 crore.

In USA & Hong Kong

•   From the US DOE, for research on developing a mercuric iodide based semiconductor detector for mobile robots deployed in a radioactive decontamination mission. Project completed.

•   From Ford Motor Co., Detroit, for research on optimizing process parameters for sheet metal braze welding using neural networks. Project completed.

•   Developing a controller for single-wheel, gyroscopically stabilized robots, CUHK, Hong Kong.


Names of candidates who have already completed and been awarded Ph.D. degrees, with their Ph.D. thesis titles:

(Total: 14)