Research
A brief summary of my current research endeavors
Talks/Seminars about my work
CESG Seminar at Texas A&M University [YouTube]
Sajad Mousavi, Ricardo Luna Gutiérrez, Desik Rengarajan, Vineet Gundecha, Ashwin Ramesh Babu, Avisek Naug, Antonio Guillen, Soumyendu Sarkar. Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination. In Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models, NeurIPS 2023
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model’s own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors.
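The critique-and-refine loop described above can be sketched in a few lines. This is a toy illustration of the control flow only: the critics and the reviser here are stand-in functions I made up (a real system would query an LLM), and `refine_with_critics`, `max_rounds`, and the `(ok, message)` critic interface are my own assumptions, not the paper's API.

```python
# Toy sketch of the critique-and-refine loop: an ensemble of critics scores a
# draft, and the draft is revised until every critic passes or a budget runs
# out. Only the control flow mirrors the idea in the abstract; the critic and
# reviser functions are illustrative stand-ins for LLM calls.

def refine_with_critics(draft, critics, revise, max_rounds=3):
    """Iteratively revise `draft` until every critic approves it."""
    for _ in range(max_rounds):
        feedback = [critic(draft) for critic in critics]
        if all(ok for ok, _ in feedback):
            return draft
        # collect the messages of the failing critics and revise
        notes = [msg for ok, msg in feedback if not ok]
        draft = revise(draft, notes)
    return draft
```

In a full system, each critic would target one failure mode (toxicity, factuality, bias) and the revision step would feed the critics' notes back into the model's prompt.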
Desik Rengarajan, Nitin Ragothaman, Dileep Kalathil, and Srinivas Shakkottai. Federated Ensemble-Directed Offline Reinforcement Learning. In Workshop on Federated Learning and Analytics in Practice, International Conference on Machine Learning, 2023
We consider the problem of federated offline reinforcement learning (RL), a scenario in which distributed learning agents must collaboratively learn a high-quality control policy using only small pre-collected datasets generated according to different unknown behavior policies. Naively combining a standard offline RL approach with a standard federated learning approach to solve this problem can lead to poorly performing policies. In response, we develop the Federated Ensemble-Directed Offline Reinforcement Learning Algorithm (FEDORA), which distills the collective wisdom of the clients using an ensemble learning approach. We develop the FEDORA codebase to utilize distributed compute resources on a federated learning platform. We show that FEDORA significantly outperforms other approaches, including offline RL over the combined data pool, in various complex continuous control environments and on real-world datasets. Finally, we demonstrate the performance of FEDORA in the real world on a mobile robot.
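One way to picture "distilling the collective wisdom of the clients" is performance-weighted aggregation: instead of plain federated averaging, weight each client's policy parameters by a softmax of its estimated return, so better-performing clients contribute more. The function below is my own minimal illustration of that idea, not FEDORA's actual update rules; `ensemble_aggregate` and the softmax weighting are assumptions.

```python
import numpy as np

# Minimal sketch of performance-weighted federated aggregation: each client's
# policy parameters are weighted by a softmax of its estimated return, so
# stronger clients dominate the server policy. Illustrative only.

def ensemble_aggregate(client_params, client_returns, temperature=1.0):
    """Return a server policy as a return-weighted average of client policies."""
    returns = np.asarray(client_returns, dtype=float)
    z = (returns - returns.max()) / temperature   # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()         # softmax over client returns
    stacked = np.stack(client_params)             # (num_clients, param_dim)
    return weights @ stacked
```

With equal returns this reduces to ordinary federated averaging; as the return gap grows, the aggregate concentrates on the best client's parameters.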
Desik Rengarajan*, Sapana Chaudhary*, Jaewon Kim, Dileep Kalathil, and Srinivas Shakkottai. Enhanced meta reinforcement learning using demonstrations in sparse reward environments. In Advances in Neural Information Processing Systems, 2022
Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. The meta-policy, when adapted over only a small number of steps (or even a single step), is able to perform near-optimally on a new, related task. However, a major challenge in applying this approach to real-world problems is that they are often associated with sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms, entitled Enhanced Meta-RL using Demonstrations (EMRLD), that exploit this information, even if it is sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that demonstrates monotone performance improvements. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse reward environments, including that of a mobile robot.
Desik Rengarajan, Gargi Vaidya, Akshay Sarvesh, Dileep Kalathil, and Srinivas Shakkottai. Reinforcement learning with sparse rewards using guidance from offline demonstration. In International Conference on Learning Representations, 2022
Presented as a spotlight paper at The Tenth International Conference on Learning Representations, ICLR 2022 (top 5.1%) [Paper][Code]
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits offline demonstration data generated by a sub-optimal behavior policy for fast and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy in the manner of the sub-optimal behavior policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm, including a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored states. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
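The "guide, don't imitate" idea can be illustrated on a toy two-action bandit: alternate a policy-improvement step (move toward higher estimated reward) with a guidance step (move toward the demonstrator's behavior policy), and decay the guidance weight so the learner can surpass the sub-optimal demonstrator. Everything below (the softmax policy, step sizes, and decay schedule) is my own illustrative choice, not LOGO's actual trust-region updates.

```python
import numpy as np

# Toy sketch of LOGO's two-step idea on a one-state, two-action bandit.
# A sub-optimal demonstrator prefers the worse action; early on the learner
# is pulled toward it (guidance), but the guidance weight decays, letting the
# policy-improvement step take over and approach the optimal action.

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

true_reward = np.array([0.2, 1.0])       # action 1 is optimal
behavior = np.array([0.6, 0.4])          # demonstrator slightly prefers action 0

logits = np.zeros(2)
delta = 1.0                              # guidance weight, decayed each step
for t in range(500):
    pi = softmax(logits)
    # policy improvement: softmax policy-gradient step on expected reward
    logits += 0.1 * pi * (true_reward - pi @ true_reward)
    # guidance: nudge the policy toward the demonstrator, with decaying weight
    logits += 0.1 * delta * (behavior - pi)
    delta *= 0.99

pi = softmax(logits)
```

After the guidance weight has decayed, the improvement step dominates and the final policy concentrates on the optimal action, beyond what imitating the demonstrator could achieve.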
Kiyeob Lee*, Desik Rengarajan*, Dileep Kalathil, and Srinivas Shakkottai. Reinforcement learning for mean field games with strategic complementarities. In International Conference on Artificial Intelligence and Statistics, 2021
Presented at The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021 [Paper]
In this work, we look at a special case of multi-agent reinforcement learning called mean field reinforcement learning. Mean Field Games (MFG) are those in which each agent optimizes its action with respect to the distribution of states of the other players in the system. Systems with a large number of agents can be modeled using this approach. The equilibrium concept here is a Mean Field Equilibrium (MFE), and algorithms for learning MFE in dynamic MFGs are unknown in general due to the non-stationary evolution of the state distribution. We focus on an important subclass that possesses a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement of the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We introduce both a model-free and a model-based approach to learning T-MFE under unknown transition probabilities, using the trembling-hand idea of enabling exploration, and we analyze the sample complexity of both algorithms. We also develop a scheme for concurrently sampling the system with a large number of agents that obviates the need for a simulator, even though the model is non-stationary. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications.
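The "trembling hand" can be pictured as a standard mixing construction: with probability 1 - eps the agent plays its intended action distribution, and with probability eps it trembles to a uniformly random action, which guarantees every action is explored. The mixing form below is a textbook device I am using for illustration; the actual T-MFE construction in the paper is more involved.

```python
import numpy as np

# Illustrative sketch of a trembling-hand policy: mix the intended policy
# with the uniform distribution so that every action has probability >= eps/n.

def trembling_hand_policy(intended_probs, eps):
    """Return (1 - eps) * pi + eps * uniform over the same action set."""
    intended_probs = np.asarray(intended_probs, dtype=float)
    n = intended_probs.size
    return (1 - eps) * intended_probs + eps * np.ones(n) / n
```

The mixed distribution remains a valid policy (it sums to one) while bounding every action's probability away from zero, which is the exploration property the learning algorithms exploit.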
Archana Bura, Desik Rengarajan, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. Learning to cache and caching to learn: Regret analysis of caching algorithms. In IEEE/ACM Transactions on Networking, 2021
Published in IEEE/ACM Transactions on Networking [Paper][Arxiv]
Crucial performance metrics of a caching algorithm include its ability to quickly and accurately learn the popularity distribution of requests. However, the majority of work on analytical performance evaluation focuses on the hit probability after an asymptotically large time has elapsed. We consider an online learning viewpoint, and characterize regret as the finite-time difference between the hits achieved by a candidate caching algorithm and those of a genie-aided scheme that places the most popular items in the cache. We first consider the Full Observation regime, wherein all requests are seen by the cache. We show that the Least Frequently Used (LFU) algorithm is able to achieve order-optimal regret, which is matched by an efficient counting algorithm design that we call LFU-Lite. We then consider the Partial Observation regime, wherein only requests for items currently cached are seen by the cache, making it similar to the multi-armed bandit problem. We show how approaching this caching bandit with traditional methods yields either high complexity or high regret, while a simple algorithm design that exploits the structure of the distribution can ensure order-optimal regret. We conclude by illustrating our insights using numerical simulations.
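The full-observation LFU baseline is easy to state in code: count every request ever seen and keep the k most-requested items cached. The class below is my own minimal sketch of plain LFU in this regime; it is not the LFU-Lite design from the paper, whose point is to achieve the same regret with cheaper counting.

```python
from collections import Counter

class LFUCache:
    """Sketch of full-observation LFU: cache the k most frequently
    requested items, counting every request (hit or miss)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()   # request counts for every item ever seen

    def cache(self):
        """Current cache contents: the k most-requested items so far."""
        return {item for item, _ in self.counts.most_common(self.capacity)}

    def request(self, item):
        """Record a request; return True on a cache hit."""
        hit = item in self.cache()
        self.counts[item] += 1
        return hit
```

Regret here is the cumulative hit shortfall of such a scheme relative to a genie that caches the true top-k items from time zero; the partial-observation regime is harder precisely because misses cannot update the counts of uncached items.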
Rajarshi Bhattacharyya, Archana Bura, Desik Rengarajan, Mason Rumuly, Bainan Xia, Srinivas Shakkottai, Dileep Kalathil, Ricky KP Mok, and Amogh Dhamdhere. QFlow: A learning approach to high QoE video streaming at the wireless edge. In IEEE/ACM Transactions on Networking, 2021
Published in IEEE/ACM Transactions on Networking [Paper][Arxiv]
Conference version presented at The Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2019
The predominant use of wireless access networks is for media streaming applications. However, current access networks treat all packets identically and lack the agility to determine which clients are most in need of service at a given time. Software reconfigurability of networking devices has seen wide adoption, which in turn implies that agile control policies can now be instantiated on access networks. Exploiting such reconfigurability requires the design of a system that can enable a configuration, measure the impact on application performance (Quality of Experience), and adaptively select a new configuration. Effectively, this feedback loop is a Markov Decision Process whose parameters are unknown. The goal of this work is to develop QFlow, a platform that instantiates this feedback loop, and to instantiate a variety of control policies over it. We use the popular application of video streaming over YouTube as our use case. Our context is priority queueing, with the action space being that of determining which clients should be assigned to each queue at each decision period. We first develop policies based on model-based and model-free reinforcement learning. We then design an auction-based system under which clients place bids for priority service, as well as a more structured index-based policy. Through experiments, we show how these learning-based policies on QFlow are able to select the right clients for prioritization in a high-load scenario, outperforming the best known solutions with an over 25% improvement in QoE and achieving a perfect QoE score of 5 more than 85% of the time.
*Denotes equal contribution
Works from my past life
Chirag Ramesh Srivatsa, Desik Rengarajan, and Vamsi Krishna Tumuluru. Modeling demand flexibility of electric vehicles. In 2017 IEEE International Conference on Smart Grid Communications (SmartGridComm), 2017
A. N. Nagamani, Desik Rengarajan, and Vinod K. Agrawal. An optimized design of reversible magnitude and signed comparators. In 2016 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), 2016
Shakthi D. Prasad, Subba B. Reddy, Alok R. Verma, Desik Rengarajan, and Veeresh N. Patil. Influence of Corona Intensity on the Hydrophobicity Recovery of Polymeric Insulating Samples. In International Conference on Condition Assessment Techniques in Electrical Systems (CATCON), 2015