research

Policy Optimization in Dynamical Systems

We begin our work in a multi-armed bandit setting where the rewards of the arms depend linearly on a common unknown parameter.
The problem is one of pure exploration, and the objective is to find the arm with the largest expected reward while sampling from as few arms as possible.
We propose two algorithms, PELEG and GLUCB, that show excellent sample complexity and outperform state-of-the-art algorithms.
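
To make the linear reward structure concrete, here is a minimal sketch of best-arm identification in a two-dimensional linear bandit. It is not PELEG or GLUCB: it samples the arms in plain round-robin order and then fits the shared parameter by least squares. The arm vectors, the true parameter, the noise level and all function names are hypothetical choices made for illustration.

```python
import random


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def identify_best_arm(arms, theta_true, pulls_per_arm, noise_std, seed=0):
    """Round-robin sampling followed by a least-squares estimate of theta.

    Every pull of any arm contributes to the estimate of the single shared
    parameter, which is what distinguishes this setting from a bandit with
    unrelated arms.
    """
    rng = random.Random(seed)
    gram = [[1e-6, 0.0], [0.0, 1e-6]]  # tiny ridge term keeps the system invertible
    moment = [0.0, 0.0]
    for _ in range(pulls_per_arm):
        for x in arms:
            # Noisy reward, linear in the unknown parameter (assumed Gaussian noise).
            reward = x[0] * theta_true[0] + x[1] * theta_true[1]
            reward += rng.gauss(0.0, noise_std)
            for i in range(2):
                moment[i] += reward * x[i]
                for j in range(2):
                    gram[i][j] += x[i] * x[j]
    theta_hat = solve_2x2(gram, moment)
    estimates = [x[0] * theta_hat[0] + x[1] * theta_hat[1] for x in arms]
    best = max(range(len(arms)), key=estimates.__getitem__)
    return best, theta_hat


# Hypothetical instance: three arms in R^2 and a made-up true parameter.
arms = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.8)]
best, theta_hat = identify_best_arm(arms, (1.0, 0.5), pulls_per_arm=200, noise_std=0.1)
print("estimated best arm:", best, "estimated parameter:", theta_hat)
```

A pure-exploration algorithm would replace the round-robin loop with an adaptive sampling rule and a stopping criterion, so that fewer samples are needed before the best arm can be declared.
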
We then move on to policy optimization in dynamical systems: we consider the setting where a learner is given M base controllers for an unknown Markov decision process (MDP) and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
Our algorithms, SoftMaxPG and GradEst, (a) work well even when the value function and its gradient are unavailable in closed form and (b) stabilize the dynamical system even when each constituent policy at their disposal is unstable.
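
The idea of optimizing over mixtures of controllers can be illustrated with a small sketch. It is not SoftMaxPG or GradEst themselves: it parameterizes the mixture weights with a softmax and, because the value function is treated as a black box, estimates the gradient by central finite differences. The function `toy_value` is a made-up concave objective standing in for a value estimated from rollouts; it is chosen so that the optimum lies strictly inside the simplex, meaning the best mixture beats both base controllers. All names and constants are assumptions.

```python
import math


def softmax(theta):
    """Map unconstrained parameters to mixture weights on the simplex."""
    shift = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def toy_value(weights):
    """Hypothetical black-box value of a mixture (stand-in for rollouts)."""
    scores = [0.0, 1.0]  # made-up per-controller quantities
    target = 0.7         # made-up optimum, strictly between the two
    mixed = sum(w * s for w, s in zip(weights, scores))
    return -(mixed - target) ** 2


def estimate_gradient(value_fn, theta, eps=1e-4):
    """Central finite-difference estimate of d value / d theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        diff = value_fn(softmax(up)) - value_fn(softmax(down))
        grad.append(diff / (2 * eps))
    return grad


def optimize_mixture(value_fn, num_controllers, steps=500, lr=1.0):
    """Gradient ascent on the softmax parameters, starting from uniform weights."""
    theta = [0.0] * num_controllers
    for _ in range(steps):
        grad = estimate_gradient(value_fn, theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


weights = optimize_mixture(toy_value, num_controllers=2)
print("mixture weights:", weights, "value:", toy_value(weights))
```

In an actual MDP the black-box evaluation would be noisy and expensive, so the step size, the perturbation size and the number of rollouts per evaluation would all need care; the sketch sidesteps these issues by using a deterministic objective.
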

We studied the problem of medium access control (MAC) in large wireless sensor networks with multiple sinks. The communicating entities comprising these networks are assumed to possess limited computing power, memory and battery power.
The objective was to propose throughput-optimal scheduling policies for such networks that also possess good packet delay properties and are amenable to decentralized implementation.
We proposed and analyzed multiple scheduling policies that need only the empty/non-empty status of each packet queue in the network in every time slot to compute the schedule, making them easy to decentralize.
The proposed policies outperformed well-known state information-heavy policies such as α-MaxWeight with respect to delay on several classes of networks.

We studied the problem of low-delay MAC in a collocated wireless network in the same highly resource-challenged setting as the above problem.
Any scheduler that knows the empty/non-empty status of every queue in every slot achieves the best possible packet delay performance by serving a non-empty queue (assuming one exists) in every slot. It is, however, not amenable to decentralized implementation.
We proposed scheduling policies that can be implemented using only the knowledge that the queues in the system can obtain by sensing activity (or lack thereof) on the channel. In this limited state information setting, we used the theory of partially observable Markov decision processes (POMDPs) to prove the delay optimality of our policies.
We developed MAC protocols based on these policies. Simulations show that the delay performance of these protocols is very close to that of the aforementioned centralized scheduler.
Implementation of these protocols on sensor motes is underway.
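
The centralized benchmark mentioned above, which serves some non-empty queue in every slot, can be sketched in a few lines. This is only the benchmark, not the proposed channel-sensing policies. The Bernoulli arrival model, the one-packet-per-slot service, the lowest-index tie-breaking rule and all names are assumptions made for illustration.

```python
import random


def serve_any_nonempty(queues):
    """Return the index of a non-empty queue, or None if all are empty.

    The rule needs the empty/non-empty status of every queue, which is
    why it requires a central scheduler.
    """
    for index, backlog in enumerate(queues):
        if backlog > 0:
            return index  # lowest index first: an arbitrary tie-breaking choice
    return None


def simulate(num_queues, arrival_prob, num_slots, seed=0):
    """Simulate a collocated network where at most one packet is sent per slot."""
    rng = random.Random(seed)
    queues = [0] * num_queues
    arrivals = 0
    served = 0
    idle_slots_with_backlog = 0
    for _ in range(num_slots):
        # Assumed arrival model: independent Bernoulli arrivals at each queue.
        for i in range(num_queues):
            if rng.random() < arrival_prob:
                queues[i] += 1
                arrivals += 1
        choice = serve_any_nonempty(queues)
        if choice is None:
            if sum(queues) > 0:
                idle_slots_with_backlog += 1  # a work-conserving rule never gets here
        else:
            queues[choice] -= 1
            served += 1
    return {
        "arrivals": arrivals,
        "served": served,
        "backlog": sum(queues),
        "idle_slots_with_backlog": idle_slots_with_backlog,
    }


print(simulate(num_queues=4, arrival_prob=0.2, num_slots=1000))
```

The scheduler never leaves the channel idle while packets are waiting, which is the property a decentralized policy has to approximate using channel sensing alone.
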

Modern computer architectures make a clear split between processing elements (such as CPUs and GPUs) and memory elements (such as DRAMs).
The entity that serves read or write requests coming from the processing element is called a memory controller (memcon).
The memcon needs to keep request-satisfaction latency low, while also respecting the timing constraints of various memory elements.
Queueing-theoretic model-based optimization quickly becomes intractable due to (a) the sheer complexity of the architecture and (b) the diversity of QoS demands of various request sources.
We would therefore like to use reinforcement learning (RL) techniques to train the memcon to optimally satisfy the competing requirements of latency and QoS.

We studied the problem of reducing the length of the Cyclic Prefix (CP) in a wireless Orthogonal Frequency Division Multiplexing (OFDM) communication system.
The use of a CP that is shorter than the length of the channel impulse response results in inter-symbol interference, which we controlled with an adaptive filter known in the literature as a Time Domain Equalizer (TEQ).
We proposed low-complexity adaptations of these TEQs to handle equalization over time-varying channel coefficients (variations caused by Doppler effects, for example).
Simulations on the 3GPP-LTE Spatial Channel Model (Urban Macro scenario) showed that our algorithms can handle Doppler effects caused by mobility at vehicular speeds of up to 150 km/h.
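
The reason a short CP causes inter-symbol interference can be seen in a small numerical sketch. With a CP at least as long as the channel memory, the received block (after the CP is discarded) equals the circular convolution of the transmitted block with the channel, which is what OFDM demodulation relies on. With a shorter CP, samples from the previous block leak in. The sketch does not implement a TEQ; the block values, the three-tap channel and the function names are hypothetical.

```python
def linear_convolve(x, h):
    """Linear convolution of two finite sequences."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def circular_convolve(x, h):
    """Circular convolution of a block x with a channel h, modulo len(x)."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]


def add_cp(block, cp_len):
    """Prepend the last cp_len samples of the block as a cyclic prefix."""
    if cp_len == 0:
        return list(block)
    return block[-cp_len:] + block


def received_second_block(first, second, h, cp_len):
    """Send two CP-prefixed blocks back to back and return what the
    receiver sees for the second block after discarding its CP."""
    tx = add_cp(first, cp_len) + add_cp(second, cp_len)
    rx = linear_convolve(tx, h)
    start = (cp_len + len(first)) + cp_len
    return rx[start:start + len(second)]


first = [1, 2, 3, 4]
second = [5, 6, 7, 8]
channel = [1, 2, 1]  # made-up channel with a memory of two samples

ideal = circular_convolve(second, channel)
long_cp = received_second_block(first, second, channel, cp_len=2)
short_cp = received_second_block(first, second, channel, cp_len=1)
print("CP covers channel memory:", long_cp == ideal)
print("CP too short:", short_cp == ideal)
```

A TEQ addresses the short-CP case by filtering the received signal so that the effective channel is short enough for the CP to cover.
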