Online Learning -- Human-Robot Interaction
Traditional human-robot interaction (HRI) methods learn a human preference model through inverse reinforcement learning (IRL). Popular methods include maximum-entropy IRL [1] and various Bayesian estimation approaches ([2], [3]). Often, human preferences are encoded as the inner product of a linear, individual-specific weight vector with a nonlinear feature representation. An individual's weight vector induces the policy they will use when interacting with the robot; to be maximally effective, the robot should therefore learn this weight vector so that it can better predict the human's intentions and future actions.
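As one concrete, purely illustrative instantiation of this model (the notation here is ours, not fixed above), the human's reward on a trajectory $\xi$ might be written as
\[
R_H(\xi) \;=\; w^{\top}\phi(\xi), \qquad w \in \mathbb{R}^{d}, \quad \phi(\xi) \in \mathbb{R}^{d},
\]
where $\phi$ is a shared nonlinear feature map and $w$ is the individual-specific weight vector; the human then acts (approximately) optimally with respect to $R_H$, e.g., noisily or Boltzmann-rationally as in maximum-entropy IRL [1].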
A natural question, then, is the degree to which a robot can learn this information online, i.e., while it is interacting with the human, and what effects the learning process has on downstream task metrics. What is the nature of the learnability if the human's preference is adversarial? If it is cooperative? If it is a mixture? We might believe, for example, that a fully adversarial and perfectly actuated human should be able to arbitrarily exploit any mistaken belief, possibly by masking their intentions. Conversely, we might believe that the cooperative setting is easier, as the human has an incentive to convey their intentions as efficiently as possible [4].
Concretely, we ask: "Under what (minimal) conditions can we establish the learnability of the human's representation?" Here, `learnability' takes the form of a sublinear regret bound on the estimation error of the human weights. This has connections to online convex optimization and bandit algorithms [5], and implicitly to notions of consistent estimation (i.e., as the amount of data grows, can we guarantee convergence to the correct human representation, and at a sufficiently fast rate?).
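As a minimal sketch of what such an online learner might look like -- assuming, purely for illustration, a Boltzmann-rational human choice model and a projected online-gradient update, neither of which is fixed by the discussion above -- consider:

# Minimal sketch of online estimation of a human preference vector, assuming
# (hypothetically) a Boltzmann-rational human whose action probabilities are
# proportional to exp(w . phi(s, a)). The robot runs projected online gradient
# ascent on the log-likelihood of the observed actions.
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, T = 5, 4, 2000
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)           # assume unit-norm preferences
w_hat = np.zeros(d)                        # robot's running estimate
errors = []                                # estimation-error trajectory

for t in range(1, T + 1):
    Phi = rng.normal(size=(n_actions, d))  # features of the available actions
    # Human samples an action from a Boltzmann (softmax) policy.
    logits = Phi @ w_true
    p = np.exp(logits - logits.max())
    p /= p.sum()
    a = rng.choice(n_actions, p=p)
    # Online gradient ascent on the log-likelihood of the observed action.
    logits_hat = Phi @ w_hat
    q = np.exp(logits_hat - logits_hat.max())
    q /= q.sum()
    grad = Phi[a] - q @ Phi                # gradient of log softmax likelihood
    w_hat += (1.0 / np.sqrt(t)) * grad     # standard O(1/sqrt(t)) step size
    norm = np.linalg.norm(w_hat)
    if norm > 1.0:                         # project back onto the unit ball
        w_hat /= norm
    errors.append(np.linalg.norm(w_hat - w_true))

print(f"estimation error after {T} rounds: {errors[-1]:.3f}")

Under standard convexity and boundedness assumptions, updates of this kind enjoy sublinear (order square-root-of-T) regret on the log-loss; turning such a guarantee into a rate on the estimation error itself requires additional identifiability assumptions, which is precisely the kind of condition we want to pin down.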
Sequential Policy Comparison
In many robotics applications, empirical performance is the sole agreed-upon metric for evaluating research interventions -- changes in architecture, hardware, pretrained visual embeddings, datasets, and so on. However, the developments presaged by classical sequential analysis ([1], [2]) and refined across a variety of application domains (biostatistics, medicine, etc. [3], [4]) have not been carried over to robotics. Concretely: although the cost of evaluation on hardware is substantial, and efficient evaluation is therefore deeply valuable, sequential evaluation has not been adopted in our discipline. Risks of overfitting and inadvertent p-hacking abound.
Therefore, we are interested concretely in the following problem: given two robotic policies designed to complete a task in a given domain (with an associated distribution over task realizations), decide as quickly as possible -- subject to guarantees on Type 1 Error -- whether the second policy is better than the first. For example, "is my policy better than your policy?" Or, "is my policy better than a baseline policy?" Or, "is my new policy better than my old policy?" When performance provides the 'meta-gradient' in the design space by which new models, architectures, and policies are adopted, it is crucial that the gradients be reliable; otherwise, research effort and computational effort are wasted replicating results that are not significant relative to their baselines -- i.e., on random noise.
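In symbols (ours, for concreteness): writing $\mu_A$ and $\mu_B$ for the expected task performance (e.g., success rate) of the two policies under the domain's distribution over task realizations, we wish to test
\[
H_0 : \mu_B \le \mu_A \quad \text{versus} \quad H_1 : \mu_B > \mu_A
\]
with a sequential procedure whose Type 1 Error is at most a prescribed level $\alpha$ and whose expected number of hardware trials at stopping is as small as possible.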
Our setting has been widely studied within the statistics literature; however, most of that effort has gone towards generalizing the methods `to infinity' -- towards power-one tests that maintain Type 1 Error control across many scales of sample complexity (as the difference between the policies shrinks, the number of samples needed grows). Our interest, by contrast, is in the strongly finite-sample regime, where the desiderata are twofold:
Develop a computationally efficient method for running statistical tests of performance differences between robotic policies (a minimal sketch follows this list).
Establish fundamental limits on policy discrimination in the finite-sample regime.
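For the first item, the following is a minimal sketch of the kind of procedure we have in mind, assuming (hypothetically) paired per-episode scores in [0, 1] for the two policies and using a time-uniform Hoeffding confidence sequence; it is illustrative rather than a specific method from the literature:

# Minimal sketch of an anytime-valid sequential comparison of two policies.
# Assumptions (illustrative, not from the text above): each trial yields paired
# per-episode scores in [0, 1] for policies A and B on the same task
# realization, and we test
#   H0: E[score_B - score_A] <= 0   vs.   H1: E[score_B - score_A] > 0.
import math
import random


def sequential_comparison(trial_stream, alpha=0.05, max_trials=10_000):
    """trial_stream yields (score_A, score_B) pairs with scores in [0, 1]."""
    total_diff = 0.0
    n = 0
    for n, (score_a, score_b) in enumerate(trial_stream, start=1):
        total_diff += score_b - score_a          # paired difference in [-1, 1]
        mean_diff = total_diff / n
        # Hoeffding radius with per-n error budget alpha / (n * (n + 1)).
        # These budgets sum to alpha over all n, so the test remains valid
        # (Type 1 Error <= alpha under H0) no matter when it stops.
        radius = math.sqrt(2.0 * math.log(n * (n + 1) / alpha) / n)
        if mean_diff > radius:
            return "B is better than A", n       # reject H0 at level alpha
        if n >= max_trials:
            break
    return "no significant difference detected", n


# Hypothetical usage with simulated Bernoulli success indicators.
def simulated_trials(p_a=0.70, p_b=0.80):
    while True:
        yield float(random.random() < p_a), float(random.random() < p_b)


print(sequential_comparison(simulated_trials(), alpha=0.05))

The union-bound allocation here is deliberately crude; the gap between what such simple tests require and what is information-theoretically necessary is exactly what the second item is meant to characterize.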