I am constantly looking for talented and motivated students willing to undertake a research project. The non-exhaustive list below contains propositions of baby projects, tuned to be both:
achievable by a bachelor/master student (e.g. as their thesis)
having scientific value (e.g. publishable in the case of a positive outcome)
Contact me: pmilos (at) mimuw.edu.pl or visit my website.
Successful projects so far:
Subgoal Search For Complex Reasoning Tasks - NeurIPS 2021; master project of Marek Zbysiński
Trust, but verify: model-based exploration in sparse reward environments - IJCNN 2021, DRL Workshop, NeurIPS 2020; master project of Michał Izworski
Google football project - undergrad student project co-advised with members of GoogleBrain@Zurich, 2019-2020. The work was presented at BayLearn 2020
Developmentally motivated emergence of compositional communication via template transfer - NeurIPS 2019, Emergent Communication: Towards Natural Language Workshop; master thesis of Tomasz Korbak
Expert-augmented actor-critic for ViZDoom and Montezuma's Revenge - NeurIPS 2018, Deep RL Workshop; master thesis of Michał Garmulewicz
Update 29-10-2021: Due to time constraints this website is updated rarely. There are many more projects available. Here are my ever-growing list1 and list2. These are rather untidy private notes, not guaranteed to be readable. If you are interested, just contact me for a chat!
The value function is quintessential to model-free methods, but can it do more? In this project, we propose to treat the value function as a generative model that predicts future states.
A recent paper advocates merging generative and discriminative architectures. It offers a compelling, simple hybrid architecture, which is superior in both classification and generation.
We adapt these observations to the RL setting. For simplicity, consider the Sokoban domain. Given an input, the value function (VF) predicts the distance (= number of steps) to the solution. We conjecture that the VF can be used to generate inputs (states of the Sokoban game) at a given distance.
This project aims to verify this hypothesis. If successful, further research will study its usefulness for downstream tasks (e.g. predicting intermediate states in planning, uncertainty estimation, exploration).
We propose several approaches:
a direct adaptation of energy-based methods
using backpropagation through the value network, akin to adversarial-example methods (see the sketch after this list)
extensions of energy-based methods to continuous predictions
exploring possible connections to successor representations
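A minimal sketch of the second approach, assuming a value network trained on 7-channel one-hot 10x10 Sokoban boards (the encoding and the untrained placeholder network are assumptions): we run gradient descent on a relaxed board so that the predicted distance matches a target.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained value network over 7-channel one-hot 10x10 boards;
# in the project this would be the VF obtained from RL training.
value_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 10 * 10, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

def generate_state_at_distance(target_distance, steps=200, lr=0.1):
    # Optimize per-cell tile logits so that the predicted distance hits the target.
    logits = torch.randn(1, 7, 10, 10, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        board = torch.softmax(logits, dim=1)                        # relaxed (soft) board
        loss = (value_net(board) - target_distance).pow(2).mean()   # push prediction to target
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logits.argmax(dim=1)                                     # discrete board proposal

proposal = generate_state_at_distance(target_distance=12.0)
```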
Exploration is a central topic in reinforcement learning. For example, if the agent explores too little, it is likely to converge to sub-optimal solutions. If exploration is too aggressive, it might break consistency and make it impossible to discover subtle policies. There are algorithms with theoretical guarantees like E3 or Rmax; however, they can rarely be used in practice. Many solutions have been proposed. In this proposal, we are interested in those which issue auxiliary exploration rewards (e.g., VIME, Curiosity-driven exploration). Usually, they measure novelty quantitatively (i.e., as a scalar) and add this scalar as an auxiliary reward. We observe that:
this violates temporal stationarity, as the auxiliary rewards change over time (a disadvantage)
using multi-step discounting (standard in any RL algorithm), one can achieve consistent long-term exploration (an advantage)
In this project, we propose how to keep the benefit of the second observation while mitigating the shortcoming of the first. We are inspired by methods introduced recently in Phasic Policy Gradients. More precisely, we would use two separate heads, one for 'normal' rewards and the other for 'auxiliary' ones. The auxiliary head would be trained off-policy, using the protocol proposed in [Phasic]. The auxiliary head can be trained with a different learning rate, enabling faster adaptation to the changing rewards.
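A minimal architectural sketch of the two-head idea, assuming a vector observation (dimensions, names, and the separate learning rates are illustrative assumptions, not the project's final design):

```python
import torch
import torch.nn as nn

class TwoHeadValueNet(nn.Module):
    """Shared torso with separate value heads for task and exploration rewards."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.task_value = nn.Linear(hidden, 1)      # 'normal' rewards
        self.explore_value = nn.Linear(hidden, 1)   # auxiliary (novelty) rewards

    def forward(self, obs):
        h = self.torso(obs)
        return self.task_value(h), self.explore_value(h)

net = TwoHeadValueNet(obs_dim=8)
# Separate optimizers allow a higher learning rate for the non-stationary auxiliary head.
task_opt = torch.optim.Adam(list(net.torso.parameters()) + list(net.task_value.parameters()), lr=3e-4)
explore_opt = torch.optim.Adam(net.explore_value.parameters(), lr=1e-3)
```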
Generalization is perhaps the most important concept in machine learning. In a sense, it distinguishes learning from 'pure, reflection-less' memorization. Arguably, the immense success of modern neural networks stems exactly from their unique generalization properties. Despite research efforts, this phenomenon is still only vaguely understood. It has been suggested that planning methods can lead to generalization, interestingly also in the out-of-distribution setting. Broadly speaking, generalization can be perceived as a computational process which, conditioned on input task examples, can solve new tasks (possibly even ones not directly related).
In this project, we aim to investigate generalization on the example of autonomous driving maneuvers delivered by the recent CRTS dataset. CRTS provides a diverse set of driving scenarios and is a testbed for generalization: the scenarios are partitioned into mutually exclusive training (T) and evaluation (E) subsets.
One can imagine (at least) three settings:
pure model-free RL training, in which generalization is due solely to the policy neural network
planning with a learned model, an often-attempted method; a neural-network model is hopefully able to generalize and thus provide realistic data for a planner, and planning may be able to compensate for some model errors
planning with an exemplar model, in which the "model" is merely the dataset and the whole computational effort of generalization is deferred to the planner
In this project, we would like to concentrate on the third setting. Our scientific goal is to assess the efficiency of this method and provide an interpretable computational model of generalization. Notably, in its simplest version, no neural networks are used, and the whole computational process is online and can be 'manually' reviewed.
More precisely, for simplicity, in the first version we fix the planner to be the random-shooting MPC method (possibly with cross-entropy regularisation), similar to the setup here. A crucial ingredient of the planner is a model. This will be based on examples from CRTS: the behavior of all other vehicles will be replayed from a CRTS trajectory, while the ego-vehicle is controlled by the CARLA engine. The experimental protocol will test the planner's efficiency on the evaluation examples of CRTS, E, while the planner is allowed to use only training scenarios, T.
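For reference, a minimal random-shooting MPC loop; the model interface env_model.step(state, action) -> (next_state, reward, done) is an assumption standing in for the replay-based exemplar model:

```python
import numpy as np

def random_shooting_mpc(env_model, state, action_space, horizon=20, n_candidates=500):
    """Sample random action sequences, roll them out in the model,
    and return the first action of the best-scoring sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = [action_space.sample() for _ in range(horizon)]
        total, s = 0.0, state
        for a in actions:
            s, r, done = env_model.step(s, a)   # assumed model interface
            total += r
            if done:
                break
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```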
There are many implementational details to be fixed during the project. Perhaps the most important one is which examples to take and how many to use. Imagine some metric that measures the distance between two road situations (the crudest might be the Jaccard similarity of the bird's-eye view representations). We pick the training scenario closest to the current road situation and replay it during planning.
Other metrics can be easily devised, and many trajectories might be used, possibly also mixed (in the way suggested here). Finding the best setup is an open question to be studied during the project; a sketch of the crudest retrieval scheme follows.
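A sketch of the crudest variant, assuming scenarios are summarised as boolean bird's-eye-view occupancy grids (this representation is an assumption):

```python
import numpy as np

def jaccard_similarity(bev_a, bev_b):
    """Jaccard similarity of two boolean bird's-eye-view occupancy grids."""
    a, b = bev_a.astype(bool), bev_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

def closest_training_scenario(current_bev, training_bevs):
    """Index of the training scenario whose BEV is most similar to the current one."""
    sims = [jaccard_similarity(current_bev, t) for t in training_bevs]
    return int(np.argmax(sims))
```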
Unsupervised techniques have been percolating through machine learning; the most obvious and common example is auto-encoders. In reinforcement learning, there are interesting uses of contrastive losses. Another work related to this project, PVE, is an attempt to extract features of a physical scene using common-sense physical priors. An example of such a prior is that the features, e.g., positions, change neither too fast nor too slow.
We conjecture that analogs of these priors might be beneficial for Q-learning. For example, we can easily paraphrase the above prior: the Q-function should not change too abruptly, but it also should not stagnate.
In the project, we will verify this hypothesis. We will start with the above example and check if it speeds up learning on RL benchmarks (Atari and MuJoCo environments).
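One possible (assumed) formalisation of this prior as an auxiliary loss, to be added to the usual TD objective; the thresholds are placeholders:

```python
import torch
import torch.nn.functional as F

def slowness_prior_loss(q_t, q_tp1, max_change=1.0, min_change=0.01):
    """Penalize Q changing more than `max_change` between consecutive states
    (too fast) and less than `min_change` (stagnation)."""
    diff = (q_tp1 - q_t).abs().mean()
    too_fast = F.relu(diff - max_change)
    too_slow = F.relu(min_change - diff)
    return too_fast + too_slow

# Example with Q-values for two consecutive states over a small batch.
q_t, q_tp1 = torch.tensor([1.0, 2.0]), torch.tensor([1.2, 3.5])
print(slowness_prior_loss(q_t, q_tp1))
```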
The human mind has a fantastic ability to imagine new situations based on experience. A novel family of reinforcement learning methods called episodic RL uses an associative memory buffer to store experience in the form of tuples: (state representation, action taken, cumulative reward at the end of the episode). We want to investigate the possibility of leveraging experience from the memory buffer to imagine novel trajectories and situations. The new method should result in more sample-efficient training of the agent.
The project would start with implementing the Neural Episodic Control paper and training it as a baseline on OpenAI Gym classic control, MuJoCo, or Atari 2600 games. The point of this phase is to familiarise ourselves with the episodic-RL algorithm.
In the main phase of the project, we will explore ways to use the memory buffer as a world model and use it to generate more experience. This experience will be used along with the "real" experience to train an RL agent (in a Dyna-like manner).
A simulated trajectory will be produced sequentially. Given the current last state of the trajectory, a new transition (the next state for a state-action pair) will be sampled from the memory buffer. The first idea is to sample past transitions from the memory buffer according to a distribution proportional to the similarity of states (note that similarity of states is at the core of the associative memory used in episodic RL). The second idea is to sample novel transitions from some probabilistic model (e.g. a mixture of Gaussians) fitted to past transitions from the memory buffer, where similarity to the current last state is treated as a probability weight.
Thereby, it should be possible to imagine novel trajectories and states by mixing similar transitions met in the real world. This way, we might be able to bring the sample efficiency of model-based methods into already sample-efficient episodic-RL methods, without the drawbacks of world models based on function approximators, which often introduce significant approximation error.
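A minimal sketch of the first idea (the Gaussian kernel on Euclidean distance and the tuple layout are assumptions):

```python
import numpy as np

def sample_imagined_transition(memory, state, temperature=0.1):
    """Sample a stored transition with probability proportional to state similarity.
    `memory` is a list of (state_repr, action, next_state, reward) tuples."""
    states = np.stack([m[0] for m in memory])
    dists = np.linalg.norm(states - state, axis=1)
    weights = np.exp(-dists / temperature)          # similarity kernel (an assumption)
    probs = weights / weights.sum()
    idx = np.random.choice(len(memory), p=probs)
    return memory[idx]

# Tiny usage example with a two-dimensional state representation.
memory = [(np.array([0.0, 0.0]), 0, np.array([0.1, 0.0]), 1.0),
          (np.array([5.0, 5.0]), 1, np.array([5.0, 5.1]), 0.0)]
print(sample_imagined_transition(memory, np.array([0.1, 0.1])))
```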
Related work:
Reinforcement Learning, Fast and Slow in Trends in Cognitive Sciences.
Dyna: Integrated Planning, Acting, and Learning in Reinforcement Learning: An Introduction 2nd edition by Sutton and Barto.
The project proposed by P. Januszewski.
Planning is an important topic in reinforcement learning studies. Loosely speaking, it proposes to analyze 'the future' to choose an appropriate action. To this end, it requires a model of the environment to unroll 'the future'. In this note, we assume the model is given.
We propose to do the above analysis using a model-free algorithm. Such an approach has already been introduced in the case of TD-learning, see here. There is no fundamental obstacle, though, to using other algorithms.
In this example, we will learn both a policy and a value function and use A3C. The learning process is performed on two levels, global and local. The local learning (using A3C) takes the global policy as a prior and trains a new policy with a small/local time horizon T. This policy is used to generate the next action at the global level (during local learning we use the global value function to bootstrap the remaining rewards after time T). The trajectories at the global level are used to train the global policy and value function.
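The key quantity in the local phase is the bootstrapped return; a tiny sketch (the discount factor and the numbers are illustrative):

```python
def local_return(rewards, global_value_at_T, gamma=0.99):
    """Discounted return over the local horizon T, bootstrapped after time T
    with the global value function's estimate."""
    ret = global_value_at_T
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# e.g. three local rewards and a global value estimate V(s_T) = 5.0
print(local_return([0.0, 1.0, 0.0], 5.0))
```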
There are a number of choices whose performance will be evaluated during the project.
Choices:
algorithm for planning (DQN, A3C, DDPG, etc.)
architecture for local/global learning. In the simplest case, they are the same. However, it might be desirable that local learning uses a smaller (possibly much smaller) network.
input for local learning. It is not necessarily beneficial that local planning uses the same input. An extreme example is when inputs are just indicators of states. In such a case, TD-learning might reduce to Monte Carlo Tree Search!
using the global policy/values as a prior is rather necessary, but it might be beneficial that local planning 'does not trust them too much', especially the policy.
training the global policy/value could use only the global trajectories, but it might be beneficial to also utilise the local choices.
Bayesian neural networks have many desirable theoretical properties, the most important one being the ability to handle uncertainty in a principled way. So far, training big Bayesian neural networks has proven to be very challenging. Nevertheless, for smaller models, the results are auspicious; see e.g. this paper for a recent advancement in reducing the number of parameters.
Exploration is one of the core topics in reinforcement learning. Parameter space noise for exploration (PSNE) proposes to blend:
off-the-shelf RL algorithms
perturbations of weights of neural-networks approximators
The latter is aimed at providing structured exploration. This approach has many similarities with evolutionary strategies and ... Bayesian methods.
In the project, we propose to explore the intersection of Bayesian learning and parameter-space exploration. From the technical point of view, this will involve making the noise variance a learnable parameter (in contrast to the adaptive heuristic used in PSNE). In the second stage of the project, we would like to introduce more learnable noise parameters (PSNE uses only one). Experiments in the paper suggest that, in supervised learning tasks, performance degrades significantly when using only one noise parameter.
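A minimal sketch of what a layer with learnable noise parameters could look like (the per-weight parameterisation and initialisation are our assumptions; PSNE itself uses a single, adaptively tuned scale):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weight perturbation scale is a trainable parameter."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.log_sigma = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        eps = torch.randn_like(self.weight)
        noisy_weight = self.weight + self.log_sigma.exp() * eps   # reparameterization trick
        return F.linear(x, noisy_weight, self.bias)

layer = NoisyLinear(8, 4)
out = layer(torch.randn(2, 8))   # each forward pass uses a fresh perturbation
```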
Having Bayesian networks as approximators in RL is interesting not only because of exploration. For example, we speculate that they might better accommodate the moving learning targets naturally present in the RL training process.
Driving in realistic congested situations is still an unresolved problem in autonomous driving research. A recent paper presents an interesting approach to the problem. Firstly, it contains a dataset of real traffic (see the picture for its processed representation). Secondly, the authors train a policy using a model-based approach. Their approach has a few desirable properties, among which are a) taking the uncertainty of the model into consideration, b) an evaluation protocol which, in particular, tests for generalization to unknown scenarios.
In this project, we propose to test two alternative approaches at the extremes of the spectrum:
using pure model-free training
using a model-predictive planner
The goal of the first topic is to quantify how much of driving can be learned merely from a multitude of trials. The major challenge in this task would be to obtain generalization (i.e., good behavior on unseen scenarios). Broadly speaking, generalization is still an open question in model-free reinforcement learning. In the case of automotive tasks, generalization may help to bridge the so-called sim-to-real gap (observed, for example, in the paper). To alleviate the task, one may consider including (epistemic) uncertainty measurements in training.
In the second task, we plan to test model-based planning techniques: model-predictive control using random shooting, and MCTS. Given a perfect model, these methods may ensure safety guarantees, much needed in automotive applications! The primary scientific goal of this task is to measure how robust the proposed planning techniques are to model errors. Again, we may use uncertainty measurements to support this (e.g., disincentivize the planner from entering situations where the model is "uncertain").
Self-imitation learning (SIL) is a form of reinforcing particularly lucky/beneficial trajectories, which has shown impressive results in hard-exploration tasks. In a broader picture, one can perceive the inability to capture lucky but rare behaviour as a consequence of the widespread use of gradient-based methods. This problem has been noticed and analysed in many works. Particularly interesting is an attempt to implement episodic memory in control problems (see there also for a description of neuroscience motivations).
In this project, we propose a simple extension of the SIL method: namely, to use Q-learning instead of the PPO algorithm. We speculate that such a change might be beneficial, as the SIL method uses a replay buffer and is inherently off-policy.
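A minimal sketch of what the self-imitation term could look like with Q-learning (this is our assumed adaptation of the SIL value loss, not the original formulation): only buffer returns that exceed the current Q estimate contribute.

```python
import torch

def sil_q_loss(q_values, returns):
    """Push Q up only towards 'lucky' returns, i.e. those above the current estimate."""
    advantage = (returns - q_values).clamp(min=0.0)
    return 0.5 * advantage.pow(2).mean()

q = torch.tensor([1.0, 2.0, 3.0])   # current Q(s, a) estimates from the replay buffer
R = torch.tensor([2.5, 1.0, 4.0])   # observed returns; only the 1st and 3rd count
print(sil_q_loss(q, R))
```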
The workhorse of AlphaZero was an MCTS-like algorithm. It contains quite a few improvements over the vanilla MCTS approach. One of these is using a policy network to "guide" the planning search. Intuitively it can be understood as a rough ranking of actions, which allows discarding some search directions at the early stages of planning. This is beneficial, especially in cases when the environment has many actions (and thus, the search tree grows rapidly with the depth).
In AlphaZero, the policy was trained with behavioral cloning. In this project, we want to explore other training approaches. As the first step, we plan to utilize a combination of the recently-proposed advantage-weighted regression and the bootstrap loss introduced in uncertainty-sensitive planning.
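A sketch of the advantage-weighted regression policy objective (the temperature and clipping values are assumptions):

```python
import torch

def awr_policy_loss(log_probs, advantages, beta=1.0):
    """Weight the log-likelihood of taken actions by exponentiated advantages,
    so that better-than-expected actions are imitated more strongly."""
    weights = torch.exp(advantages / beta).clamp(max=20.0)   # clip for stability
    return -(weights.detach() * log_probs).mean()

# Tiny usage example with two logged actions.
loss = awr_policy_loss(torch.log(torch.tensor([0.3, 0.6])), torch.tensor([0.5, -0.2]))
print(loss)
```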
This project is aimed to be at the intersection of four topics:
auxiliary tasks
planning
uncertainty estimation
successor representations.
Successor representations (SR) were originally proposed as a way of improving generalization for TD-learning. They can be viewed as a compressed model of the environment. By design, SR can be used to retrieve a value function via a linear combination. As a simple consequence, they can represent many value functions, related to multiple tasks (or reward structures). Apart from that, SR have recently been proposed as a light-weight mechanism for expressing Bayesian uncertainty.
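For concreteness, the standard definitions in our notation: with state features \phi, discount \gamma and task weights w such that r(s) \approx \phi(s)^\top w,

\psi^\pi(s) = \mathbb{E}_\pi\Big[\sum_{t \ge 0} \gamma^t \phi(s_t) \,\Big|\, s_0 = s\Big], \qquad V^\pi(s) = \psi^\pi(s)^\top w,

so a new task (reward structure) only requires new weights w, while \psi^\pi is reused.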
In this project, we would like to combine SR with recent advancements in uncertainty-sensitive planning. This work uses Monte Carlo tree search combined with value functions. In the first step, we will utilize their setup, but we will represent the value function using an SR-based linear Bayesian framework. We expect that this will give a performance boost due to improved exploration, but without a heavy computational burden.
In the second step, we will further augment the training with auxiliary tasks. We note that this is particularly easy for SR-based techniques as any task amounts to adding a simple linear transformation.
Generalized/universal value functions (GVF) were introduced in their current incarnation in the celebrated Horde paper. They are intended to be a formalism for encoding vast general knowledge about the environment surrounding an agent. (This is in contrast to the vanilla MDP setup in RL, which concentrates on a single task.)
These methods have gained much momentum (though sometimes quite implicitly) with the introduction of the hindsight technique.
In this project, we propose to express generalized value functions in a factored way. To this end, we need successor representations (SR), another related concept. In a nutshell, given successor representations, one can represent "any" value function via a (possibly learnable) linear transformation.
Imagine a robotic arm tasked with reaching various goals. The goals might be thought of as a parametrization of the GVF. In the first step, we propose to learn two neural networks, one for the estimation of successor representations and the other for calculating the coefficients of the linear transformations corresponding to goals.
In this way, we hope to achieve both reusable features given by the SR and a semantically meaningful goal encoding given by the coefficients.
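A minimal sketch of the proposed factorisation (network sizes and names are placeholders):

```python
import torch
import torch.nn as nn

class FactoredGVF(nn.Module):
    """One network predicts successor features of (state, action); another maps a
    goal to linear coefficients; their dot product gives the goal-conditioned value."""
    def __init__(self, obs_dim, act_dim, goal_dim, sf_dim=64):
        super().__init__()
        self.successor = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, sf_dim))
        self.goal_weights = nn.Sequential(nn.Linear(goal_dim, 128), nn.ReLU(), nn.Linear(128, sf_dim))

    def forward(self, obs, act, goal):
        psi = self.successor(torch.cat([obs, act], dim=-1))   # successor features
        w = self.goal_weights(goal)                           # goal-specific coefficients
        return (psi * w).sum(dim=-1)                          # value as a dot product

gvf = FactoredGVF(obs_dim=10, act_dim=4, goal_dim=3)
q = gvf(torch.randn(2, 10), torch.randn(2, 4), torch.randn(2, 3))
```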
Meta-learning and multitask learning have gained mainstream importance (see, for example, a new benchmark suite); they are hoped to be a way to scale up reinforcement learning methods. Many algorithms have been proposed, including MAML, REPTILE, and PEARL (see also a recent work with a critical discussion).
Another related area of research is continual learning. In a recent work, the authors propose to use so-called hypernetworks in this area. In a nutshell, hypernetworks are networks which generate the weights of other networks that solve downstream tasks (they have been successfully used for a variety of sequence modelling tasks).
In this project, we will explore the natural idea of using hypernetworks for multitask/meta learning. In the simplest case, we will train a hypernetwork to output parameters for various reinforcement learning tasks based on task embeddings. There are many further interesting research directions, like automatic inference of the task embedding using task-dependent predictive modeling, jointly learning the network and the embedding, exploring the mapping of the embedding space to the task space, or applying meta-learning algorithms on the level of the hypernetwork.
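A minimal sketch, assuming a discrete-action linear policy head whose weights are generated from a task embedding (all sizes and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PolicyHypernetwork(nn.Module):
    """Map a task embedding to the weights and biases of a small linear policy head."""
    def __init__(self, emb_dim, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.generator = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim * act_dim + act_dim),   # weights + biases
        )

    def forward(self, task_embedding, obs):
        params = self.generator(task_embedding)
        W = params[: self.obs_dim * self.act_dim].view(self.act_dim, self.obs_dim)
        b = params[self.obs_dim * self.act_dim:]
        return obs @ W.t() + b   # task-specific policy logits

hyper = PolicyHypernetwork(emb_dim=16, obs_dim=8, act_dim=4)
logits = hyper(torch.randn(16), torch.randn(8))
```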
This topic has been suggested by M. Zając.
Many reinforcement learning MDPs can quite naturally be decomposed into a set of (semi-)independent tasks. Take, for example, the celebrated Atari game Montezuma's Revenge: each room constitutes a separate challenge. Alternatively, in a robotic task, grasping an object is often qualitatively different from placing it. Following this observation, a recent work, MEMENTO, proposed to break down the training procedure accordingly. Following Montezuma's example, each time new points are collected, a new context is initiated. Each context has its own separate neural network, initialized randomly.
Interestingly, the obtained results are much better, arguably because the agent forgetting all its skills lets it be more 'open-minded' in the next room. This mechanism, even though effective, seems too crude. In the project, we are going to investigate other methods of multitask learning which retain some knowledge (e.g., it is quite plausible that retraining the convolutional part of the network is wasteful). We propose the following methods: a) using context as an additional input, b) using progressive networks, c) using gradient surgery, d) or casting it into a meta-learning framework, e.g. guided meta-policy search.
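A minimal sketch of option a), assuming stacked 84x84 Atari frames and a one-hot room/context id (architecture and sizes are assumptions): the convolutional torso is shared across contexts, so visual knowledge is retained.

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Shared conv torso; the context id is concatenated before the policy head."""
    def __init__(self, n_contexts, n_actions):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(4, 16, 8, stride=4), nn.ReLU(), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(16 * 20 * 20 + n_contexts, 256), nn.ReLU(),   # 20x20 assumes 84x84 frames
            nn.Linear(256, n_actions),
        )

    def forward(self, frames, context_onehot):
        h = self.conv(frames)
        return self.head(torch.cat([h, context_onehot], dim=-1))

policy = ContextConditionedPolicy(n_contexts=24, n_actions=18)
logits = policy(torch.randn(1, 4, 84, 84), torch.eye(24)[:1])
```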
This topic falls into the broad field of unsupervised reinforcement learning. Intuitively, we would like to learn tasks without the proper RL reward. The rationale behind this is that the usual RL reward definition is often too restrictive, making training difficult or leading to suboptimal solutions.
One of the ways to improve the RL pipeline is by augmenting it with auxiliary tasks (see the paper for one of the early examples of this technique and the figure below). Defining the auxiliary tasks might itself be difficult and would best be performed in an unsupervised way. Here we propose to take features of the neural network as the reward signal.
Concretely, in the first experiment we borrow a network (figure above) from a recent work on successor representations. This work concentrates on predicting features; we instead propose to control them. This can be achieved, for example, by multi-task learning.
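A minimal sketch of how network features could define auxiliary rewards, one control task per feature (this formalisation is an assumption):

```python
import torch

def feature_control_rewards(features_t, features_tp1):
    """Auxiliary rewards: the change of each feature between consecutive states.
    Maximising the k-th reward means learning to control the k-th feature."""
    return (features_tp1 - features_t).detach()

# e.g. a 4-dimensional feature vector before and after a transition
print(feature_control_rewards(torch.tensor([0.1, 0.5, 0.0, 1.0]),
                              torch.tensor([0.3, 0.5, 0.2, 0.9])))
```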
Added 1 Dec 2019. A possible intuitive understanding of meta reinforcement learning is that it is a process of learning fast RL algorithms applicable to some family of tasks. Put otherwise, one can argue that the MDP formalisation is too broad and thus precludes the existence of efficient general algorithms. This project is inspired by two, almost simultaneous, papers (click the pictures). Both works pursue the intriguing idea that an RNN module will learn a fast reinforcement learning algorithm tweaked to a family of tasks. The RNN module also serves as memory.
In this project we plan to decouple the algorithmic and memory parts and endow the memory with a strong inductive bias. We speculate that this will make learning faster and will allow planning.
Concretely, in the first experiment we plan to mimic an experiment with ViZDoom mazes. In this task the environment is partially observable. We will add a memory structured as a two-dimensional rectangle (a sheet of paper) on which the agent will be able to draw. Besides the basic navigation task, we will add an auxiliary task of reconstructing the map of the maze.
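A minimal sketch of such a memory, assuming the agent's position is mapped to sheet coordinates (sizes and the read/write interface are assumptions):

```python
import numpy as np

class SheetMemory:
    """A 2D 'sheet of paper': the agent writes a small patch at its current (x, y)
    position and reads back a local crop."""
    def __init__(self, height=64, width=64, channels=4):
        self.sheet = np.zeros((channels, height, width), dtype=np.float32)

    def write(self, x, y, patch):
        c, ph, pw = patch.shape
        self.sheet[:, y:y + ph, x:x + pw] += patch   # additive 'drawing'

    def read(self, x, y, size=8):
        return self.sheet[:, y:y + size, x:x + size].copy()

mem = SheetMemory()
mem.write(10, 12, np.ones((4, 3, 3), dtype=np.float32))
crop = mem.read(8, 10)
```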
Added 2 Dec 2019. Arguably, the basic distinction in reinforcement learning is the one between model-free and model-based methods. The former use reactive policies, the latter plan with a model. Both have their strengths and weaknesses: for example, model-free methods usually achieve better asymptotic performance, while model-based methods excel in sample efficiency.
Not surprisingly, there are a number of attempts to join the two. For example, one may include a "planning-like" structure in the neural network architecture (see the figure below). Another approach is to construct a system which imagines possible actions and analyses their consequences. Possibly the most extreme approach is learning a planning module from scratch.
In this project we propose an intermediate solution (compared to model-free planning). Namely, we propose to endow a model-free agent with an additional action: calling a parametrisable planner. Intuitively, this is asking for advice on what to do to achieve a specific goal. Say we consider the particular example of navigation; such a planner would be able to provide segments of intermediate trajectories.
In principle, a planner could solve the whole task; this is undesirable, as the planner's quality depends strongly on the quality of the model and on the computational cost. We conjecture that a model-free agent can compensate for these weaknesses and learn to use planners efficiently, even when provided with a small computational budget.
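A minimal sketch of the augmented action space as a Gym-style wrapper (the planner interface and the wrapper itself are assumptions, not a fixed design):

```python
class CallPlannerWrapper:
    """Add one extra action to a discrete-action env: 'ask the planner for advice'.
    `planner(env, goal)` is assumed to return a sequence of primitive actions."""
    def __init__(self, env, planner, goal):
        self.env, self.planner, self.goal = env, planner, goal
        self.n_actions = env.action_space.n + 1   # last index = call the planner

    def step(self, action):
        if action == self.n_actions - 1:
            total_reward, obs, done, info = 0.0, None, False, {}
            for a in self.planner(self.env, self.goal):   # execute the planner's advice
                obs, r, done, info = self.env.step(a)
                total_reward += r
                if done:
                    break
            info["planner_used"] = True
            return obs, total_reward, done, info
        return self.env.step(action)
```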
Added 4 Dec 2019. The ability to coordinate is arguably a key factor behind the success of humankind. Not surprisingly, it is also an important topic for the reinforcement learning community. There are some recent successes, such as the capture-the-flag challenge or learning the ability to use tools. They were obtained using a very high number of training samples (e.g. 0.5 billion episodes) and various training tricks.
In this project we will tackle more modest tasks, but with the idea of developing a more principled approach. To this end we will develop a suite of coordination tasks of increasing difficulty. We will start with tasks known from psychological research (e.g. the double pong task or herding phenomena). Ultimately, our suite will contain tasks both requiring and not requiring various forms of language communication.
The project will be run jointly with J. Rączaszek-Leonardi and J. Zubek (Faculty of Psychology).
Added 4 Dec 2019. When dealing with challenging new RL domains, it is helpful to develop tools for addressing strategic and tactical decision-making. These two perspectives complement each other: the strategic perspective is global, static, and often imprecise, while the tactical perspective is local, dynamic and exact. We argue that the value function can be considered an example of the former, and a planner an example of the latter. Indeed, value function approximators provide noisy estimates (imprecise) of values (static) for every state (global). Conversely, planning provides optimal control, which, starting from a given state (local), generates actions (dynamic) that are temporally coherent and result in better-executed trajectories (exact). It is only reasonable to combine the two into one system. This is achieved by plugging the value function into the planner, to provide guiding heuristics in the local search, shorten the search horizon, and make the search computationally efficient. For complex problems, the above setup has to be supplemented by an exploration strategy. Such a strategy might be based on modeling the uncertainty of the value function approximation (e.g., using dropout sampling, variational inference, distributional RL, a bootstrap ensemble, etc.). The uncertainty is quantified by a risk measure, which is then utilized by the planner to guide exploration.
Recently, a version of an MCTS algorithm has achieved super-human performance in notoriously hard games, such as chess or Go. In the context of single-player games, the so-called LevinTS algorithm has attracted some attention. The goal of the project is to fuse ideas from the two aforementioned search algorithms. Additionally, it is hypothesized that the resulting method can be further improved by a suitable choice of exploration strategy.
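One simple example of a risk measure such a planner could use to guide exploration (the bootstrap-ensemble source of the value samples is an assumption; any uncertainty estimator could be plugged in):

```python
import numpy as np

def optimistic_value(value_samples, kappa=1.0):
    """Mean plus kappa standard deviations of value estimates; higher kappa
    means more optimism in the face of uncertainty."""
    samples = np.asarray(value_samples)
    return samples.mean() + kappa * samples.std()

# e.g. value estimates of one state from four bootstrap heads
print(optimistic_value([1.0, 1.5, 0.8, 1.2]))
```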
The project will be overseen by Łukasz Kuciński (IMPAN).
Added 5 Dec 2019. Balancing exploration and exploitation is a core challenge in reinforcement learning. Exploration increases the agent's knowledge, but it costs! Decreasing the exploration level slowly is costly in terms of training time (sample efficiency); decreasing it fast usually leads to suboptimal behaviours.
The situation gets more difficult in the multi-task setting, where various tasks require various exploration schedules. In fact, even a single environment can often be decomposed into a set of sub-tasks which should be explored at different speeds.
In this project we plan to study this phenomenon and evaluate possible solutions. In the first phase we will construct and implement environments which we expect to require different exploration levels (e.g. a maze-world with a bottleneck). In the second phase, we will run various algorithms, observing the behaviour of exploration. We expect that on the hardest environments learning will fail. In the third phase we will develop and test neural network architectures capable of using various exploration levels (e.g. similar to the one used for hierarchical learning, with different noise for each \phi_i).
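A minimal sketch of an architecture with a separate learnable noise level per feature head \phi_i (the way the heads are combined below is an arbitrary choice for illustration):

```python
import torch
import torch.nn as nn

class MultiNoiseHeadPolicy(nn.Module):
    """Several feature heads, each with its own learnable exploration-noise scale,
    so different sub-tasks can be explored at different levels."""
    def __init__(self, obs_dim, act_dim, n_heads=4, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))
            for _ in range(n_heads)
        ])
        self.log_sigma = nn.Parameter(torch.zeros(n_heads))   # per-head noise level

    def forward(self, obs):
        outs = []
        for head, log_s in zip(self.heads, self.log_sigma):
            mean = head(obs)
            outs.append(mean + log_s.exp() * torch.randn_like(mean))
        return torch.stack(outs).mean(dim=0)   # combine heads (one simple choice)

policy = MultiNoiseHeadPolicy(obs_dim=6, act_dim=2)
action = policy(torch.randn(1, 6))
```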
Project joint with B. Osiński (University of Warsaw and deepsense.ai)
Added 5 Dec 2019