Student's research projects

I am constantly looking for talented and motivated students willing to endeavour at a research projects. The non-exclusive list below contains proposition of baby projects, tuned both to be:

  • achievable by a bachelor/master student (e.g. as their thesis)
  • having scientific value (e.g. publishable in the case of a positive outcome)

Contact me: pmilos (at) or visit my website.

Guiding planning with polices

The workhorse of AlphaZero was an MCTS-like algorithm. It contains quite a few improvements over the vanilla MCTS approach. One of these is using a policy network to "guide" the planning search. Intuitively it can be understood as a rough ranking of actions, which allows discarding some search directions at the early stages of planning. This is beneficial, especially in cases when the environment has many actions (and thus, the search tree grows rapidly with the depth).

In AlphaZero the policy was trained with behavioral cloning. In this project, we want to explore other training approaches. In the first step, we plan to utilize the combination of a recently-proposed advantage-weighted regression and bootstrap loss introduced in uncertainty-sensitive planning.

Added 15 Jan 2020

Planning with uncertain auxiliary tasks

This project is aimed to be at the intersection of four topics:

  • auxiliary tasks
  • planning
  • uncertainty estimation
  • successor representations.

Successor representations (SR) have been originally proposed a way of improving generalization for TD-learning. They can be viewed as a compressed model of the environment. By design, SR can be used to retrieve a value function using linear combinations. As a simple consequence, they can represent many value functions, related to multiple tasks (or reward structures). Apart for that, SR have been recently proposed as a light-weight mechanism for expressing Bayesian uncertainty.

In this project, we would like to combine with recent advancements in uncertainty-sensitive planning. This work is using Monte-Carlo tree search combined with value functions. In the first step, we will utilize their setup, but we will represent the value function using SR based linear Bayesian framework. We expect that this will give a boost of performance due to improved exploration, but without a heavy computational burden.

In the second step, we will further augment the training with auxiliary tasks. We note that this is particularly easy for SR-based techniques as any task amounts to adding a simple linear transformation.

Added 9 Jan 2020

Factorised generalized value function

Generalized/Universal value functions (GVF) has been introduced in the current incarnation in the celebrated Horde paper. They are intended to be a formalism for encoding vast general knowledge about the environment surrounding an agent. (This is in contrary with the vanilla MDP setup in RL, which concentrates on a single task).

These methods have gained much momentum (though sometimes quite implicitly) with the introduction of the hindsight technique.

In this project, we propose to express generalized value function in a factored way. To this end, we need successor representations (SR), another related concept. In a nutshell, given successor representations, one can represent "any" value function via (possibly learnable) linear transformation.

Imagine a robotic arm is tasked with reaching various goals. The goals might be thought of as a parametrization of GVF. In the first step, we propose to learn two neural networks, one for estimation of successor representations and the other for calculating the coefficients of linear transformations corresponding to goals.

In this way, we hope to achieve both useful reusable features given by SR and semantically meaningful goal encoding given by coefficients.

Added 9 Jan 2020

Meta RL/multitask RL with hypernetworks

Meta-learning and multitask learning have gained mainstream importance (see, for example, a new benchmark suite); they are hoped to be a way to scale-up reinforcement learning methods. Many algorithms have been proposed, including MAML, REPTILE, PEARL (see also recent work on critical discussion).

Another related area of research is continual learning. In recent work the authors propose to use the so-called hypernetworks in this area. In a nutshell, hypernetworks are networks, which generate weights of other networks, which are supposed to solve downstream tasks (these have been successfully used to a variety of sequence modelling tasks).

In this project, we will explore a natural idea of using hypernetworks to multitask/meta learning. In the simplest case, we will train a hypernetwork to output parameters related to various reinforcement learning tasks based on embeddings. There are many further interesting research directions like the automatic inference of the task embedding using task-dependent predictive modeling, joint learning the network and embedding, exploring the mapping of the embedding space to the task space, or applying meta-learning algorithms on the level of the hypernetwork.

This topic has been suggested by M. Zając.

Added 6 Jan 2020

From many to one and back

Many reinforcement learning MDPs can be quite naturally decomposed into a set of (semi)-independent tasks. Take, for example, the celebrated Atari game Montezuma's Revenge; each room constitutes a separate challenge. Alternatively, in a robotic task, grasping an object is often qualitatively different than placing it. Following this observation, a recent work, MEMENTO, proposed to break down the training procedure accordingly. Following Montezuma's example, each time new points are collected, a new context is initiated. Each context has its separate neural network initialized randomly.

Interestingly, the obtained results are much higher, arguably by the fact that the agent forgets its all skills let it be more 'open-minded' in the next room. This mechanism, even though effective, seems too brutal. In the project, we are going to investigate other methods of multitask learning, which retain some knowledge (e.g., it is quite believable that retraining the convolutional part of a network is wasteful). We propose the following methods a) using context as an additional input, b) using progressive networks, c) using gradient surgery, d) or cast it into a meta-learning framework e.g. guided meta-policy search.

Added 27 Dec 2019

Feature control as auxiliary tasks

This topic falls into a broad field of unsupervised reinforcement learning. Intuitively, we would like to acquire tasks without the proper RL reward. The rational behind is that the usual RL definition is usually too restraint making the training difficult or leading to suboptimal solutions.

One of ways to improve improve RL pipeline is by augmenting it with auxiliary tasks (see paper for one of early examples of this technique and the figure down). The task of defining he auxiliary tasks might be difficult itself and would be best to perform in an unsupervised way. Here we propose to take features of the neural network to be the reward signal.

Concretely, in the first experiment we borrow a network (figure up) for a recent work on successor representation. This work concentrates on prediction of features, we instead propose to control them. This can be obtained for example by multi-task learning.

Added 1 Dec 2019

Structured memory meta reinforcement learning

A possible intuitive understanding of meta reinforcement learning is that it is a process of learning fast RL algorithms applicable to some family of task. Put otherwise, one can argue that the MDP formalisation is too broad and thus precludes existence of efficient algorithms. This project is inspired by two, almost simultaneous papers (click pictures). Both the works pursue intriguing idea that an RNN module will learn a fast reinforcement learning algorithm tweaked to a family of task. The RNN module serves also as memory.

In this project we plan to decouple algorithmic and memory part and endow memory with a strong inductive bias. We speculate that this will make learning faster and will allow planning.

Concretely, in the first experiment we plan to mimic an experiment with ViZDoom mazes. In this task the environment is partially visible. We will add memory with a structure two dimensional rectangle (a sheet of paper) on which the agent will be able to draw. Except of basic task of navigation, we will add auxiliary task to reconstruct the map of the maze.

Added 2 Dec 2019

Model-free agent with planner

Arguably the basic distinction in reinforcement learning is this between model-free and model based methods. The former uses reactive policies and the later plans with a model. Both have their set of strength and weaknesses. For example usually model-free methods achieve better asymptotic performance and model-based methods excel in sample efficiency.

Not surprisingly there are number of attempts to join two methods. For example, one may include "planning-like" structure into neural network structure (see down figure). Another approach is to construct system which imagines possible actions and analyse their consequence. Possibly a most extreme approach is learning a planning module from scratch.

In this project we propose an intermediate solution (compared to model-free planning). Namely, we propose to endow a model-free agent, with an additional action - call a parametrisable planner. Intuitively, it is asking for an advice what to do to achieve a specific goal. Say, we consider a particular example of navigation, such a planner would be able to provide segments intermediate trajectories.

In principle a planner could solve the whole task, this undesirable as planner's quality depends strongly on the quality of the model and computational cost. We conjecture that a model-free agent can accommodate for this weaknesses and learn to efficiently use planners, even when provided a small computational budget.

Added 4 Dec 2019

Let us coordinate

Ability to coordinate is arguably a key factor behind a success of the humankind. Not surprisingly it is also an important topic for the reinforcement learning community. There are some recent successes as capture the flag challenge or learning the ability to use tools. They were obtained using a very high number of training samples (e.g. 0.5 billion of episodes) and various training tricks.

In this project we will tackle more modest tasks, but with the idea of developing a more principled approach. To this end we will develop a suite of coordination tasks of increasing difficulty. We will start with task known from psychological research (e.g. double pong task or herding phenomena). Ultimately, our suite will contain tasks both requiring and not various forms of language communication.

The project will be run jointly with J. Rączaszek-Leonardi and J. Zubek (Faculty of Psychology).

Added 4 Dec 2019


When dealing with challenging new RL domains it is helpful to develop tools for addressing strategic and tactical decision-making. These two perspectives complement each other: strategic perspective is global, static, and often imprecise, while tactical perspective is local, dynamic and exact. We argue that value function can be considered as an example of the former, while a planner as an example of the latter. Indeed, value function approximators provide noisy estimates (imprecise) of values (static) to every state (global). Conversely, planning provides optimal control, which starting from a given state (local) generates actions(dynamic) that are temporally coherent and result in a better-executed trajectories (exact). It is only reasonable to combine the two into one system. This is achieved by plugging the value function into the planner, to provide guided heuristics in the local search, shorten the search horizon, and make the search computationally efficient. For complex problems, the above setup has to be supplemented by an exploration strategy. Such a strategy might be based on modeling uncertainty of the value function approximation (e.g., using dropout sampling, variational inference, distributional RL, bootstrap ensemble, etc.). The uncertainty is quantified by a risk measure, which is then utilized by a planner to guide exploration.

Recently, a version of an MCTS algorithm has achieved super-human performance in notoriously hard games, such as chess or go. In the context of single player games, the so-called LevinTS algorithm has attracted some attention. The goal of the project is to fuse ideas from the two aforementioned search algorithms. Additionally, it is hypothesized that the resulting method can be further improved by a suitable choice of exploration strategy.

The project will be overseen by Łukasz Kuciński (IMPAN).

Added 5 Dec 2019

Multi-task exploration and multi-task learning

Balancing exploration and exploitation is a core concept in reinforcement learning. Exploration increases system knowledge but it costs! Balancing exploration and exploitation is a core RL challenge. Decreasing the exploration level slowly is costly in terms of training time (sample efficiency). Decreasing it fast usually leads to suboptimal behaviours.

The situation gets more difficult in multi-task setting, where various tasks would require various exploration schedules. In fact even single environment can be often decomposed into a set of sub-tasks which should be explored in different speed.

In this project we plan to study this phenomenon and evaluate possible solution. In the first phase we will construct and implement environments which we expect require different exploration levels (e.g. maze-world with bottleneck). In the second phase, we will run various algorithm, observing the behaviour of exploration. We expect that on the hardest environments learning will fail. In the third phase we will develop and test neural network architectures capable of using various exploration levels (e.g. similar to the one used for hierarchical learning, with different noise for each \phi_i).

Project joint with B. Osiński (University of Warsaw and

Added 5 Dec 2019