EXPLOration
- EXPLOitation
for efficient Resource
Allocation. Our proposal deals with the question of how to make the best possible use of available resources in order to optimize the performance of some decision-making task. In the case of simulated scenarios, the term resource refers to a piece of computational effort (for example CPU time, memory) devoted to the realization of some computation. Nonetheless, we will also consider the case of real-world scenarios where the term resource denotes some effort (real-world experiment) that has a real, e.g. financial, cost. Making a good use of the available resources means designing an exploration strategy that would allocate the resources in a clever way such as to maximize (among the space of possible exploration strategies) the performance of the resulting task. Potential applications are numerous and may be found in domains where a one-shot decision or a sequence of decisions has to be made, such as in optimization, control, learning, and games. For that purpose we will consider several ways of combining algorithms which perform a good job in balancing resources between exploitation (making the best decision based on our current, but possibly imperfect, knowledge) and exploration (decisions that may appear sub-optimal but which may yield additional information about the unknown parameters, and, as a result, could improve the relevance of future decisions). These exploration/exploitation algorithms, also called bandit algorithms, or regret-minimization algorithms, will be the building blocks of our methods. They will be combined either in a hierarchical way, or as a population, either in collaborative or adversary working mode. A motivating example concerns min-max tree search in large scale games. The goal here is to explore the tree to find the best move for the next play, given a limited amount of simulation resources (e.g., CPU time). Here, resource allocation means an exploration strategy that selects which branch one should explore deeper at each time step; the aim being that at the end of the available resources, the collected information allows making the best decision (or an almost optimal decision). Previous works in efficient tree exploration using hierarchical bandits for the game of go have shown very promising results (such as the MoGo program [Gelly et al., 2006] currently among the world best computer-go programs), which have motivated our research for extending both the theoretical analysis of the underlying ideas and their scope to a wide range of applications. We expect to develop new simulation techniques based on a clever use of available computational resources, in order to solve large scale optimization and decision making problems previously considered unsolvable. Our contributions will be theoretical (convergence issues, regret bounds), algorithmic (the design of algorithms adapting automatically to the unknown underlying structure of the problem), and numerical (we aim at solving real world large scale problems). This is a fundamental research project. It brings together academic partners covering a broad spectrum of expertise, from the most theoretical aspects (statistics, optimal control, game theory, statistical learning, decision theory, stochastic processes) to more applied and experimental skills (game programming, parallel computing). We believe that such a broad collaboration is mandatory to undertake this project. Although our research is motivated by solving large scale and, eventually, industrial applications, we aim at introducing and analyzing first new fundamental tools and methods; we believe that these are not yet at an industrial stage of development, which explains the fundamental-research aspect of this project. Keywords: resource allocation, numerical simulation, exploration / exploitation dilemma, regret minimization, bandit algorithms, population of bandits, learning from experts, tree and graph search, sequential decision making under uncertainty, optimization, game theory, reinforcement learning, multi-agent learning. |