Exploring the Predictable

This work presents a framework for artificial intelligence (AI) systems to enhance their exploratory capabilities by incorporating surprise, curiosity, and competition into the system's modules. The article presents a novel way of integrating various existing and well-known concepts to build a curious, exploratory framework that tries to learn and predict abstract internal representations of the environmental state which are neither trivial (boring) nor random (and thus unpredictable).

The key point of the article is the concept of embedding and exploiting surprise to drive the desired learning and prediction capability. The article points out that 'surprise' arises when the unexpected happens although something else was expected (and thus predicted). Surprise therefore has two components: 1) something has been confidently predicted as likely to happen, and 2) the unexpected happens and the prediction does not come true. The article also notes that a single input is seldom surprising by itself; surprise usually comes in the form of an event (a surprising sequence of inputs). In fact, the predictable and surprising aspects of an event are often spatio-temporal abstractions of the input sequence. This means that neither all the well-defined parameters of an input nor a complete input sequence is surprising; what is surprising (or predictable) are certain abstractions (abstract patterns or aspects) of the input sequence.
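This two-component definition can be captured in a few lines. The following is a minimal sketch (the names Prediction and is_surprise and the confidence threshold are illustrative assumptions, not from the article):

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    value: object      # the abstract aspect the module expects to observe
    confidence: float  # how strongly the module commits to this outcome

def is_surprise(prediction: Prediction, outcome: object,
                confidence_threshold: float = 0.5) -> bool:
    """Surprise needs both components: a confident prediction,
    and an outcome that contradicts it."""
    was_confident = prediction.confidence >= confidence_threshold
    prediction_failed = prediction.value != outcome
    return was_confident and prediction_failed
```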

Conventional approaches, problems and limitations:

The article then discusses the limitations of conventional exploratory approaches. Conventional approaches attempt to select the training examples that maximize the traditional information gain (usually referred to as Shannon information). Such systems use simple reinforcement learning in which a predictor attempts to predict all the parameters of the next input (environmental state) based on the current input and the current action sequence (given by an action-generating module). If the predictor fails, the action-generating module is rewarded. Thus, the action-generating module is always motivated to provide action sequences that result in unpredictable (and hence new and informative) environmental states.
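The loop described above can be sketched as follows (Environment, Predictor, and ActionGenerator and their methods are hypothetical stand-ins, not the article's implementation):

```python
def conventional_curiosity_step(env, predictor, action_generator):
    """One step of the conventional scheme: the predictor tries to predict
    ALL parameters of the next environmental state, and the action-generating
    module is rewarded whenever that prediction fails."""
    state = env.observe()
    actions = action_generator.propose(state)
    predicted_next = predictor.predict(state, actions)  # full next input
    actual_next = env.step(actions)
    if predicted_next != actual_next:
        action_generator.reward(1.0)  # paid for steering into the unpredictable
    predictor.train(state, actions, actual_next)
    return actual_next
```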

There are two main aspects in which such conventional approaches are limited. First, since the action-generating module is rewarded for any unpredictable outcome, it might come to prefer white noise due to its unpredictable nature (a small sketch after the list below illustrates this trap). Such white noise is of no consequence to the learning and prediction procedure, as it does not contribute towards enhancing knowledge or prediction capability. Second, the predictor typically generates a complete set of input parameters, out of which only a few parameters (or some abstraction of them) are actually computable and predictable. However, since the predictor generates the complete set, even if it made correct predictions about the predictable parameters, it might be declared a failure due to incorrect predictions of the remaining parameters, which were not computable in the first place. Further, it is wasteful to divert effort away from computable and predictable things in order to predict all the parameters, which may not be information-rich or easily interpretable/learnable. The article also identifies some of the possible learnable regularities:

i. Observation/Prediction: selected environmental state inputs may match the result of some ‘previous’ internal computation.

ii. Observation/Explanation: selected ‘memorized’ environmental states may match some ‘later’ computations.

iii. Planning/Acting: outcomes of 'previous internal computations' and chosen 'actions' may result in the 'desired' environmental states.

iv. Internal regularities: Internal computations meeting expectations.

v. Combination of any or all of the above.
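To make the first limitation concrete, here is a toy illustration (with made-up numbers) of the white-noise trap mentioned above: a coin flip is maximally unpredictable, so an error-seeking agent keeps being rewarded for watching it, while the predictor never improves:

```python
import random

def watch_white_noise(steps: int = 1000) -> float:
    """The conventional reward signal on pure noise: about half the steps
    reward the action-generating module, yet nothing learnable is gained."""
    reward = 0.0
    for _ in range(steps):
        prediction = random.choice([0, 1])   # the best any predictor can do
        observation = random.choice([0, 1])  # pure, unlearnable noise
        if prediction != observation:
            reward += 1.0
    return reward  # ~steps/2: reward keeps flowing, knowledge does not
```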

Issues and ideas:

After identifying the problems with conventional approaches and the learnable regularities, the article deduces that the predictor needs to predict only a few abstract features (internally representable in some form) derivable from the input sequence, and that a method is needed to discard instances of white noise. It then raises the following issues regarding the requirements:

1) How does one extract novel predictable concepts?

2) Which novel input-transforming algorithms indeed produce the internally representable abstract features (that might be useful for prediction)?

3) How does one discover novel spatio-temporal regularities among the many random and unpredictable things (most of which should be ignored)?

To answer these questions, the article proposes some simple but important ideas:

· Allow the agent to have the following capabilities: a) composing algorithms that permit construction of a universal Turing machine, b) an instruction set that can access the current input, modify the environment by deciding actions, and modify an internal memory with many addressable cells, and c) arithmetic computation and conditional jumps. (A minimal sketch of such an instruction set follows this list.)

· The internal representation may be as abstract as the contents of a particular memory cell, or some algorithmic computation over internal memory cells.

· Let there be a capability to express confidence in a prediction via a 'Bet' statement. The loser of a bet is surprised, since it was confident while betting, and will learn.

· Let there be an internal reward for the 'pure exploratory' instinct, where a module that is able to learn surprising things and surprise other modules is rewarded.

· Form a co-evolving pair of modules (the scheme is described below) with the above capabilities. Such a pair can be designed to take care of the three issues (discussed above) automatically.
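As an illustration of the first idea, the following is a minimal sketch of an instruction-set interpreter with the listed capabilities (the opcodes, ToyEnv, and the program encoding are assumptions for illustration only):

```python
class ToyEnv:
    """Trivial stand-in environment, not from the article."""
    def __init__(self):
        self.state = 0
    def observe(self):
        return self.state
    def step(self, action):
        self.state += action

def run_program(program, memory, env):
    """Interpret a program that can read the current input, act on the
    environment, address internal memory cells, do arithmetic, and jump
    conditionally -- in principle enough for universal computation."""
    pc = 0  # program counter
    while 0 <= pc < len(program):
        op, *args = program[pc]
        if op == "GET_INPUT":        # read current input into a memory cell
            memory[args[0]] = env.observe()
        elif op == "ACT":            # modify the environment via an action
            env.step(args[0])
        elif op == "ADD":            # arithmetic on addressable memory cells
            memory[args[0]] = memory[args[1]] + memory[args[2]]
        elif op == "JUMP_IF_ZERO":   # conditional jump
            if memory[args[0]] == 0:
                pc = args[1]
                continue
        pc += 1
    return memory
```

For example, run_program([("GET_INPUT", 0), ("ADD", 1, 0, 0), ("ACT", 1)], [0, 0], ToyEnv()) reads the input into cell 0, doubles it into cell 1, and then acts on the environment.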

Co-evolving modules scheme:

Two modules, X and Y, which have the above capabilities and are essentially similar, are considered. Initially each of them has zero internal reward. X and Y jointly decide on a set of instructions (an algorithm) that they will execute. Both predict the outcome, Sx and Sy respectively. If Sx = Sy, the prediction is considered trivial (already known and boring). Differing predictions, Sx ≠ Sy, form the premise of 'surprise' for one of them. After the execution of the agreed algorithm, if Sx comes true, X gets a reward and Y gets a punishment of the same magnitude (and vice versa). This ensures that the total internal reward of the system for demonstrating curiosity is zero. Thus, though the external reward may improve or deteriorate due to the knowledge gained through curiosity, curiosity itself becomes an internal drive rather than an externally rewarded behavior. The winner's 'sequence of module modifications' is then copied to the loser so that both have the same knowledge regarding the agreed algorithm and its prediction (a sketch of one such round follows below). The incremental self-improvement (IS) algorithm keeps track of the success stories. If there is no incremental improvement in reward over successive checkpoints, it removes the 'sequence of module modifications' (SMM) corresponding to the current checkpoint and retains the knowledge up to the previous checkpoint.
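One round of this scheme might look as follows (the module internals, the predict/execute methods, and the stake are hypothetical; the summary above does not specify what happens when neither prediction comes true, so that case is simply skipped here):

```python
import copy

def bet_round(x, y, algorithm, env, stake=1.0):
    """One zero-sum bet between co-evolving modules X and Y."""
    s_x = x.predict(algorithm, env)
    s_y = y.predict(algorithm, env)
    if s_x == s_y:
        return  # identical predictions: trivial, boring, nothing to learn
    outcome = algorithm.execute(env)
    if outcome == s_x:
        winner, loser = x, y
    elif outcome == s_y:
        winner, loser = y, x
    else:
        return  # neither prediction came true; unspecified in the summary
    winner.reward += stake  # total internal reward of the pair stays zero,
    loser.reward -= stake   # so curiosity is purely internal
    # the loser receives the winner's sequence of module modifications (SMM)
    loser.smm = copy.deepcopy(winner.smm)
```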

Salient features of such scheme are:

1. An outcome is considered novel as long as one of the two modules finds it surprising.

2. The fact that they predict an outcome establishes that they are confident of knowing it. Thus, unpredictable things like white noise are inherently avoided.

3. The agreement on the algorithm ensures that both agree on the information and computation involved, and that the 'Bet' is fair. Also, the copying of the winning module's modification sequence ensures that information, once learnt, is available to both, so that the winning module does not keep surprising the loser with the same trick; instead, both can focus on finding something new.

4. Since every module wants the reward, every module tries to come up with novel information (rather than something already learnt or something very random) and to lure the other module into agreeing to the algorithm. Every module is also motivated to reject an algorithm that may not lead to a favorable result. Thus, both modules focus on learning simple new learnable things rather than random or trivial things.

5. The acceptance of a bet ensures that the agreed algorithm inherently takes care of the three issues pointed out above.

6. The IS scheme ensures that the modules are not simply surprising each other and learning regularities at random; rather, they are genuinely improving their learning capabilities. In the IS scheme chosen, each SMM is held responsible for the quality of future SMMs. Thus, if the SMMs following a given SMM perform incrementally worse, they are removed one by one until the root of the poorly performing SMMs is removed (a sketch of this bookkeeping follows below).
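This checkpoint bookkeeping might be sketched as follows (the stack layout, the reward-rate criterion, and the undo method are simplifying assumptions, not a faithful reproduction of the IS scheme):

```python
def checkpoint(smm_stack, current_time, current_reward):
    """Pop SMMs whose reward rate has not improved on their predecessor's.
    Each stack entry is (smm, time_of_creation, reward_at_creation)."""
    while smm_stack:
        smm, t0, r0 = smm_stack[-1]
        rate_since_smm = (current_reward - r0) / max(current_time - t0, 1e-9)
        if len(smm_stack) >= 2:
            _, t1, r1 = smm_stack[-2]
            rate_since_prev = (current_reward - r1) / max(current_time - t1, 1e-9)
        else:
            rate_since_prev = 0.0  # baseline: before any modification
        if rate_since_smm > rate_since_prev:
            break            # the success story still holds; keep the rest
        smm_stack.pop()      # this modification did not pay off ...
        smm.undo()           # ... restore pre-SMM module state (hypothetical)
    return smm_stack
```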

Experimental results:

The experiment without external reward demonstrates that each module exhibits both behaviors: surprising and being surprised. Initially, however, surprises are rare and one module outperforms the other for a long stretch. Later, both become efficient at surprising each other quite often, indicating that both modules have improved considerably. In the absence of external reward, curiosity is the only form of reward; thus, 'Bet' is the main instruction carried out by the modules. In the experiment with external rewards and no initial knowledge, the modules behave curiously and exploratorily for an initial period, after which they switch to grabbing external rewards (once the existence of these rewards, and the exploratory information needed to grab them, is known). Due to this, the scheme outperforms conventional schemes most of the time. However, if there are strong external rewards, the exploratory advantages of curiosity are lost in the long run, and the modules become more and more goal-oriented, performing similarly to or worse than conventional schemes.

The article successfully implements and demonstrates curious agents that learn through exploration and surprise. However, whether curiosity is indeed useful varies from case to case, though it is generally observed that curious systems accumulate more knowledge with which to interact with the world in a desirable manner.