A more complete and detailed presentation can now be found in the MLMain.pdf document attached to the home page.
Here we explain, with a simple example, the basic mechanism that uses the dominance-list memory (DLM) to implement reinforcement learning in the M-Logic Machine (MLM). This highlights some unique features of its machinery and semantics.
Let us consider a simple version of the matching-pennies problem. The human-like machine M plays against a player P. In each turn, both simultaneously choose to hold either one or two coins in their hand. Player M wins if the numbers of coins in their hands turn out to be equal, and loses if they are different. For this example, we shall assume the rule for P is to play with two coins every fifty turns, playing one coin in the remaining turns. The pleasure and pain of M are situated in the game, with pleasure when it wins and pain when it loses. This is the first step in defining a "goal" for M. Notice that the notion of pleasure and pain is operational. Pain is just like any other measurement, except for the fact that it triggers specific motor responses in the machine by means of a heuristic.
Obviously, M does not know the future. When possible, the motor actions of machine M will be triggered according to the guesses of a belief generator for the immediate future, according to a given heuristic, say H1. The heuristic used also participates in defining the "goal". If the belief generator fails to provide a guess for a motor action, a random generator will trigger a motor action anyway.
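To fix the ideas, here is a minimal Python sketch of a single turn under these rules; the names (opponent_move, choose_motor, play_turn) and the choice of counting turns from 1 are assumptions made only for this illustration:

import random

def opponent_move(turn):
    # P's assumed rule: two coins on every fiftieth turn, one coin otherwise
    # (turns counted from 1).
    return "2c" if turn % 50 == 0 else "1c"

def choose_motor(belief_motor):
    # Follow the belief generator's guess when it produced one,
    # otherwise fall back to a randomly triggered motor action.
    if belief_motor is not None:
        return belief_motor
    return random.choice(["1c", "2c"])

def play_turn(turn, belief_motor):
    m_move = choose_motor(belief_motor)
    p_move = opponent_move(turn)
    # Pleasure when the numbers of coins match, pain otherwise.
    tag = "pl" if m_move == p_move else "pn"
    return tag, m_move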
Let us assume that the sensory memory of M records a single frame per turn, in two micro-steps. First, it gets the current situation; second, it records the sensed force related to the motor action triggered. In this setting, the number of channels in each frame can be just the two basic dedicated channels required for any reinforcement learning problem. The first channel records the pleasure and pain tags, noted pl and pn, and the second channel records the motor choice of M, 1c or 2c (i.e. producing one or two coins), which will be confronted with the choice of P in the next turn. Recording the sensory information regarding the number of coins is redundant. Let us limit the short-term memory scenes to eight frames. After eight turns, with player P always choosing to produce one coin, the short-term memory (STM) of M could contain, for instance, the following sequence of frames, with the motor choices of M randomly generated:
STM: [ [pn, ?Motor], [pl, 2c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 2c], [pn, 1c], [pl, 2c] ] (STM1)
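A frame can be represented as a (tag, motor) pair and the STM as a fixed-capacity sequence with the most recent frame first; the following minimal sketch uses hypothetical names and a Python deque to obtain the push-to-oblivion behaviour described next:

from collections import deque

STM_CAPACITY = 8  # beyond this limit, the oldest frame is pushed into oblivion

# A frame is a (tag, motor) pair: tag in {"pl", "pn"}, motor in {"1c", "2c", None},
# where None stands for the still-undefined ?Motor of the latest frame.
stm = deque(maxlen=STM_CAPACITY)

def record_frame(stm, tag):
    # First micro-step of a turn: sense the current situation.
    # The most recent frame is kept at the left, as in the scenes shown here.
    stm.appendleft((tag, None))

def set_latest_motor(stm, motor):
    # Second micro-step: record the sensed force of the triggered motor action.
    tag, _ = stm[0]
    stm[0] = (tag, motor)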
The latest motor action is still undefined, and this is indicated with "?Motor". The most recent frame of this scene is to the left. When the frame limit is reached, every new frame recorded in the STM will push the oldest frame into oblivion. Since M will search the dominance-list memory (DLM) to generate beliefs, we need to define an STM-to-DLM record criterion. This criterion also participates in defining the "goal". For instance, we may rule that the STM-DLM record mode will switch from OFF to ON when a pn tag follows two consecutive pl tags, and will switch from ON to OFF after a certain size of the DLM scene is reached, say twelve frames. Let us call this recording criterion RC1.
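The RC1 criterion can be written as a small predicate over the most recent tags; a sketch under the same assumptions as above:

def rc1_switch_on(stm):
    # Record mode switches from OFF to ON when a pn tag follows two
    # consecutive pl tags; most recent frame first, so the relevant
    # tags are those at indices 0, 1 and 2.
    if len(stm) < 3:
        return False
    tags = [frame[0] for frame in list(stm)[:3]]
    return tags == ["pn", "pl", "pl"]

def rc1_switch_off(dlm_buffer, limit=12):
    # Record mode switches back to OFF once the buffered scene
    # reaches the twelve-frame limit.
    return len(dlm_buffer) >= limit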
When the record mode switches from OFF to ON, the first thing recorded in a temporary buffer is the STM record, in its totality. This will give M some information about the recent past prior to the recording. We therefore start placing in the DLM input buffer (DLMIB) a copy of the STM1 scene:
DLMIB: [ [pn, ?Motor], [pl, 2c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 2c], [pn, 1c], [pl, 2c] ] (DLM1)
As new frames are recorded in the STM, they are added to DLM1 up to twelve frames. When the recording mode gets back to OFF, the buffer content is transferred to the DLM. We might get, for instance:
STM: [ [pn, ?Motor], [pl, 2c], [pl, 1c], [pn, 1c], [pn, 2c], [pl, 2c], [pl, 1c], [pl, 1c] ] (STM2)
DLM: [ [pn, ?Motor], [pl, 2c], [pl, 1c], [pn, 1c], [pn, 2c], [pl, 2c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 2c], [pn, 1c], [pl, 2c] ] (DLM2)
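The recording cycle itself can be sketched as follows, assuming the DLM is kept as a list of scenes with the dominant scene first; start_recording, record_new_frame and close_recording are hypothetical names:

def start_recording(stm):
    # RC1 switched ON: the whole STM record is copied into the input
    # buffer (DLMIB), preserving the most-recent-first order.
    return list(stm)

def record_new_frame(dlm_buffer, frame, limit=12):
    # Every new frame recorded in the STM is also prepended to the buffer,
    # up to the scene limit; returns True when it is time to switch OFF.
    if len(dlm_buffer) < limit:
        dlm_buffer.insert(0, frame)
    return len(dlm_buffer) >= limit

def close_recording(dlm, dlm_buffer):
    # Record mode switched OFF: the buffered scene is transferred to the
    # DLM as its newest, topmost scene.
    dlm.insert(0, dlm_buffer)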
Now that the record mode has gone back to OFF, the DLM2 scene becomes available in the DLM for prediction. Let us assume the following heuristic is the only one used by the query generator and belief generator:
H: [ [pl, _ ], [?PnPlSTM, ?Motor] ]DLM (H1)
The second frame of the heuristic is anchored to the present-time moment. In words, the query generator first looks up the actual value of the pleasure/pain channel in the most recent frame of STM2, and brings the result to the working memory (WM). This gives a query for the H1 heuristic:
WMQG: [ [pl, _ ], [pn, ?Motor] ]DLM (QG1)
Second, the belief generator looks sequentially through the available DLM scenes, scanning each scene from its most recent frame to its oldest in search of a match to the heuristic, and brings the result to the WM. Since there is only one scene in the DLM, the first match found is the following belief:
WMBL: [ [pl, _ ], [pn, 1c] ] (BL1)
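A sketch of this query and matching step, using the same representation as before and the reading of H1 illustrated by QG1 and BL1; the belief generator returns the motor value that instantiates ?Motor, or None when no match is found:

def generate_query(stm):
    # Query generator: the present-time slot of H1 is filled with the
    # pleasure/pain tag of the most recent STM frame (pn in QG1 above).
    current_tag, _ = stm[0]
    return current_tag

def generate_belief(dlm, current_tag):
    # Belief generator: scan scenes from the dominant (topmost) one down,
    # and each scene from its most recent frame to its oldest, looking for
    # a [pl, _] frame immediately preceded in time by a frame whose tag
    # matches the query; that older frame's motor value instantiates ?Motor.
    for scene in dlm:
        for newer, older in zip(scene, scene[1:]):
            newer_tag, _ = newer
            older_tag, older_motor = older
            if newer_tag == "pl" and older_tag == current_tag and older_motor is not None:
                return older_motor  # only the first match is used, as with BL1
    return None  # no belief: a random motor action is triggered instead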
The motor action according to BL1 is then triggered and the corresponding force recorded in STM2. This completes the second frame micro-step, and we get a fully determined scene:
[ [pn, 1c], [pl, 2c], [pl, 1c], [pn, 1c], [pn, 2c], [pl, 2c], [pl, 1c], [pl, 1c] ] (STM2a)
We thus see that DLM2 provides a micro-theory for the H1 heuristic, with two micro-rules:
[ [pl, _ ], [pn, 1c] ] (mR1)
[ [pl, _ ], [pl, 1c] ] (mR2)
Remember that only the first match in DLM2 is used to define the micro-rule.
What happens next? Since the motor actions of M are now determined by the belief generator, M will start winning until the fiftieth turn arrives and P produces two coins. Until then, the record mode stays OFF, and DLM2 is the only micro-theory available. According to the RC1 rule, when a new pn occurs after two consecutive pl, the record mode is again set to ON. After four more turns, the configuration becomes:
STM: [ [pl, Motor], [pl, 1c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 1c], [pl, 1c], [pl, 1c] ] (STM3)
DLM: [ [pl, Motor], [pl, 1c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c] ] (DLM3)
DLM: [ [pn, Motor], [pl, 2c], [pl, 1c], [pn, 1c], [pn, 2c], [pl, 2c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 2c], [pn, 1c], [pl, 2c] ] (DLM2)
From now on the DLM3 scene will be scanned first, because it is higher in the dominance-list memory. It becomes the dominant micro-theory for H1, and DLM2 is forgotten "by interference". But the new micro-rules provided by DLM3 to the H1 heuristic are identical to mR1 and mR2, therefore the machine will simply ignore the exceptional choice of P playing two coins (which is, by the way, an optimal strategy for M when the exact moment of the exception cannot be predicted). The only long-term effect is on the DLM configuration, with some reordering of DLM2, DLM3, and any other scenes similar to DLM3 that will be recorded in the meantime. This happens because each time a scene fails to give an accurate prediction (something discovered by the knowledge acquisition process) it is pushed down in the DLM, while it is pulled up when the prediction is correct. Since each failure starts a new recording, more and more scenes identical to DLM3 are added to the top of the DLM. When the limit of the DLM capacity is reached, the scenes at the bottom start being pushed into oblivion. This is the basic mechanism that implements the machine's reinforcement learning.
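The reordering and forgetting steps can be sketched as follows; moving a scene by exactly one position per outcome is an assumption of this sketch, since the text only fixes the direction of the move:

def reorder_dlm(dlm, scene_index, prediction_correct):
    # A scene is pulled up when its prediction was correct and pushed
    # down when it failed (assumed step: one position per outcome).
    i = scene_index
    j = max(i - 1, 0) if prediction_correct else min(i + 1, len(dlm) - 1)
    dlm.insert(j, dlm.pop(i))

def forget_overflow(dlm, capacity):
    # When the DLM capacity is exceeded, the scenes at the bottom of the
    # dominance list are pushed into oblivion.
    del dlm[capacity:]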
Let us now suppose that P, instead of keeping on playing two coins every fifty turns, gets tired of losing and decides to invert its playing rule, producing two coins instead of one from now on. From the configuration above, four turns after this new rule of P starts, we get:
STM: [ [pn, Motor], [pn, 1c], [pn, 1c], [pn, 1c], [pn, 1c], [pl, 1c], [pl, 1c], [pl, 1c] ] (STM4)
DLM: [ [pn, Motor], [pn, 1c], [pn, 1c], [pn, 1c], [pn, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c] ] (DLM4)
DLM: [ [pl, Motor], [pl, 1c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c], [pl, 1c] ] (DLM3)
DLM: [ [pn, Motor], [pl, 2c], [pl, 1c], [pn, 1c], [pn, 2c], [pl, 2c], [pl, 1c], [pl, 1c], [pn, 1c], [pl, 2c], [pn, 1c], [pl, 2c] ] (DLM2)
But, alas, DLM4 won't change the micro-rules for H1, and nothing further will be recorded while M keeps using RC1. This means that, even if we keep reordering DLM2, DLM3, and DLM4 according to their predictive accuracy, the machine keeps losing. Reordering the DLM is not enough. The machine needs to unlearn. This is done by erasing the micro-theories after a given number of consecutive failures (we can thus tune the machine's "patience in pain"). After the DLM cleaning process, we are back to random moves and the learning process starts again. Notice that, if player P takes a long time to shift its rule, many DLM3-like scenes are found in the DLM, and the cleaning process takes longer. In this relearning setting, the heuristic H1 is good enough in two situations. First, when the rule for P has a periodicity of one or two turns, with a small amount of noise. Second, although the heuristic H1 is unable to detect periodicities larger than that, it may still be good enough when the periodicity is so large that it gives the machine time to unlearn and relearn good micro-rules for H1. Of course, the shift must be fast enough to avoid compromising the machine's survival. This will depend on the initial survival assets of M and on how hostile the world is. In between these two extreme situations, other simple and fast heuristics and recording strategies can be implemented to cover the deficiencies of H1 as much as possible.
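A sketch of this unlearning step; the patience threshold is the tunable parameter mentioned above:

def maybe_unlearn(dlm, consecutive_failures, patience):
    # "Patience in pain": after the given number of consecutive failed
    # predictions, the accumulated micro-theories are erased and the
    # machine falls back to random moves, restarting the learning process.
    if consecutive_failures >= patience:
        dlm.clear()
        return 0  # failure counter is reset
    return consecutive_failures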
Another simple strategy to get out of the learning deadlock is to always allow some randomness in the motor actions triggered, even when a belief is successfully generated. This is what I call the "keep-exploring principle", an idea frequently used to escape the curse of local maxima in learning algorithms (in this case it may well induce the machine to periodically leave a global maximum). If we add to the RC1 recording rule another rule RC2 that starts recording when two consecutive pn are followed by a pl, a random 2c choice of M will soon bring to the DLM an adequate micro-theory for H1. This may be faster than DLM unlearning.
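In a sketch, the keep-exploring principle is a small modification of the action-selection step shown earlier; the exploration rate is an assumed tuning parameter, not fixed by the text:

import random

def choose_motor_keep_exploring(belief_motor, explore_rate=0.05):
    # Even when a belief is available, trigger a random motor action with
    # a small probability, so the machine never stops exploring.
    if belief_motor is None or random.random() < explore_rate:
        return random.choice(["1c", "2c"])
    return belief_motor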