This is a major revision, written from scratch, of the abstract games. The main objectives were to let the user add abstract games at will, and to let the user design external agents that can play against the MLM.
The main program (in the main.py module) gets the agents' information from the games.py module. Additional abstract games can be written in the games.py module simply by creating a new class for each new game; the current game classes can be studied and used as templates. Each new game (class) can have a different number of MLM agents. The agents' sensory abilities are defined using dictionaries. The games.py module also defines how the objective world reacts to the agents' actions. New game classes added to the games.py module are automatically included in the menus.
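For illustration only, a new game class might look like the sketch below. The class name, attribute names, and method signature are my own assumptions; the existing classes in games.py define the real interface and should be used as the actual templates.

    # Hypothetical skeleton of a new game class for games.py.
    # All names and signatures here are illustrative, not the real interface.
    class MyNewGame:
        """A two-agent abstract game, picked up automatically by the menus."""

        def __init__(self):
            self.name = "My New Game"      # label shown in the menu (assumed attribute)
            self.n_mlm_agents = 2          # each game can have a different number of MLM agents
            # Sensory abilities defined with dictionaries: which features
            # each agent can measure (assumed layout).
            self.sensors = {0: {"world_state": True, "other_agent_action": False},
                            1: {"world_state": True, "other_agent_action": True}}

        def world_reaction(self, actions):
            """How the objective world reacts to the agents' joint actions."""
            # Illustrative rule: world state '0' when the agents agree, '1' otherwise.
            return 0 if actions[0] == actions[1] else 1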
Additional external agents can be defined in the xxagent.py module. The existing external agents show how the interfaces are defined.
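The sketch below shows roughly how such an agent could be organized; the method names and signatures are assumptions, since the real interface is the one shown by the existing agents in xxagent.py.

    # Hypothetical external-agent skeleton for xxagent.py.
    # Method names and signatures are assumed; mirror the existing agents for the real interface.
    import random

    class RandomExternalAgent:
        """Plays a uniformly random action; a stand-in for a real strategy."""

        def __init__(self, available_actions=(0, 1)):
            self.available_actions = list(available_actions)
            self.last_observation = None

        def observe(self, world_state, mlm_action):
            # Record the last MLM action and the resulting world state.
            self.last_observation = (world_state, mlm_action)

        def choose_action(self):
            # A real agent would use self.last_observation to predict the MLM.
            return random.choice(self.available_actions)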
The MLM internal structure and inner predictive processes are found in the mlm40.py module, while the MLM reflex processes are found in the mlmreflexes.py module. The more complex environment evolution laws are coded in the worldo.py module, while the MLM inner objective aspects are coded in the worldi.py module. The remaining modules are for user interface purposes.
When the games are run, I believe the generated graphics are easy to understand. A short explanation of the graphics is included in the PowerPoint presentation below, along with many other explanations about the principles and workings of the machine.
When working with Spyder and the IPython console, inline graphics don't work well. For better results, in the Spyder menu, select:
Tools > Preferences > IPython console > Graphics > Backend = Automatic
To run a game in Spyder, I select the main.py tab and press F5. Sometimes, in my Windows 7 setup, the first tkinter menu window appears as an icon at the bottom of the screen and needs to be expanded manually. At the end, I just close all opened graphics windows and use F5 again.
A few minor changes to the MLM prediction and decision mechanisms were made, but the basic idea remains the same. Information is recorded in a list of short cinematic memories, called scenes. Scenes hold sequential information about the perceived states of the world and the agent's own actions. The agent learns as it evolves in the game, first trying random actions. At the start, the MLM agents know nothing about what is going on: their long-term memories are empty, and no background knowledge about the rules of the game is provided. All that is assumed is a subjective "good"/"bad" classification of the measured states and actions, called the inner evaluation. The inner evaluation is arbitrary, in the sense that it does not rely on any sensor to detect pain or pleasure. The inner evaluation allows the MLM to give a global evaluation (a global "good" or "bad") to each scene (or any of its parts), and to use the globally "good" and reliable continuations to decide what to do next.
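As a minimal sketch of these ideas (the data layout and the evaluation rule below are my own illustration, not the actual mlm40.py structures):

    # A scene: a short cinematic memory, i.e. a sequence of
    # (perceived world state, own action) pairs.
    scene = [("state_0", 1), ("state_1", 0), ("state_0", 1)]

    # Inner evaluation: an arbitrary "good" (+1) / neutral (0) / "bad" (-1)
    # labelling of states and actions, not derived from any pain/pleasure sensor.
    state_value = {"state_0": +1, "state_1": -1}
    action_value = {0: 0, 1: 0}

    def global_evaluation(scene):
        """Sum of the inner evaluations of the states and actions in a scene."""
        return sum(state_value[s] + action_value[a] for s, a in scene)

    print(global_evaluation(scene))   # globally "good" if positive, "bad" if negative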
The objective gains and losses for each state are also defined. They are related to the objective survival of the agent in a hostile environment. An adequate inner evaluation of the measured states promotes survival, and therefore it can be found by evolutionary (phylogenetic) learning. Assigning inner evaluations not only to perceived world states, but also to actions, makes it much faster to find a stable cooperation between agents.
A slightly modified PowerPoint presentation, found hereafter, describes the new ingredients. It still needs some revision of the specific games to better match the Python 3.4.3 implementation. Also, a short paper in the attachment explains how the MLM can implement the Somatic Marker Hypothesis. The basic idea is to introduce a sense of pain related to large objective losses, and to update the classification of the actions that led to those losses.
As always, learning is achieved with the reordering of records in the long-term memory (LTM). The learning mechanism works without statistical manipulations. The reordering of records changes the results of the linear search. The learning mechanism brings to the front what is globally "good" and reliable, while pushing down (ultimately to oblivion) the unreliable predictions. Forgetting is an essential part of MLM learning. The explore/exploit balance is dynamically tuned by a 'satisfaction' feature. Satisfaction increases when the predictions are found correct and the global 'good/bad' evaluation of the short-term memory is positive. When the satisfaction level is high, new sequences are seldom recorded in the long-term memory, and the patterns used to find a match and find a continuation are kept the same.
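A minimal sketch of this reordering idea, assuming a list-based LTM where a linear search returns the first match (the record format and the exact promotion/demotion rules are assumptions):

    ltm = []          # list of (pattern, continuation) records; the front is the most trusted
    LTM_SIZE = 50     # assumed capacity; forgetting drops the tail

    def recall(pattern):
        """Linear search: the first matching record is the prediction that gets used."""
        for record in ltm:
            if record[0] == pattern:
                return record
        return None

    def reinforce(record):
        """A correct and globally 'good' prediction is brought to the front."""
        ltm.remove(record)
        ltm.insert(0, record)

    def penalize(record):
        """An unreliable prediction is pushed down, eventually to oblivion."""
        ltm.remove(record)
        ltm.append(record)

    def record_new(pattern, continuation):
        """New sequences are recorded (mostly when satisfaction is low)."""
        ltm.insert(0, (pattern, continuation))
        if len(ltm) > LTM_SIZE:
            ltm.pop()     # forgetting is an essential part of MLM learning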
Besides the inner evaluation of actions, another new feature is the perception of external actions. Each agent not only perceives the world state; it can also focus on some other agent and record its actions. This makes it possible to implement a Tit-For-Tat strategy that respects the Measurement Logic assumptions, and also allows predictive imitation.
One good thing about this implementation is that it is much easier to scale out (meaning the addition of new features around the core process, like somatic markers or mirror features). Scaling out seems to me a more promising strategy than scaling up (e.g. increasing the size of the memories, or having more sensory dimensions). I believe the core process must remain simple, and that smart pre-processing of world data is a key to efficient AI.
As before, a few games are offered in the initial menu. I tried to find games with very diverse winning strategies, so the generality of the MLM framework can be tested. We can also assign some predefined fixed strategies to each MLM: Tit-for-Tat (TFT), Win-Stay-Lose-Change (WSLC), the action with the greatest averaged reward, and a random choice of actions. When the "act randomly" strategy is selected, we can define the frequency of the action '0'. The frequencies of all other available actions are then distributed evenly. For instance, if there are four available actions ('0', '1', '2', '3'), setting action '0' to 25% generates equal frequencies for the four actions. This makes it possible to see how the MLM plays along with another machine that features biased random actions, or even fixed actions.
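As a quick sketch of that frequency rule (my own illustration, not code from the project):

    import random

    def biased_random_action(p0, actions=(0, 1, 2, 3)):
        """Action '0' is drawn with probability p0; the others share 1 - p0 evenly."""
        if random.random() < p0:
            return 0
        return random.choice([a for a in actions if a != 0])

    # With four actions, p0 = 0.25 gives equal 25% frequencies for '0', '1', '2', '3'.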
The games included are:
Iterated Cooperation
Two or five agents (selectable at start) try to reach the same game state. A small amount of noise can be added to the agents' actions. The challenge is that the initial exploratory phase is made of random choices: each machine acts as noise for the others. But, even in the case of five MLMs, the machines soon become inter-predictable and converge to cooperation, typically in less than a hundred steps. Adding some noise makes cooperation more difficult, but the MLM ability to cooperate is quite robust. A larger LTM facilitates the recovery from cooperation failures.
Iterated Matching-Pennies
The two agents can choose heads or tails. The first agent wins when both choices match, i.e. both heads or both tails (world state '0'); the second wins otherwise (world state '1'). Several game configurations are given: playing against another MLM, playing against an external agent, playing against a fixed sequence, and playing against a delayed MLM.
When playing against another MLM, the size of the memory is an important factor, because of the "patience in failure" phenomenon. An "expert" that once won a turn may still be the only source of "good" predictions when the other machine inverts that result. It must be pushed down to oblivion, and that takes longer when the LTM is larger. The "patience in failure" therefore increases with the LTM size. In the Minority Game that's a clear advantage, but here it's a disadvantage.
The external agent is a machine that memorizes past sequences of MLM actions (sequences of one, two, or three MLM actions) and then locates the sequence that offers the best chance of predictive success. The external machine acts after guessing the next MLM action, and knowing that it must play the opposite action in order to win. The same mechanism is used to play the Paper-Scissors-Stone game.
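A rough reconstruction of that kind of sequence-memorizing opponent is sketched below (my own illustration, not the actual xxagent.py code; in particular, the "longest useful context" rule is an assumption):

    # Predict the next MLM action from its recent action sequences (lengths 1 to 3),
    # then play the opposite action to win at Matching-Pennies.
    from collections import defaultdict

    history = []                                     # past MLM actions
    counts = defaultdict(lambda: defaultdict(int))   # counts[context][next_action]

    def observe(mlm_action):
        """Record the MLM action after each context of length 1, 2 and 3."""
        for n in (1, 2, 3):
            if len(history) >= n:
                counts[tuple(history[-n:])][mlm_action] += 1
        history.append(mlm_action)

    def predict_next():
        """Guess the most frequent follow-up of the longest context already seen."""
        for n in (3, 2, 1):
            if len(history) >= n:
                stats = counts.get(tuple(history[-n:]))
                if stats:
                    return max(stats, key=stats.get)
        return 0    # default guess before any statistics exist

    def choose_action():
        return 1 - predict_next()   # play the opposite of the predicted MLM action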
The fixed sequence game is very convenient to check how the MLM solves the prediction problem for a specific periodic sequence. The sequence can be much longer than the MLM short predictive horizon (a maximum of three steps ahead). It's interesting to write a long sequence without thinking, and see how easily it is handled by the MLM. A random machine has a death rate of about 20 per thousand steps. The MLM can often do much better than that with human-generated "random" sequences. This shows that our perception of randomness is often illusory. The '01001100011100001111' sequence is one of the hardest for the MLM. The fixed sequence game also allows experimenting with the "superstitious learning" concept.
Playing against a delayed MLM answers the question: what happens if both machines see their opponent's choices, but the choice of the first machine only affects the next turn's payoff? The second machine will be able to win if it has a measurement that records the last two actions of its opponent. This also makes it possible to check how the MLM handles time lags against a second machine with a fixed strategy. Also, when the first machine uses the WSLC strategy and the second a TFT strategy, the first machine wins.
Iterated Paper-Scissors-Stone
Basically, this is a variant of the Matching-Pennies game, and the same general remarks apply.
Iterated Chicken-Dare
Imagine two car drivers on a head-on collision course. The one that swerves is the chicken; the one that keeps going straight is the daring one. If both dare, they crash, which is the least desired result. If both swerve, it's a tie, a neutral state. If only one swerves (the chicken), it's a desired state for the daring player and an undesired state for the chicken player. In this game, the best solution is obtained when both players classify the tie (i.e. the survival resulting from both swerving) as "good".
Iterated Prisoner's Dilemma
If both prisoners cooperate, they get a small reward. If both defect, both are punished. But the greatest reward and the greatest punishment come when one defects and the other cooperates. Since betrayal is preferred to cooperation (although constant cooperation is better than alternating betrayals), it's harder for cooperation to become the dominant strategy. But it still can become dominant, even in the presence of noise, as long as cooperation is classified as a wanted state. In classical game theory, the objective is to maximize the reward. It is therefore irrelevant whether the rewards are all negative or all positive; the best possible averaged payoff will be chosen. With the MLM, on the contrary, the wanted/neutral/feared classification of states is the defining criterion for the machine's behavior. If defection is classified as "bad", cooperation soon follows. In a standard inner evaluation of states, the positive value of defecting against the other agent's cooperation more than compensates for the negative value of mutual defection. In this case, if the MLM plays against a balanced Cooperation-Defection random choice, the MLM will win, choosing Defection most of the time. Since MLMs with empty long-term memories start playing randomly, several end results are possible: mutual cooperation can arise; one machine keeps defecting while the other keeps playing randomly; or both machines keep playing randomly. The size of the long-term memories is important, because it affects the machine's patience-in-failure. If no other positive continuation can be found, the machine sticks to the unreliable solution that once won, until that solution is pushed down to oblivion. This takes more turns as the memory gets larger.
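For concreteness, a typical payoff structure and a possible inner evaluation might look like the sketch below; the specific numbers and labels are my own illustration, not the values used in games.py.

    # Illustrative Iterated Prisoner's Dilemma payoffs (C = cooperate, D = defect).
    payoff = {("C", "C"): (3, 3),     # mutual cooperation: a small reward for both
              ("D", "D"): (-1, -1),   # mutual defection: both are punished
              ("D", "C"): (5, -3),    # the greatest reward for the defector...
              ("C", "D"): (-3, 5)}    # ...and the greatest punishment for the cooperator

    # A possible wanted/neutral/feared classification of the joint outcomes.
    # Unlike the payoffs, this classification is what drives the MLM's behavior;
    # labelling defection as "bad" is what lets cooperation become dominant.
    inner_evaluation = {("C", "C"): "wanted",
                        ("D", "D"): "feared",
                        ("D", "C"): "feared",
                        ("C", "D"): "feared"}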
Iterated Minority Game
Two locations (or rooms) are given, say L1 and L2, and, at each turn, the three agents choose their location. The agent found alone in one of the two locations wins, while the two others found together in the other location lose. This game also illustrates the importance of patience-in-failure. If players P1 and P2 stay in rooms L1 and L2, respectively (this can be imposed by choosing the "Random" basic reflex and setting the probabilities of choosing room "0" to 1 and 0, respectively), the third player P3 will circulate from one room to the other, always losing. The same would happen with a WSLC strategy. P1 and P2 will lose when P3 joins them, but their patience-in-failure will allow them to stay longer in their distinct rooms (or keep changing rooms in a coordinated way). Without any prediction for success, P3 will keep randomly choosing to stay or leave. Of course, P1 and P2 first need to find the favorable distinct-room configuration. This may take a while, since they are patient in failure. Also, a long sequence of random P3 'stays' in room L1, for instance, may push the winning predictive sequences of P1 to oblivion, and P1 will then also start to act randomly.
Iterated Four-Armed Bandit
It's a single-player game. At each turn, the agent chooses one of the bandit levers. The four levers (0,1,2,3) have different winning probabilities: 0.2, 0.4, 0.6, 0.8. The agent's task is to find the best lever. This is a classical and well-studied problem, with many real-life applications.
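A quick sketch of this environment (the lever probabilities are from the description above; the +1/0 reward values are my assumption):

    import random

    WIN_PROBABILITY = {0: 0.2, 1: 0.4, 2: 0.6, 3: 0.8}

    def pull(lever):
        """Return 1 on a win, 0 otherwise."""
        return 1 if random.random() < WIN_PROBABILITY[lever] else 0

    # The agent's task is to discover that lever 3 is the best one.
    rewards = [pull(3) for _ in range(1000)]
    print(sum(rewards) / len(rewards))    # close to 0.8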
Iowa Gambling Task
A game similar to the four-armed bandit, but with a much more complex reward structure. The agent selects a card deck among four card decks, say A, B, C, D. Deck A gives a -20 reward with 0.5 probability, and a +10 reward with 0.5 probability. Deck B gives a -120 reward with 0.1 probability, and a +10 reward with 0.9 probability. Deck C gives a +5 reward with 0.5 probability, and a zero reward otherwise. Deck D gives a -20 reward with 0.1 probability, and a +5 reward with 0.9 probability. The somatic marker option was written with this game in mind. The MLM nicely reproduces some experimental findings, as explained in the paper included in the attachments.
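The reward structure can be summarized with the small sketch below (probabilities and rewards are taken from the description above; the sampling code is my own):

    import random

    def draw(deck):
        if deck == "A":
            return -20 if random.random() < 0.5 else +10
        if deck == "B":
            return -120 if random.random() < 0.1 else +10
        if deck == "C":
            return +5 if random.random() < 0.5 else 0
        if deck == "D":
            return -20 if random.random() < 0.1 else +5

    # Expected values per draw: A = -5, B = -3, C = +2.5, D = +2.5,
    # so the long-run profitable decks are C and D.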
Iterated Ultimatum Game
At each turn, there is an amount of money to distribute. The first player chooses how to share the amount (five options: keeping the whole amount, 75%, 50%, 25%, or giving it all away). The second player accepts or rejects the first player's choice. If he rejects the choice, no money is distributed. If he accepts, the money is split according to the first player's choice. To make the game more hostile, I assigned a negative objective gain to the rejection option. Also, the choices are simultaneous, and therefore both players act based on predictions.
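A small sketch of the resulting payoffs (the five sharing options are from the description above; the total amount and the rejection penalty are assumed values):

    def ultimatum_payoffs(proposer_share, accepted, amount=100, rejection_penalty=-5):
        """proposer_share is one of 1.0, 0.75, 0.5, 0.25 or 0.0."""
        if accepted:
            return amount * proposer_share, amount * (1 - proposer_share)
        # Rejection: no money is distributed; the negative objective gain
        # (assumed here to hit both players) makes the game more hostile.
        return rejection_penalty, rejection_penalty

    print(ultimatum_payoffs(0.75, True))    # (75.0, 25.0)
    print(ultimatum_payoffs(0.75, False))   # (-5, -5)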
Target Ten
A single-player game that illustrates the "rewards found beyond a barrier of penalties" problem. The player starts with a cumulative amount of 0 (the minimum amount) and chooses between five available actions ("0" to "4"). Actions "1" and "3" decrease the cumulative amount by one or two units, respectively, if it's greater than zero. Actions "2" and "4" increase the cumulative amount by one or two units, respectively, if it's smaller than the maximum amount of 20. The machine gets a reward when the cumulative amount is exactly 10, and a penalty otherwise. In this game, all available fixed strategies ("always 1", "always 0", TFT, WSLC, action with maximum averaged gain, random action) are inadequate. This highlights the central role of random exploration for a really generic MLM, and its quick autonomous adaptation to totally different games. We could of course easily design a successful fixed strategy for this specific problem, but the nice thing about the MLM is that we don't need to do that.
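A minimal sketch of these dynamics (the rules are from the description above; the +1/-1 reward values are my assumption):

    def target_ten_step(value, action):
        """value is the cumulative amount (0..20); action is 0..4 ('0' does nothing)."""
        decrease = {1: 1, 3: 2}
        increase = {2: 1, 4: 2}
        if action in decrease and value > 0:
            value = max(0, value - decrease[action])
        elif action in increase and value < 20:
            value = min(20, value + increase[action])
        reward = 1 if value == 10 else -1
        return value, reward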
Inverted Pendulum
A simple modification of the Target Ten game that simulates an unstable system, similar to the classical inverted pendulum control problem. The cumulative value is calculated not only from the agent's actions, but also from the distance to the target value (new_value = 2 * value + action_value - target_value). At each failed attempt to keep the cumulative value within the boundaries, it is reset to a value near the unstable equilibrium (i.e. the target_value).
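A sketch of that unstable update rule (the 0..20 boundaries and the action-to-value mapping are assumptions carried over from Target Ten):

    TARGET = 10
    ACTION_VALUE = {0: 0, 1: -1, 2: +1, 3: -2, 4: +2}

    def pendulum_step(value, action):
        new_value = 2 * value + ACTION_VALUE[action] - TARGET
        if not 0 <= new_value <= 20:
            new_value = TARGET + 1   # failure: reset near the unstable equilibrium
        return new_value

    # Any deviation from TARGET doubles at each step, so the agent must keep
    # counteracting it: value 12 with action '0' becomes 14, then 18, then a reset.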
Street Fighter
Players can decide to attack or defend, striking and defending high or low. If both players strike, they both are damaged. If one player strikes high (low) and the other defends high (low), no damage is inflicted. If one player strikes high (low) and the other defends low (high), the defender is damaged. While striking or defending, no player gets a positive payoff. It's a lose-lose game, and the aim is to die less often. Because not very much is "good", few predictions are selected, and random choices occur often. When one agent dies, the other gets a victory prize. It's the only positive reward.
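The outcome logic can be sketched as follows (the move encoding and the unit damage values are my own illustration):

    MOVES = ("strike_high", "strike_low", "defend_high", "defend_low")

    def turn_damage(move_a, move_b):
        """Return (damage_to_a, damage_to_b) for one turn."""
        def hit(attacker, defender):
            if not attacker.startswith("strike"):
                return 0
            if defender.startswith("strike"):
                return 1    # both strike: both are damaged
            # A defense only blocks a strike at the same height.
            return 0 if attacker.split("_")[1] == defender.split("_")[1] else 1
        return hit(move_b, move_a), hit(move_a, move_b)

    print(turn_damage("strike_high", "defend_low"))    # (0, 1): the defender is damaged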
Besides these games, there is also a simple implementation of the Bridge World Metaphor (BWM). The metaphor is presented and explained in the PowerPoint file found hereafter. The BWM implementation allows experimenting with the notion of integrated intelligence, which relies on clever sensors, adequate basic reflexes, cinematic reasoning, post-cinematic reflexes, adequate actuators, etc.
The current implementation works on basic reflexes only, with very simple sensory-motor abilities. To understand the animation that is produced at the end of each run, keep in mind that:
The MLM virtual robot (MLM-VR) features five actions: Do nothing (i.e. wait); accelerate; turn back; turn left; brake. With the exception of doing nothing, all actions require a minimum level of fuel supply.
The MLM-VR is equipped with two sensors: one in the body, and one ahead of the body, in the direction the MLM-VR is moving. Each detects the type of ground over which it is located (i.e. bridge, feeding region, void).
The MLM-VR is equipped with reflex sequences of actions. Each sequence is written as a list of elementary actions that is performed repeatedly for as long as the MLM-VR stays in the same sensory state (see the sketch after this list). Of course, the sequence can be a singleton.
The MLM-VR starts refueling when its fuel level falls to a low threshold, and stops refueling when the fuel level reaches the maximum value. The increasing fuel level is shown by the increasing diameter of the circle that represents the MLM-VR body.
The fuel supply is a rectangular zone (the fuel region, FR) that moves around a center point in the bridge, while trying to avoid the MLM-VR. The FR becomes more agile as it is depleted of fuel. The gradual depletion happens when the MLM-VR is refueling over the FR, and it is shown with the increasing transparency of the FR. Without the MLM-VR over it, the FR gradually recovers its fuel level.
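A sketch of the reflex-sequence mechanism mentioned above (the sensory-state encoding and the particular action lists are my own illustration):

    # Each sensory state (body sensor, forward sensor) maps to a list of elementary
    # actions that is repeated cyclically while the sensory state does not change.
    reflexes = {("bridge", "bridge"): ["accelerate"],            # a singleton sequence
                ("bridge", "void"): ["brake", "turn_back"],
                ("feeding_region", "feeding_region"): ["wait"]}

    def next_reflex_action(sensory_state, step_in_state):
        """Return the elementary action for this step within the current sensory state."""
        sequence = reflexes.get(sensory_state, ["wait"])
        return sequence[step_in_state % len(sequence)]

    print(next_reflex_action(("bridge", "void"), 0))    # 'brake'
    print(next_reflex_action(("bridge", "void"), 1))    # 'turn_back'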