RIKEN CBS BSTP "Free-Energy Principle" Group Work

Game Rules

* ◼︎Green cell = reward (+1 pt), ◼︎Red cell = punishment (−1 pt), ◼︎Yellow cell = both reward and punishment (0 pt), ◼︎Black cell = neither (0 pt).

* ◼︎Blue cell = Creature (your agent).

* The creature can observe the neighboring 5×5 cells.

* At each time step, the creature moves to one of the neighboring cells (up↑, down↓, left←, right→).
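
As a concrete reading of the rules above, the sketch below encodes the arena, the 5×5 observation window, and the per-cell scores in Python. The integer cell codes, the zero-padding at the borders, and the clipping of moves at the arena edge are assumptions made purely for illustration; the actual game may handle colors and boundaries differently.

```python
import numpy as np

# Assumed integer codes for the cell colors (black, green, red, yellow).
EMPTY, REWARD, PUNISH, BOTH = 0, 1, 2, 3
SCORE = {EMPTY: 0, REWARD: +1, PUNISH: -1, BOTH: 0}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def observe(grid, row, col):
    """Return the 5x5 window of cells surrounding the creature at (row, col)."""
    padded = np.pad(grid, 2, constant_values=EMPTY)  # pad so the window always fits
    return padded[row:row + 5, col:col + 5]

def step(grid, row, col, move):
    """Move the creature one cell and return its new position and the score it earns."""
    dr, dc = MOVES[move]
    r = int(np.clip(row + dr, 0, grid.shape[0] - 1))  # assumed: moves are clipped at the edge
    c = int(np.clip(col + dc, 0, grid.shape[1] - 1))
    return r, c, SCORE[int(grid[r, c])]
```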

Task

* Optimize the creature's neural network to maximize the total score obtained during the 1000 steps of a "test" session.

Operation

"training" ∙∙∙∙∙∙ Start a training session. The creature learns the parameters (i.e., synaptic plasticity occurs). The creature moves at random. Reward and punishment cells do not disappear even when stepped on.

"test" ∙∙∙∙∙∙ Start a test session. The creature moves based on the obtained polity. The learning does not occur. Reward and punishment cells disappear when stepped on.

"stop" ∙∙∙∙∙∙ Stop the session. If you press the training or test button again, the session will continue from the next step.

"save" ∙∙∙∙∙∙ Download the table in the text box as a csv file. You can open the file by Excel or any text editor and can freely modify values in the table. In particular, you should optimize the values of D, E, G vectors as they characterize how the creature learns and behaves, while D, E, G are not updated during the training session.

Text box ∙∙∙∙∙∙ The table lists the parameters that characterize the creature's neural network. You can overwrite values in the table by pasting data from your editor.

Initialization ∙∙∙∙∙∙ Refresh this page using your browser's reload function.
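
If you prefer to edit the saved table offline, here is a hypothetical workflow in Python. The file name "parameters.csv", the row labels, and the vector lengths are placeholders only; match them to whatever appears in your own saved file and in the text box.

```python
import csv

# Load the saved table (file name is a placeholder).
with open("parameters.csv", newline="") as f:
    rows = list(csv.reader(f))

# Hand-tuned vectors to try; lengths and values below are placeholders.
new_values = {
    "D": [0.5, 0.5],
    "E": [0.25, 0.25, 0.25, 0.25],
    "G": [-1.0, 1.0, 0.0, 0.0],
}

# Overwrite rows whose first field matches one of the labels above
# (this assumes the table stores one labeled vector per row).
for row in rows:
    if row and row[0] in new_values:
        row[1:] = [str(v) for v in new_values[row[0]]]

with open("parameters_edited.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Paste the edited table back into the text box before starting the next training or test session.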

Interpretations of the Parameters

A ∙∙∙∙∙∙ The likelihood matrix that determines the mapping from hidden states (s_t) to observations (o_t). "a" is the concentration parameter, i.e., "initial value of A" × "insensitivity to plasticity". "qa" is the posterior belief about "a".

B ∙∙∙∙∙∙ The transition matrix that determines the mapping from previous hidden states (s_{t-1}) to the next ones (s_t). "b" and "qb" are the concentration parameter and its posterior belief, respectively.

C ∙∙∙∙∙∙ The policy matrix that determines the mapping from hidden states (s_{t-1}) to decisions (δ_t). "c" and "qc" are the concentration parameter and its posterior belief, respectively.

D ∙∙∙∙∙∙ The prior belief about hidden states. D_i means the prior expectation about the value of s(t)_i.

E ∙∙∙∙∙∙ The prior belief about decisions. E_i means the prior expectation about the value of δ(t)_i.

G ∙∙∙∙∙∙ The risk factors associated with each element of the observations; negative values correspond to preferable observations. You need to design a good risk function Γ(t) by modifying G, where Γ(t) = sig(Σ_i G_i o(t)_i).
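
The risk function above can be computed directly from G and the current observation vector. In the sketch below, sig(·) is assumed to be the logistic sigmoid, and the two-element observation encoding is invented purely to illustrate the sign convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def risk(G, o_t):
    """Gamma(t) = sig(sum_i G_i * o(t)_i): close to 1 for risky observations, close to 0 for preferable ones."""
    return sigmoid(np.dot(G, o_t))

# Illustrative only: suppose o(t) one-hot-encodes "reward seen" vs. "punishment seen".
G = np.array([-2.0, +2.0])        # negative entry -> reward observations are low-risk
print(risk(G, np.array([1, 0])))  # ~0.12 (preferable)
print(risk(G, np.array([0, 1])))  # ~0.88 (risky)
```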

Optimization Procedure

* The creature's neural network comprises three layers: a sensory layer (o_t), a middle layer (qs_t) with recurrent connections, and an output layer (qδ_t) that determines the creature's feedback response.

* The states and parameters are denoted in the variational Bayes formulation; however, they formally correspond to the network's properties. For example, qs_t and qδ_t correspond to neural activities, and qA, qB, qC parameterize synaptic strengths (qA, qB, qC are the normalized versions of qa, qb, qc). Please see the lecture slides for details, and the sketch after this list for a rough illustration.

* The neural network is characterized by (and only by) the parameters shown in the text box. Thus, you can change the network's properties arbitrarily.

* Synaptic strengths (qA, qB, qC) are updated during the training session. The values in the text box (qa, qb, qc) become the initial values (a, b, c) for the subsequent session. Thus, you need to optimize qa, qb, and qc through sufficient training so that the creature performs better in the test session.

* Threshold factors (D, E) and risk factors (G) are not updated in this simulation. However, they drastically affect performance, so you need to optimize them by hand. You can update D, E, and G by simply pasting values from your editor before starting a session.

* In summary: 1) choose good values of D, E, and G based on what you learned in the lecture; 2) train the network sufficiently to optimize qA, qB, and qC; 3) test the network's performance; 4) iterate steps 1–3 as a Plan-Do-Check-Act cycle; 5) submit your best data after sufficient training.
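
To make the above concrete, here is a compact, hypothetical sketch of how the table parameters could map onto the three-layer network, written in the style of standard variational-Bayes / active-inference updates. The function names, the softmax and log-bias forms for D and E, the risk-gated policy update, and the learning rate lr are all assumptions made for illustration; the creature's actual update rules are the ones given in the lecture slides.

```python
import numpy as np

def normalize(conc):
    """qA, qB, qC: column-wise normalization of the concentration parameters qa, qb, qc."""
    return conc / conc.sum(axis=0, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def middle_layer(o_t, qs_prev, qA, qB, D):
    """qs_t: posterior over hidden states, with D acting as a bias ("threshold") term."""
    return softmax(np.log(qA).T @ o_t + np.log(qB) @ qs_prev + np.log(D))

def output_layer(qs, qC, E):
    """qdelta_t: posterior over decisions; qs is the state posterior that C maps from."""
    return softmax(np.log(qC) @ qs + np.log(E))

def plasticity(qa, qb, qc, o_t, qs_t, qs_prev, qdelta_prev, gamma_t, lr=1.0):
    """Training-session updates: accumulate co-occurrence counts into the concentrations.
    How Gamma(t) enters is specified in the slides; here it merely gates the policy update
    so that decisions followed by low-risk observations are reinforced (an assumption)."""
    qa = qa + lr * np.outer(o_t, qs_t)      # likelihood (A) counts
    qb = qb + lr * np.outer(qs_t, qs_prev)  # transition (B) counts
    qc = qc + lr * (1.0 - 2.0 * gamma_t) * np.outer(qdelta_prev, qs_prev)  # risk-gated policy (C) counts
    return qa, qb, qc
```

After training, the normalized matrices qA, qB, qC are what the creature uses in the test session, while D, E, and G stay exactly as you set them in the text box.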