Induced Modularity and Community Detection for Functionally Interpretable Reinforcement Learning
Summary
Interpretability in reinforcement learning is crucial for ensuring that AI systems align with human values and meet related requirements such as safety, robustness and fairness. We demonstrate how the penalisation of non-local weights leads to the emergence of functionally independent modules in the policy network of a reinforcement learning agent. Through the novel application of community detection algorithms, we automatically identify these modules and verify their functional roles via direct intervention on the network weights prior to inference. This level of functional decomposition aligns closely with human decision-making frameworks, addressing the trade-off between completeness and cognitive tractability in reinforcement learning explanations.
We train a PPO agent on the dynamic obstacles gridworld, in which the agent (red) must avoid the moving obstacles and reach the goal square. The initial MLP-based policy network is visualised above, with each neuron and layer assigned linearly spaced positions along the x and y axes respectively. The input features GX and GY give the coordinates of the goal relative to the agent, while BiX and BiY give the relative coordinates of ball i.
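For concreteness, here is a minimal sketch of this kind of MLP policy over the observation [GX, GY, B1X, B1Y, ...]; the layer sizes, activation choice and number of obstacles are illustrative assumptions rather than the exact configuration used here.

```python
# Minimal sketch (not the exact network used here), assuming PyTorch.
import torch
import torch.nn as nn

N_BALLS = 3                       # assumed number of moving obstacles
OBS_DIM = 2 + 2 * N_BALLS         # goal offset plus one (x, y) offset per ball
N_ACTIONS = 4                     # left, right, up, down

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, N_ACTIONS),     # action logits consumed by the PPO actor
)

obs = torch.randn(1, OBS_DIM)     # placeholder observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()
```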
We modify training by introducing a connection cost loss, which penalises each weight in proportion to the distance between the neurons it connects. The position parameters of the neurons are also iteratively exchanged during training to further minimise the overall connection cost of the network. In our dynamic obstacles environment this results in the emergence of parallel modules for the selection of movements along the x and y axes. As shown below (right), the weights in each module can be automatically categorised based on the clustering observed in the second eigenvector of the network's adjacency matrix.
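The sketch below illustrates one way these three ingredients could fit together: a distance-weighted penalty on the weights, a greedy swap of neuron positions within each hidden layer, and a bipartition based on the second eigenvector of the adjacency matrix. The position scheme, swap schedule and eigenvalue ordering are assumptions for illustration, not the exact implementation.

```python
# A minimal sketch, assuming PyTorch and NumPy. Layer spacing, the swap schedule
# and the eigenvector ordering convention are illustrative assumptions.
import itertools

import numpy as np
import torch


def layer_positions(sizes):
    """Linearly spaced x positions within each layer; the layer index acts as y."""
    return [torch.linspace(0.0, 1.0, n) for n in sizes]


def connection_cost(weights, positions):
    """Sum over layers of |w_ij| times the distance between the neurons w_ij connects."""
    cost = torch.tensor(0.0)
    for l, W in enumerate(weights):                      # W has shape (out, in)
        dx = positions[l + 1][:, None] - positions[l][None, :]
        dist = torch.sqrt(dx ** 2 + 1.0)                 # adjacent layers one unit apart in y
        cost = cost + (W.abs() * dist).sum()
    return cost

# During training the penalty is simply added to the PPO objective, e.g.
#   loss = ppo_loss + ccost_coef * connection_cost([lin.weight for lin in linears], positions)


def swap_positions_once(weights, positions):
    """One greedy pass: exchange two x positions within a hidden layer if that lowers the cost."""
    best = connection_cost(weights, positions)
    for l in range(1, len(positions) - 1):               # keep input and output order fixed
        for i, j in itertools.combinations(range(len(positions[l])), 2):
            positions[l][[i, j]] = positions[l][[j, i]]
            new = connection_cost(weights, positions)
            if new < best:
                best = new
            else:                                        # revert an unhelpful swap
                positions[l][[i, j]] = positions[l][[j, i]]
    return positions


def weight_adjacency(weights):
    """Symmetric adjacency over all neurons, with |W_l| on the inter-layer blocks."""
    sizes = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    offs = np.cumsum([0] + sizes)
    A = np.zeros((offs[-1], offs[-1]))
    for l, W in enumerate(weights):
        blk = W.detach().abs().numpy()                   # shape (out, in)
        A[offs[l + 1]:offs[l + 2], offs[l]:offs[l + 1]] = blk
        A[offs[l]:offs[l + 1], offs[l + 1]:offs[l + 2]] = blk.T
    return A, offs


def two_modules(adjacency):
    """Bipartition neurons by the sign of the second eigenvector (here: second-largest eigenvalue)."""
    _, vecs = np.linalg.eigh(adjacency)                  # eigh sorts eigenvalues in ascending order
    return vecs[:, -2] >= 0                              # boolean module label per neuron
```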
Automatic clustering of the weights in this manner enables us to intervene on each module independently by modifying the network parameters prior to inference. The animations below show the results of replacing all parameters in module 1 (left) and module 2 (right) with -5. Both the rollouts and the action statistics show that intervening on module 1 in this way removes the ability to select up and down actions, while the agent retains the ability to move towards the goal and avoid obstacles using left and right actions. The reverse is observed when intervening on module 2.
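A minimal sketch of this kind of intervention is given below, assuming PyTorch and the per-neuron module labels and neuron offsets from the clustering step above: every weight whose input and output neurons both fall inside the chosen module is overwritten with a constant before inference. The function name is hypothetical; the constant -5 mirrors the experiment described in the text.

```python
import torch


@torch.no_grad()
def intervene_on_module(weights, offsets, module_mask, value=-5.0):
    """Replace every weight internal to one module with `value` prior to inference."""
    module_mask = torch.as_tensor(module_mask)
    for l, W in enumerate(weights):                              # W has shape (out, in)
        in_mask = module_mask[offsets[l]:offsets[l + 1]]
        out_mask = module_mask[offsets[l + 1]:offsets[l + 2]]
        sel = out_mask[:, None] & in_mask[None, :]               # module-internal connections
        W[sel] = value

# e.g. knock out one module before rolling out episodes (pass `~labels` to target the other):
#   intervene_on_module([lin.weight for lin in linears], offsets, labels)
```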
Drawing on principles from cognitive neuroscience, we propose that decomposing a decision-making network into functional modules in this manner offers a cognitively accessible and scalable framework for interpretability.