In a cart-pole system the agent actuates the cart (mounted on a rail) to get the pole (attached to the cart) to an upright configuration and balance it.
These visualizations illustrate how the VoSI metric is realized: we compare the performance of mixed-loop (MixL) execution strategies, which forgo sensory information for a horizon h and subsequently operate closed loop, with that of a closed-loop execution strategy (denoted CL). The open-loop phase of each rollout is marked with a gray indicator at the top right, and the closed-loop phase with a green indicator.
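The MixL-versus-CL comparison can be sketched on a toy problem. The snippet below is a minimal illustration only: the scalar unstable system, the noise level, the reward, and the convention that VoSI(s, h) is the closed-loop return minus the mixed-loop return are all assumptions made for this example, not the cart-pole task, the TD-MPC2 agent, or the exact definition from the main paper.

```python
import numpy as np

A = 1.1       # hypothetical unstable toy dynamics: x' = A*x + u + noise
NOISE = 0.05  # hypothetical disturbance scale


def rollout_return(x0, h, T=100, seed=0):
    """Return of a mixed-loop rollout: the agent observes x0, runs h steps
    without sensing (relying on its noise-free internal model), then operates
    closed loop. h = 0 recovers the closed-loop (CL) strategy."""
    rng = np.random.default_rng(seed)
    x, x_hat, total = x0, x0, 0.0
    for t in range(T):
        if t > h:                # closed-loop phase: sense the true state
            x_hat = x
        u = -A * x_hat           # action chosen from the agent's belief
        x = A * x + u + NOISE * rng.standard_normal()
        x_hat = A * x_hat + u    # belief propagated through the internal model
        total -= x ** 2          # reward penalizes distance from the origin
    return total


def vosi(x0, h, seeds=range(20)):
    """Assumed convention: CL return minus MixL(h) return, averaged over seeds
    (the same seed is reused for both strategies)."""
    gaps = [rollout_return(x0, 0, seed=s) - rollout_return(x0, h, seed=s)
            for s in seeds]
    return float(np.mean(gaps))


horizons = [0, 5, 20, 50]
vosi_profile = [vosi(1.0, h) for h in horizons]
```

Reusing the same seeds for both strategies means the gap at h = 0 is exactly zero, so any nonzero value in the profile is attributable to the open-loop phase rather than to sampling noise.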
Media 1(a). These trajectories start from a state where the agent has built up some momentum to begin the swingup phase. When the agent commits to an extended horizon of open-loop execution after observing this state, the momentum built up is insufficient to bring the pole to an upright configuration, resulting in a loss in performance.
Media 1(b). These trajectories start from a state where the agent observes a balanced, upright configuration of the pole. A performant agent does not actuate the cart much at this point, so performance is retained well for horizons up to 50 timesteps. Beyond that, performance degrades because the agent fails to actuate the cart when doing so is critical to re-balance the pole.
Media 1(c). Here the agent observes a state along the swingup phase where it has built up sufficient momentum for the pole to reach an upright configuration. The agent suffers some degradation in performance when operating open loop for a horizon of over 80 timesteps, as the pole overshoots the desired upright configuration.
Media 1(d). This is an illustration of trajectories starting from a state in the reset distribution where the cart and pole have low velocities. At this state, the agent can afford extended open-loop action sequences that put the agent in states that correspond to early regions of the swingup phase without experiencing any performance degradation.
Media 2. Visualization of the VoSI profiles over the course of a closed-loop execution starting from the reset distribution. Towards the end of the rollout the VoSI profiles appear identical, suggesting that the agent can afford to sense periodically, about once every 50 timesteps (0.5 seconds).
Following the procedure described in the main paper, for each state s we have a profile VoSI(s, h) that indicates the value of sensory information revealing the state of the environment h steps from now, for a performant TD-MPC2 agent. We visualize a low-dimensional summary of the VoSI profiles at different states to understand the global structure of how VoSI profiles vary across states.
We first project the VoSI profiles obtained for all evaluated states onto a single component, i.e. (#S x H -> #S x 1), to enable a low-dimensional visualization of how VoSI profiles change over states. We use PCA for the projection, which captures 84.05% of the variance. Figure 1 below shows the raw VoSI(s, h) and the VoSI(s, h) reconstructed from the 1D projection.
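A minimal sketch of this projection, assuming the profiles are stacked into a (#S x H) array. The data here is synthetic and stands in for the real VoSI profiles, so the variance it captures is not the figure reported above. PCA is computed with a plain SVD of the mean-centered matrix.

```python
import numpy as np

# Synthetic stand-in for the (#S x H) matrix of VoSI profiles (assumption:
# each profile is a scaled saturating curve plus a little noise).
rng = np.random.default_rng(0)
num_states, H = 200, 100
h = np.arange(H)
scale = rng.uniform(0.0, 1.0, size=(num_states, 1))
vosi = scale * (1.0 - np.exp(-h / 30.0)) + 0.02 * rng.standard_normal((num_states, H))

# PCA via SVD of the mean-centered data.
mean_profile = vosi.mean(axis=0)
centered = vosi - mean_profile
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project (#S x H) -> (#S,) onto the first principal direction.
c1 = centered @ Vt[0]

# Fraction of total variance captured by the first component.
explained = S[0] ** 2 / np.sum(S ** 2)

# Reconstruct (#S x H) profiles from the 1D projection alone.
reconstructed = mean_profile + np.outer(c1, Vt[0])
```

The ratio `explained` equals one minus the relative squared reconstruction error, which is why a high captured variance and a visually faithful reconstruction go together.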
Figure 1. Illustration of the amount of structure preserved by the 1D-projection of VoSI profiles.
The reconstruction captures the characteristic features of the raw VoSI profiles and therefore enables a simple 2D visualization (Figure 2) outlining how the VoSI profiles (characterized by the first principal component C1) change over states, represented as [cart.position, pole.angle, cart.velocity, pole.angular_velocity].
Figure 2(a). Reconstructed VoSI profiles from component C1 as it is linearly varied from the smallest value (mapped to 0.0) observed to the largest (mapped to 1.0).
Figure 2(b). The evaluated states and the principal component C1 of the VoSI profile at each state, shown as points in 2D visualizations of the state variables.
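The sweep in Figure 2(a) can be sketched as follows: C1 is varied linearly between its smallest and largest observed values (mapped to 0.0 and 1.0), and a profile is reconstructed at each level. The function name and the synthetic input are hypothetical and serve only as illustration.

```python
import numpy as np


def sweep_first_component(vosi, num_levels=5):
    """Reconstruct VoSI profiles as C1 moves linearly from its minimum
    (level 0.0) to its maximum (level 1.0) observed value."""
    mean_profile = vosi.mean(axis=0)
    centered = vosi - mean_profile
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    direction = Vt[0]             # first principal direction (sign is arbitrary)
    c1 = centered @ direction     # one C1 score per state
    levels = np.linspace(0.0, 1.0, num_levels)
    c1_values = c1.min() + levels * (c1.max() - c1.min())
    profiles = mean_profile + np.outer(c1_values, direction)
    return levels, profiles


# Synthetic stand-in for the (#S x H) matrix of VoSI profiles.
rng = np.random.default_rng(0)
h = np.arange(100)
scale = rng.uniform(0.0, 1.0, size=(200, 1))
vosi = scale * (1.0 - np.exp(-h / 30.0)) + 0.02 * rng.standard_normal((200, 100))

levels, profiles = sweep_first_component(vosi)
```

Because the sign of a principal direction is arbitrary, which end of the sweep corresponds to steep degradation has to be read off the reconstructed profiles rather than assumed.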
From Figure 2(b), we observe an interesting clustering of states. VoSI degradation is steeper at S1, where the agent has built up some momentum and is ready to swing up (Media 1(a)), while degradation is fairly low around states in the starting distribution, S5 (Media 1(d)). In other regions, such as S3 and S4, the degradation is tied to the points at which the pole is expected to reach the upright position (Media 1(c)). Once the pole is upright and the agent is in the balancing phase of the task (S2), the VoSI exhibits roughly periodic characteristics, where it might be acceptable to obtain sensory readings revealing the state of the environment every 50 timesteps (0.5 seconds) (Media 1(b)).
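One way to read a sensing interval off a single profile is to find the longest open-loop horizon whose degradation stays within a tolerance. The sketch below assumes that larger VoSI values mean larger performance loss, that index h of the profile corresponds to horizon h, and that one timestep lasts 0.01 s (implied by 50 timesteps being 0.5 seconds). The helper name, the tolerance, and the example profile are hypothetical.

```python
import numpy as np

DT = 0.01  # seconds per timestep, implied by 50 timesteps = 0.5 seconds


def max_affordable_horizon(profile, tol):
    """Largest horizon h such that profile[0..h] all stay at or below tol.
    Returns -1 if even profile[0] exceeds the tolerance."""
    exceeds = np.flatnonzero(np.asarray(profile) > tol)
    if exceeds.size == 0:
        return len(profile) - 1   # tolerance is never exceeded
    return int(exceeds[0]) - 1


# Hypothetical profile: no degradation up to 50 steps, then a linear increase.
h = np.arange(100)
profile = np.maximum(0.0, (h - 50) * 0.01)

horizon = max_affordable_horizon(profile, tol=0.045)
interval_seconds = horizon * DT
```

The tolerance is a design choice: a tighter value yields more frequent sensing, while a looser one trades some performance for longer open-loop phases.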