Iterative Best Response
We consider double integrator dynamics in the following simulations. Since the number of decision variables grows quadratically with the time horizon under the disturbance-history feedback policy parametrization, we set the horizon to N = 10. We also observed that IBR diverges when the desired mean states of the agents are nonzero and differ from each other; this can be seen easily from a 1-D example in which each player's state and input are scalar. We therefore consider cases where the desired distribution of each player is a zero-mean Gaussian with a different covariance.
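For concreteness, the following is a minimal sketch of this setup: a discretized double integrator together with a disturbance-history feedback policy. The time step, matrix names, and the control helper are illustrative assumptions, not the exact implementation used in the experiments; the sketch mainly shows why the number of policy parameters grows quadratically in N.

```python
import numpy as np

# Discrete-time double integrator (planar position/velocity); dt is an
# assumed value, N = 10 matches the horizon used in the experiments.
dt, N = 0.1, 10
A = np.array([[1.0, 0.0, dt,  0.0],
              [0.0, 1.0, 0.0, dt ],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[dt**2 / 2, 0.0],
              [0.0, dt**2 / 2],
              [dt,  0.0],
              [0.0, dt ]])
nx, nu = A.shape[0], B.shape[1]

# Disturbance-history feedback: u_k = k_k + sum_{j < k} K_{k,j} w_j.
# The number of K_{k,j} blocks is N(N-1)/2, i.e. quadratic in N, which is
# why we keep the horizon short.
k_ff = np.zeros((N, nu))          # feedforward terms
K = np.zeros((N, N, nu, nx))      # K[k, j] acts on past disturbance w_j

def control(k, k_ff, K, w_hist):
    """Evaluate the policy at step k given realized disturbances w_0..w_{k-1}."""
    u = k_ff[k].copy()
    for j in range(k):
        u += K[k, j] @ w_hist[j]
    return u
```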
In this experiment, we first examine the convergence of the policies. To measure convergence, we compute the difference between the policies at consecutive iterations of the algorithm. In the following plots, we show the 2-norm of the difference in the matrix variables K and L. The plots show that the policy differences for both players converge after 200 iterations; however, the difference is not monotonic, as can be clearly seen from the plots on the right.
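A minimal sketch of this convergence metric is given below, assuming the policy parameters K and L of each player are stacked into 2-D block matrices (the names and shapes are illustrative). At a fixed point of iterative best response both quantities are zero.

```python
import numpy as np

def policy_change(K_prev, K_curr, L_prev, L_curr):
    """2-norm (largest singular value) of the policy change between
    consecutive IBR iterations, computed separately for K and L."""
    dK = np.linalg.norm(K_curr - K_prev, ord=2)
    dL = np.linalg.norm(L_curr - L_prev, ord=2)
    return dK, dL
```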
We also show the trajectory of the state distribution (on the left), along with the convergence of the cost of player u (in the middle) and of player v (on the right). In this example, the desired covariances of both players are smaller than the terminal covariance in the positive definite sense, because the covariance of the noise acting on the system makes the desired covariances unreachable for both players. The objective of both players is therefore to minimize a weighted function of the terminal covariance. In this setting, both the policies and the cost of player u converge; however, the cost of player v oscillates between 2200 and 2600. At first glance this looks like an error, since the policy appears to converge, but it occurs because small changes in the policy cause large deviations in the cost.
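The sketch below illustrates how a terminal-covariance cost of this kind can be evaluated. For clarity it propagates the covariance under a simple state-feedback policy u_k = K_k x_k, and the Frobenius-norm penalty is only one possible "weighted function of the terminal covariance"; both are assumptions, not necessarily the forms used in our experiments.

```python
import numpy as np

def terminal_covariance(A, B, K_seq, Sigma0, W):
    """Propagate Sigma_{k+1} = (A + B K_k) Sigma_k (A + B K_k)^T + W
    over the horizon and return the terminal covariance."""
    Sigma = Sigma0
    for K_k in K_seq:
        Acl = A + B @ K_k                 # closed-loop dynamics at step k
        Sigma = Acl @ Sigma @ Acl.T + W   # additive process-noise covariance
    return Sigma

def covariance_cost(Sigma_N, Sigma_des):
    """One illustrative penalty on the terminal-covariance mismatch."""
    return np.linalg.norm(Sigma_N - Sigma_des, ord='fro') ** 2
```

This recursion also makes the unreachability claim above concrete: since the last step adds W, the terminal covariance satisfies Sigma_N ⪰ W regardless of the feedback gains, so a desired covariance below the noise covariance cannot be attained.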
In the setting whose results are presented below, we lowered the intensity of the noise process (smaller noise covariance) so that the desired covariances of both players are reachable. We see that even though the policies converge after 400 iterations, both objective values oscillate between an upper and a lower value. Although we do not observe convergence to a singleton, the limit inferior and limit superior of the objective values converge to fixed lower and upper bounds, respectively.
In the next example, the initial covariance equals the desired covariance of player u, while the desired covariance of player v is larger. In this setting, we again set the noise covariance high so that the desired covariances are unreachable. In this case, the policies of both players converge within 250 iterations, and the costs converge to a Nash equilibrium. We also observed that the converged values are much smaller for both players (around 75-100) than in the previous examples.
In this final example, the initial and desired covariances are the same as in the previous example, but the noise covariance is lowered so that the desired covariances are reachable in finite time. However, since the players' objectives oppose each other, as in a zero-sum game, the objective values do not converge to a singleton; instead we observe the same phenomenon as in the second example. The policies also take longer to converge than in the other problems (1200 iterations). The final covariance is smaller than the desired covariance of player v and larger than the desired covariance of player u in the positive definite sense.
Discussion of Results: All in all, we observe that a naive implementation of iterative best response does not yield good results for this problem: the objective values are high, convergence is slow, and the objective values oscillate between two levels. However, this method provides more flexibility, since additional noise can be injected into the system through the disturbance-feedback parametrization.
One method to alleviate the convergence problems is to limit the change in the policy at each iteration, so that the iterative best response algorithm behaves like gradient descent; it is known that, under certain conditions, gradient descent and its variants converge to Nash equilibria in smooth static games.
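A minimal sketch of such a damped update is shown below. It assumes a routine best_response(player, opponent_policy) (hypothetical, not shown) that returns one player's exact best-response policy; blending it with the current policy limits the per-iteration change.

```python
def damped_ibr_step(K_u, K_v, best_response, alpha=0.1):
    """One damped IBR iteration: compute both best responses, then take
    only a small step toward each, rather than jumping all the way."""
    K_u_br = best_response('u', K_v)   # best response of player u to K_v
    K_v_br = best_response('v', K_u)   # best response of player v to K_u
    # A small alpha bounds the policy change per iteration, making the
    # iteration resemble a gradient-style update for each player.
    K_u_next = (1 - alpha) * K_u + alpha * K_u_br
    K_v_next = (1 - alpha) * K_v + alpha * K_v_br
    return K_u_next, K_v_next
```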
Iterative Linear Quadratic Gaussian
We consider double integrator dynamics for the following simulations.
1. The control cost is higher for Player 1, and the desired means and covariances differ.
In this simulation, the state is steered towards the area in between the two desired distributions, but slightly closer to the desired distribution of Player 2 since his control cost is smaller. The algorithm converges after only 5 iterations.
2. The desired distributions are now symmetric (same control cost and covariance, symmetric means).
In this case, the mean of the state stays the same, and the system drives the covariance towards the common desired covariance of both players.
3. The desired distributions are now identical (same control cost and mean and covariance).
In this case, the game is fully cooperative (similar to a single-agent system), and the system drives the state towards the common desired distribution of both players.
4. Player 1's desired distribution is now the same as the initial state distribution (same mean and covariance).
In this case, Player 1 wants to force a zero control input to stay at the initial state. Player 2 steers the system towards his desired mean, and the state covariance takes a shape closer to that of Player 2's desired covariance.
Discussion of Results: This method correctly steers the system towards the desired solution corresponding to a Nash equilibrium between the two players. With the double integrator dynamics, it converges quickly towards the solution. The complexity of the iterative LQG method is linear in the number of time steps, whereas that of the IBR method is quadratic in the number of time steps.
Further research can be done on high-dimensional systems (state dimension higher than 4 and control dimension higher than 2) to analyze the convergence, the convergence time, and the impact of the initial guess of the control input on the final solution.