To turn the problem into a finite-dimensional optimization problem, we first concatenate the states, inputs, and noise vectors as shown below:
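A sketch of the concatenation, assuming two-player linear dynamics of the form $x_{t+1} = A x_t + B u_t + D v_t + w_t$ over a horizon $T$ (the dynamics and symbol names are assumptions, not taken verbatim from the source):
\[
\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_T \end{bmatrix}, \qquad
\mathbf{u} = \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{T-1} \end{bmatrix}, \qquad
\mathbf{v} = \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{T-1} \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{T-1} \end{bmatrix}.
\]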
Using the linearity of the dynamics, we can write the concatenated state vector in terms of the initial state and the concatenated input and noise vectors as follows:
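Under the assumed dynamics above, the concatenated state takes the standard affine form (block-matrix names assumed):
\[
\mathbf{x} = \mathbf{A} x_0 + \mathbf{B}_u \mathbf{u} + \mathbf{B}_v \mathbf{v} + \mathbf{D} \mathbf{w},
\]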
where the matrices in the equation above are defined as follows:
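For time-invariant dynamics these block matrices would be (a sketch; the paper's exact definitions may differ):
\[
\mathbf{A} = \begin{bmatrix} I \\ A \\ A^2 \\ \vdots \\ A^T \end{bmatrix}, \qquad
\mathbf{B}_u = \begin{bmatrix} 0 & \cdots & \cdots & 0 \\ B & 0 & & \vdots \\ AB & B & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ A^{T-1}B & \cdots & AB & B \end{bmatrix},
\]
with $\mathbf{B}_v$ and $\mathbf{D}$ built the same way from $D$ and the identity, respectively.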
As the control policy, we pick a disturbance feedback policy, a common policy structure for linear control problems. This policy class has a one-to-one correspondence with the class of state history feedback policies. The reason we pick disturbance feedback is that it lets us express the covariance of the concatenated state vector as a convex quadratic function of the policy's decision variables. The disturbance feedback policy is defined precisely as follows:
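A sketch of the policy, assuming each player feeds back the past disturbances; the feedforward terms $\bar u_t$, $\bar v_t$ and gains $P^u_{t,k}$, $P^v_{t,k}$ are assumed notation:
\[
u_t = \bar u_t + \sum_{k=0}^{t-1} P^u_{t,k}\, w_k, \qquad
v_t = \bar v_t + \sum_{k=0}^{t-1} P^v_{t,k}\, w_k,
\]
so that each input at time $t$ depends only on disturbances $w_0,\dots,w_{t-1}$ that have already been realized.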
Furthermore, the concatenated input vectors of both players can be written in terms of the disturbance feedback policy parameters as follows:
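Stacking the per-step policies gives the compact form (again assumed notation; $P_u$ and $P_v$ are strictly block lower triangular to enforce causality):
\[
\mathbf{u} = \bar{\mathbf{u}} + P_u \mathbf{w}, \qquad
\mathbf{v} = \bar{\mathbf{v}} + P_v \mathbf{w}.
\]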
Since the policies are now represented by a finite number of parameters, we can express the cost function of each player in terms of these parameters. To do so, we first need the mean and the covariance of the concatenated state vector. We substitute the above input expressions for u and v into the concatenated state equation and take the expectation of both sides:
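Carrying out this substitution under the assumed notation, and assuming the noise is zero mean:
\[
\mathbf{x} = \mathbf{A} x_0 + \mathbf{B}_u \left(\bar{\mathbf{u}} + P_u \mathbf{w}\right) + \mathbf{B}_v \left(\bar{\mathbf{v}} + P_v \mathbf{w}\right) + \mathbf{D} \mathbf{w},
\]
\[
\mathbb{E}[\mathbf{x}] = \mathbf{A} x_0 + \mathbf{B}_u \bar{\mathbf{u}} + \mathbf{B}_v \bar{\mathbf{v}}.
\]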
Then, the mean of the state is subtracted from x to obtain the deviation from the mean, and this deviation is used to compute the covariance:
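Under the same assumptions, the deviation depends only on the noise and the feedback gains:
\[
\mathbf{x} - \mathbb{E}[\mathbf{x}] = \left(\mathbf{B}_u P_u + \mathbf{B}_v P_v + \mathbf{D}\right) \mathbf{w},
\]
\[
\Sigma_x = \left(\mathbf{B}_u P_u + \mathbf{B}_v P_v + \mathbf{D}\right) \Sigma_w \left(\mathbf{B}_u P_u + \mathbf{B}_v P_v + \mathbf{D}\right)^{\top},
\]
where $\Sigma_w = \mathbb{E}[\mathbf{w}\mathbf{w}^\top]$.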
Here Y_1 and Y_0 are given as follows, and both are affine functions of the policy parameters of both players. Since the covariance of the state is a convex quadratic function of Y_0 and Y_1, and composing a convex quadratic function with an affine map preserves convexity, the covariance is a convex function of the policy parameters.
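One reconstruction consistent with the text, writing the state as an affine function of the noise (these definitions are assumed, not quoted from the source):
\[
Y_0 = \mathbf{A} x_0 + \mathbf{B}_u \bar{\mathbf{u}} + \mathbf{B}_v \bar{\mathbf{v}}, \qquad
Y_1 = \mathbf{B}_u P_u + \mathbf{B}_v P_v + \mathbf{D},
\]
so that $\mathbf{x} = Y_0 + Y_1 \mathbf{w}$ and $\Sigma_x = Y_1 \Sigma_w Y_1^\top$. The second moment $\mathbb{E}[\mathbf{x}\mathbf{x}^\top] = Y_0 Y_0^\top + Y_1 \Sigma_w Y_1^\top$ is quadratic in both $Y_0$ and $Y_1$, and both are affine in $(\bar{\mathbf{u}}, P_u, \bar{\mathbf{v}}, P_v)$.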
The mean and the covariance of the control inputs u and v can similarly be expressed in terms of the policy parameters (the mean affinely, the covariance convex-quadratically); thus the control cost terms can also be represented as convex functions of the policy parameters:
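For instance, for player u with an assumed quadratic cost weight $R \succeq 0$ (a hypothetical symbol used here for illustration):
\[
\mathbb{E}[\mathbf{u}] = \bar{\mathbf{u}}, \qquad
\Sigma_u = P_u \Sigma_w P_u^\top,
\]
\[
\mathbb{E}\!\left[\mathbf{u}^\top R\, \mathbf{u}\right] = \bar{\mathbf{u}}^\top R\, \bar{\mathbf{u}} + \operatorname{tr}\!\left(R\, P_u \Sigma_w P_u^\top\right),
\]
which is convex quadratic in $(\bar{\mathbf{u}}, P_u)$.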
If we plug the expressions for the mean and covariance of the state and the control inputs into the objective function of each player, we obtain the objectives in terms of the policy parameters; hence the problem is turned into a static smooth game over the policy parameters of the two players. If one player's policy is fixed, the resulting objective function for the other player can be shown to be a difference of convex functions. This allows us to use the convex-concave procedure to solve for local optimality.
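A sketch of the resulting structure and of one convex-concave step, with the other player's policy held fixed ($\theta$ collects one player's parameters; $f$ and $g$ convex, both assumed notation):
\[
J(\theta) = f(\theta) - g(\theta), \qquad
\theta^{k+1} \in \arg\min_{\theta}\; f(\theta) - g(\theta^k) - \nabla g(\theta^k)^\top \left(\theta - \theta^k\right),
\]
i.e., the concave part is linearized at the current iterate and the resulting convex program is solved repeatedly until convergence.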
Algorithm (Iterative Best Response with disturbance feedback policy):
1. Initialize policy parameters for both u and v.
2. Fix v and solve the difference-of-convex program for u.
3. Fix u at the new solution from step 2 and solve the difference-of-convex program for v.
4. If |u_new - u_prev| < \delta and |v_new - v_prev| < \delta, return (u_new, v_new); else go back to step 2.
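A minimal Python sketch of this loop, assuming hypothetical solver routines solve_dc_for_u and solve_dc_for_v that run the convex-concave procedure for one player with the other player's parameters fixed (all names are illustrative, not from the source):

```python
import numpy as np

def iterative_best_response(theta_u, theta_v, solve_dc_for_u, solve_dc_for_v,
                            delta=1e-6, max_iters=100):
    """Alternate difference-of-convex best responses until both
    players' parameter updates fall below the tolerance delta."""
    for _ in range(max_iters):
        # Step 2: fix v's policy and solve the DC program for u,
        # warm-started at the current iterate.
        theta_u_new = solve_dc_for_u(theta_u, theta_v)
        # Step 3: fix u at its new solution and solve the DC program for v.
        theta_v_new = solve_dc_for_v(theta_u_new, theta_v)
        # Step 4: stop once neither player's parameters moved by more
        # than delta; otherwise iterate again from step 2.
        if (np.linalg.norm(theta_u_new - theta_u) < delta and
                np.linalg.norm(theta_v_new - theta_v) < delta):
            return theta_u_new, theta_v_new
        theta_u, theta_v = theta_u_new, theta_v_new
    return theta_u, theta_v  # no convergence within max_iters
```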