In this section, we take one system (ACC) from our benchmark to illustrate the complete process of how we select the system, design the DRL controller, evaluate its performance, and discuss the findings.
This system is released by MathWorks in the Model Predictive Control Toolbox. It represents a classic scenario in autonomous driving, namely adaptive cruise control. The ego vehicle is equipped with a sensor, such as radar, that measures the distance to the preceding vehicle in the same lane (the lead car), d_rel. The sensor also measures the relative velocity of the lead car, v_rel. The ACC system operates in the following two modes:
Speed control: The ego car travels at a driver-set speed.
Spacing control: The ego car maintains a safe distance from the lead car.
ACC environment
We include this system in our benchmark because ACC is a commonly used system in design and testing, and it reflects the complexity of industrial-level CPS: it has two control objectives, and the controller needs to balance its behaviours to satisfy both goals simultaneously. In addition, as ACC is released by MathWorks, the complete Simulink model and detailed documentation are available to analyze the system requirements and environment configurations. Thus, this system meets all the selection criteria we set and is representative of a specific industrial domain, automated driving.
The ACC system decides which mode to use based on real-time radar measurements. For example, if the lead car is too close, the ACC system switches from speed control to spacing control. Similarly, if the lead car moves farther away, the ACC system switches from spacing control back to speed control. In other words, the ACC system makes the ego car travel at the driver-set speed as long as a safe distance is maintained.
The following rules describe the ACC system's operating behaviour (a sketch of the switching logic is given after the rules):
If d_rel >= d_safe, then speed control mode is active. The control goal is to track the driver-set velocity, v_set.
If d_rel < d_safe, then spacing control mode is active. The control goal is to maintain the safe distance, d_safe.
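To make the switching rule concrete, the following minimal MATLAB sketch shows how the active mode and its control target could be derived from d_rel and d_safe. The function name and the use of the lead car velocity as the spacing-control target are our own illustration, not part of the released model.

```matlab
% Illustrative ACC mode selection (our own sketch, not the released Simulink logic).
% d_rel  : measured relative distance to the lead car
% d_safe : computed safe distance
% v_set  : driver-set velocity
% v_lead : lead car velocity (target while restoring the safe distance)
function [mode, v_target] = selectAccMode(d_rel, d_safe, v_set, v_lead)
    if d_rel >= d_safe
        mode = "speed control";      % track the driver-set velocity
        v_target = v_set;
    else
        mode = "spacing control";    % follow the lead car to recover d_safe
        v_target = v_lead;
    end
end
```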
With a complete understanding of system goals, we can start to deploy the DRL controllers to replace the original MPC controller.
Similar to the original MPC controller, the DRL controllers take the following five environment values to build the observation state, which describes the system conditions:
Driver-set velocity: v_set
The relative distance between two cars: d_rel
The relative velocity between two cars: v_rel
The real-time safe distance: d_safe
The ego car velocity: v_ego
The safe distance between the lead car and the ego car is a function of the ego car velocity, v_ego: d_safe = d_default + T_gap * v_ego, where d_default is the standstill default spacing and T_gap is the time gap between the vehicles. These two parameters are taken from the released documentation. A sketch of how the observation state and d_safe are assembled is given below.
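The following MATLAB sketch illustrates one control step. The ordering of the observation components and all numeric values are assumptions for illustration only; the actual d_default and T_gap come from the released documentation.

```matlab
% Example measurements at one control step (assumed values for illustration).
v_set = 30;   % driver-set velocity [m/s]
v_ego = 25;   % ego car velocity [m/s]
v_rel = -2;   % relative velocity of the lead car [m/s]
d_rel = 50;   % measured relative distance [m]

d_default = 10;   % standstill default spacing [m] (assumed example value)
T_gap     = 1.4;  % time gap between vehicles [s] (assumed example value)

% Safe distance grows with the ego car velocity: d_safe = d_default + T_gap * v_ego
d_safe = d_default + T_gap * v_ego;          % 10 + 1.4 * 25 = 45 m

% Five-dimensional observation state fed to the DRL controller (ordering assumed).
observation = [v_set; d_rel; v_rel; d_safe; v_ego];
```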
The reward function is designed following the same strategies described earlier:
Reward function in ACC
Here, v_diff is the velocity difference between the current ego car velocity and the target velocity, and a* is the action value from the last time step.
R_neg penalizes the agent when the ego car cannot maintain a safe distance from the lead car; R_pos rewards the agent when v_diff is smaller than a threshold. R_t is the final reward at time step t, based on both R_pos and R_neg.
Unlike MPC, the DRL controller uses a reward function to evaluate the agent's performance from two aspects: velocity and distance. While the safe distance d_safe is maintained, the ego car should approach the cruise velocity v_set; otherwise, it follows the lead car's velocity to avoid a collision. The reward function penalizes the agent for violating the safe-distance requirement and rewards the agent based on how close the ego car velocity v_ego is to the target speed.
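The exact reward terms are given in the figure above; the MATLAB sketch below only mirrors the structure described in the text. The threshold and the reward/penalty magnitudes are placeholders, and the action-smoothness term based on a* is omitted.

```matlab
% Illustrative reward structure for ACC (placeholder constants; see the figure
% above for the actual definition used in our experiments).
function R_t = accReward(v_ego, v_target, d_rel, d_safe)
    v_diff = abs(v_ego - v_target);   % velocity tracking error

    % R_neg: penalty when the ego car violates the safe distance.
    if d_rel < d_safe
        R_neg = -1;                   % placeholder penalty value
    else
        R_neg = 0;
    end

    % R_pos: reward when the velocity error is below a threshold.
    v_threshold = 1;                  % placeholder threshold [m/s]
    if v_diff < v_threshold
        R_pos = 1;                    % placeholder reward value
    else
        R_pos = 0;
    end

    % Final reward at time step t combines both terms
    % (the a*-based action term from the figure is omitted in this sketch).
    R_t = R_pos + R_neg;
end
```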
The next step is to configure the DNN structures inside the DRL agents. Since we deploy multiple types of agents, as stated in the AI-CPS Construction section, and all of them are actor-critic based, we need to configure the DNN structures for each of them. We test different structures with various numbers of layers and neurons and find that an overly simple network cannot approximate the state and action values correctly, while an overly complicated network increases the training time and decreases the computational efficiency. We use the following settings as the final DNN configurations (a MATLAB sketch of these networks follows the listing):
Critic:
State path: featureInputLayer(5) -> fullyConnectedLayer(64) -> reluLayer -> fullyConnectedLayer(128)
Action path: featureInputLayer(1) -> fullyConnectedLayer(128)
Common path: additionLayer(2) -> reluLayer -> fullyConnectedLayer(256) -> reluLayer -> fullyConnectedLayer(64) -> reluLayer -> fullyConnectedLayer(1)
Actor:
featureInputLayer(5) -> fullyConnectedLayer(64) -> reluLayer -> fullyConnectedLayer(128) -> reluLayer -> fullyConnectedLayer(256) -> reluLayer -> fullyConnectedLayer(64) -> reluLayer -> fullyConnectedLayer(1) -> tanhLayer -> scalingLayer
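For reference, the following MATLAB sketch builds the critic and actor networks listed above with Deep Learning Toolbox layers. The layer names, the port wiring of the addition layer, and the scaling factor of the output layer are our own choices for illustration.

```matlab
% Sketch of the critic network: two input paths joined by an addition layer.
obsPath = [
    featureInputLayer(5,'Name','observation')
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(128,'Name','obsOut')];
actPath = [
    featureInputLayer(1,'Name','action')
    fullyConnectedLayer(128,'Name','actOut')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer
    fullyConnectedLayer(256)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1,'Name','QValue')];

criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'obsOut', 'add/in1');
criticNet = connectLayers(criticNet, 'actOut', 'add/in2');

% Sketch of the actor network: maps the 5-dimensional observation to one action.
actorNet = [
    featureInputLayer(5,'Name','observation')
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(256)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)
    tanhLayer
    scalingLayer('Scale',3)];   % scale tanh output to the actuator range (assumed value)
```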
The learning rate of the critic is 1e-3 and that of the actor is 1e-4, since making the actor's learning rate smaller than the critic's helps to speed up the policy convergence.
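A minimal sketch of how these learning rates and the training budget could be expressed with Reinforcement Learning Toolbox options is given below. Only the two learning rates and the 300-step episode length (30 s simulation at a 0.1 s sample time, discussed below) come from the text; every other value is an assumed placeholder, and agent-level settings such as the look-ahead window mentioned next are not shown.

```matlab
% Optimizer options for the critic and actor (learning rates from the text).
criticOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actorOpts  = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);

% Training options: 300 steps per episode = 30 s simulation / 0.1 s sample time.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',500, ...                      % assumed episode budget
    'MaxStepsPerEpisode',300, ...
    'ScoreAveragingWindowLength',20, ...        % assumed averaging window
    'StopTrainingCriteria','AverageReward', ...
    'StopTrainingValue',260, ...                % assumed stopping threshold
    'Verbose',false, ...
    'Plots','training-progress');
```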
The agent is further configured with a 20-step look-ahead window. The training progress is shown below.
ACC training process
ACC sample outputs
From the training graph above, we find that as the number of training episodes increases, i.e., with more iterations, the average reward and the average number of steps also increase. This indicates that our agent is continuously approaching the optimal policy, which obtains the maximum long-term reward and the maximum number of steps. The simulation time is 30 s with a 0.1 s sample time, so the maximum number of steps in a single episode is 300. The agent reaches the maximum step count at about the 130th episode and keeps optimizing the policy to obtain more reward while the step count stays at 300. From the training process, we notice that the reward function correctly guides the agent toward the optimal action strategy while following the safety requirements we stated at the beginning.
Refer to Sections RQ1, RQ2, and RQ3.