In this section, we provide some background knowledge about DRL, and then we introduce a detailed workflow of how DRL controllers are designed and trained to construct the AI-enabled CPS benchmark. Specifically, we introduce which DRL agents we have used, how the reward function is designed, the structure and configuration of the DRL controllers, and the training procedure. We also discuss some findings and insights on DRL controller design from our experiments at the end of this section.
Deep reinforcement learning (DRL) is an advanced approach to reinforcement learning (RL) that combines it with the strengths of deep learning.
In reinforcement learning, the computer learns to achieve a target task by interacting with an unknown system (model-free). The agent learns and updates its decision-making policy by trial and error. DRL extends the boundary of RL by incorporating neural networks to represent the policy function or other learned functions, so that DRL is capable of controlling systems with high-dimensional input states.
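For reference, the objective that the agent tries to maximize can be written as the expected discounted return (standard RL notation, not specific to our benchmark), where $\gamma$ is the discount factor and $r(s_t, a_t)$ is the reward received at step $t$:
$$ J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad \gamma \in [0, 1). $$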
Figure: DRL structure.
As many industrial-level CPS have complex system dynamics and functional requirements, DRL has been considered a state-of-the-art method for solving the control problems in such plants. Many researchers in academia and industry have deployed DRL controllers in various applications, such as robotics, computer vision, aerospace, healthcare, and autonomous driving. These works demonstrate that DRL controllers can perform as well as, or even better than, traditional controllers, at a lower cost in design and modeling.
In our work, the construction of a DRL controller consists of five steps: environment analysis, reward design, agent configuration, training, and deployment. DRL controller construction is an iterative process: solving a problem that occurs at a later stage may require tracing its cause back to any of the earlier stages.
In the following parts, we will introduce each stage in detail.
Figure: DRL controller construction workflow.
In the environment analysis step, we need to determine how the DRL agent interacts with the environment and what the operational requirements of the environment are. Since we already have CPS with built-in traditional controllers, our first task is to find out what inputs the traditional controllers take and what types of outputs they generate.
Usually, the DRL controllers take the same inputs as the traditional controllers, but they may require less information, because some inputs may be linearly related to others. Since we use MATLAB Simulink as our modeling platform, it is important that the DRL controllers produce the same types of outputs as the traditional controllers.
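As a minimal sketch of this step in MATLAB (Reinforcement Learning Toolbox), the observation and action specifications mirror the traditional controller's inputs and outputs; the model name, block path, signal dimensions, and limits below are hypothetical placeholders, not our actual benchmark settings.

% Observation: the same signals the traditional controller reads (illustrative)
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
% Action: the same type and range as the traditional controller's output
actInfo = rlNumericSpec([1 1], 'LowerLimit', -1, 'UpperLimit', 1);
actInfo.Name = 'control action';
% Wrap the Simulink model (hypothetical name and block path) as an RL environment
env = rlSimulinkEnv('plant_model', 'plant_model/RL Agent', obsInfo, actInfo);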
The design of the reward function is arguably the most critical step in building a DRL controller, since it defines the basic operation logic and the system goals. A well-designed reward function guides the agent to maximize the expectation of the long-term reward. Before designing the actual function, we need to investigate the goals of the original system and any further safety requirements.
A system may have primary and secondary goals, so we need to figure out the priority of each task and the corresponding conditions. In addition, some systems may have safety or other requirements on specific environment values, or some values may be subject to constraints. We need to take all these tasks, constraints, and requirements into account and embed them into the reward function to achieve an optimal design.
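A minimal sketch of such a reward function is given below, assuming a hypothetical tracking task with one safety bound; the weights and the penalty value are purely illustrative and must be tuned per system.

% Reward sketch: the primary goal is a small tracking error, a secondary term
% penalizes large control effort, and a safety bound adds a heavy penalty.
function r = computeReward(trackError, controlEffort, output, outputMax)
    r = -trackError^2 - 0.01*controlEffort^2;  % task terms (illustrative weights)
    if abs(output) > outputMax                 % safety requirement violated
        r = r - 100;                           % heavy penalty (illustrative value)
    end
end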
An agent in DRL receives observations (states) and rewards from the environment and sends out the action (control) signals to the environment. The agent contains two components: a policy and a learning algorithm.
The policy, which is constructed from a deep neural network, maps observations to actions.
The learning algorithm keeps updating the policy parameters to find the optimal policy, i.e., the one that maximizes the cumulative long-term reward.
Agents can use multiple parameterized function approximators to train the policy. In general, two types of approximators are used (a simplified formulation follows the list).
Critics: Based on the received observation (and, for some methods, the action), a critic computes the value of the state or state-action pair to evaluate the actor.
Actor: Based on the observation, an actor outputs the best action, using the feedback from the critic to improve.
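As a simplified formulation (our own shorthand, not the exact update rule of any specific agent below), the critic approximates the expected long-term reward of a state-action pair, and the actor parameters are adjusted toward actions that the critic scores highly:
$$ Q_{\phi}(s, a) \approx \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right], \qquad \theta \leftarrow \arg\max_{\theta}\ \mathbb{E}_{s}\!\left[ Q_{\phi}\big(s, \pi_{\theta}(s)\big) \right]. $$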
Figure: Agent structure.
Agents that use only critics to select actions are called value-based agents, since they rely on an indirect policy representation. This type of agent usually works better with discrete action spaces, as the value-based approach becomes computationally expensive for continuous action spaces.
Agents that use only actors to select actions are called policy-based agents, since they rely on a direct policy representation. This type of agent is usually well suited to continuous action spaces.
In our work, to build DRL controllers with better performance, we select agents with both critics and actors. In these agents, the critic learns the value function from the reward and guides the actor toward the optimal action. This type of agent is more complex but also more powerful, and it can deal with both discrete and continuous action spaces.
The agents that we have deployed on CPS in this work are listed below (a configuration sketch follows the list):
Deep deterministic policy gradient reinforcement learning agent (DDPG): A model-free, online, off-policy method with an actor-critic structure. DDPG is the most compatible agent for environments with both continuous action and observation space. We recommend trying this agent first.
Twin-delayed deep deterministic policy gradient reinforcement learning agent (TD3): An actor-critic, model-free, online, off-policy method. TD3 is an improved, more complex version of DDPG. If the reward function and other environment configurations work on DDPG, it is good to apply the same settings with TD3 agent for better performance.
Actor-critic reinforcement learning agent (A2C): A model-free, online, on-policy, actor-critic agent. In general, it trains faster than the other agents, but for continuous action spaces an A2C agent cannot output a bounded action; thus, it sometimes requires a different reward function from the other agents.
Proximal policy optimization reinforcement learning agent (PPO): An actor-critic, model-free, online, on-policy agent. It optimizes a clipped surrogate objective function using stochastic gradient descent. We find that PPO sometimes can share the same reward and environment configurations with A2C agents. Compared with A2C, PPO is more stable but requires more training.
Soft actor-critic reinforcement learning agent (SAC): A model-free, online, off-policy, actor-critic agent. This type of agent uses the policy entropy to promote exploration; thus, SAC is an improved, more complex version of DDPG that requires extra training time.
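As an illustration of the agent configuration step, the sketch below creates a DDPG agent with default actor and critic networks from the specifications defined earlier; the option values are illustrative, not the exact hyperparameters used in our benchmark.

% DDPG agent with default actor/critic networks (illustrative options)
agentOpts = rlDDPGAgentOptions('SampleTime', 0.1, ...
    'DiscountFactor', 0.99, 'MiniBatchSize', 64);
agent = rlDDPGAgent(obsInfo, actInfo, agentOpts);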
As we use deep neural networks as approximators in the critics and actors, the structure of these networks can significantly affect the performance of the agents. For each system, we need to try different network configurations with multiple agent settings to find the optimal combination.
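One lightweight way to vary the network structure, shown below as a sketch, is through the initialization options of the default networks; the hidden-layer size is an assumed value, and the best choice is system-specific.

% Vary the default network size when creating the agent (assumed value)
initOpts = rlAgentInitializationOptions('NumHiddenUnit', 128);
agent = rlDDPGAgent(obsInfo, actInfo, initOpts);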
Once the environment, the reward function, and the DRL agent are settled, we can train the agent iteratively to find the optimal action policy. Proper termination conditions are needed to prevent the agent from wasting time on meaningless states in each episode (iteration). Technically, at the beginning of training, the agent explores the state space with no prior experience, so it can easily fall into states that should not be reached in practice. In this case, the termination mechanism will automatically stop the episode and restart the environment.
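A minimal sketch of launching training is given below; the episode limits and stopping criterion are assumed values, and the episode-termination ("is-done") signal itself is generated inside the Simulink model, e.g., when a monitored value leaves its allowed range.

% Training options and loop (illustrative limits and stopping criterion)
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 2000, 'MaxStepsPerEpisode', 600, ...
    'ScoreAveragingWindowLength', 20, ...
    'StopTrainingCriteria', 'AverageReward', 'StopTrainingValue', -10);
trainStats = train(agent, env, trainOpts);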
During the training, we need to observe how the reward and the number of steps change over time. Generally, the episode rewards and steps may fluctuate for hundreds of iterations at the beginning; then the rewards and steps should slowly increase over time and finally flatten out when the optimal policy is reached.
As stated above, we have trained multiple DRL controllers for each system, but not all of them operate as we expected. An improperly designed reward function, or a poor DNN or agent configuration, can all lead to training failures. Thus, we only deploy the DRL controllers with competitive performance on the CPS for later comparison.