RL algorithms provide a general framework for learning autonomous policies for one or more agents through trial-and-error interaction with the environment, as seen in Figure 1. RL algorithms aim to learn the optimal behaviour for an agent in an environment described as a Markov decision process (MDP). An MDP provides the mathematical framework for modelling decision-making in scenarios whose outcomes are partly random and partly under the control of the decision maker. It is important to note that the state transitions within an MDP adhere to the Markov property. RL uses the formal framework of MDPs to define the interaction between a learning agent (i.e., a UAV in the given scenario) and its environment in terms of states, actions, and rewards. In the given context, the MDP can be formally presented as a quintuple ⟨S, A, Pss′, R, s0⟩, where S represents the state space of the agent, A denotes its set of feasible actions, Pss′ represents the transition function giving the probability of moving from state s to state s′, R denotes the reward function, and s0 denotes the starting configuration of the agent in the environment. A state-action value (Q-value) function is defined as the expected cumulative reward the agent obtains by taking a given action in a given state and following its policy thereafter. The idea is to interact with the environment iteratively to determine the best action for each state. This process balances exploration (discovering new possibilities) and exploitation (capitalizing on experience), ultimately converging to an optimal state-action value function.
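To make the Q-value update and the balance between exploration and exploitation concrete, the following is a minimal tabular Q-learning sketch on a toy 4x4 grid world. The environment, the goal reward, and the hyperparameters (learning rate, discount factor, epsilon) are illustrative assumptions and are not drawn from any particular UAV scenario.

```python
import numpy as np

# Minimal tabular Q-learning sketch: epsilon-greedy action selection and the
# one-step Q-value update on an assumed 4x4 grid world (not a UAV environment).

N_STATES, N_ACTIONS = 16, 4        # assumed 4x4 grid; actions: up/down/left/right
GOAL_STATE = 15
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # illustrative hyperparameters

def step(state, action):
    """Toy deterministic transition: move on the grid, reward 1 at the goal."""
    row, col = divmod(state, 4)
    if action == 0:   row = max(row - 1, 0)   # up
    elif action == 1: row = min(row + 1, 3)   # down
    elif action == 2: col = max(col - 1, 0)   # left
    else:             col = min(col + 1, 3)   # right
    next_state = row * 4 + col
    reward = 1.0 if next_state == GOAL_STATE else 0.0
    return next_state, reward, next_state == GOAL_STATE

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

for episode in range(500):
    state, done = 0, False
    while not done:
        # Exploration vs. exploitation: random action with probability epsilon,
        # otherwise the greedy action under the current Q estimate.
        if rng.random() < EPSILON:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update toward the one-step bootstrapped return.
        td_target = reward + GAMMA * np.max(Q[next_state]) * (not done)
        Q[state, action] += ALPHA * (td_target - Q[state, action])
        state = next_state
```

Iterating this update over many episodes drives the table Q toward the optimal state-action value function for the toy task, which is the same principle that underlies the UAV policies discussed in this work.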
Recent advancements in RL, including the application of Deep Reinforcement Learning (Deep RL), have further extended the capabilities of UAVs. Deep RL has enabled UAVs to approximate complex state-action value functions, allowing them to navigate intricate environments with a high degree of precision and intelligence. This trial-and-error approach, with continuous interaction with the environment, empowers UAVs to autonomously learn and optimize actions based on experience and rewards. Reinforcement learning sets itself apart from other approaches by not requiring labeled input/output pairs or explicit corrections of sub-optimal actions. Instead, it focuses on finding a balance between exploration (of the unknown environment) and exploitation (of current experience). Our work explores the applications of RL, particularly Deep RL, for multi-UAV control. Within this framework, the focus is on model-free algorithms, as opposed to model-based algorithms that infer a model of the environment from observations and then plan a solution using that model. For security and response applications, model-free Deep RL algorithms are deemed more suitable, enabling UAVs to adapt dynamically and make instantaneous decisions. This adaptability is crucial in addressing the inherent uncertainty and variability of the environment.
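As an illustration of the model-free approach, the sketch below shows a DQN-style state-action value network trained directly from transition samples, without learning a model of the environment. The state dimension, action set, network size, and the synthetic batch are illustrative assumptions; a practical multi-UAV controller would typically add components such as a replay buffer and a target network, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

# Model-free Deep RL sketch: a small network approximates Q(s, a) from raw state
# observations; updates use sampled transitions only, with no environment model.

STATE_DIM = 6      # assumed UAV state, e.g., 3-D position and velocity
N_ACTIONS = 5      # assumed discrete motion primitives
GAMMA = 0.99

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)   # one Q-value per discrete action

q_net = QNetwork(STATE_DIM, N_ACTIONS)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(batch):
    """One temporal-difference update from a batch of (s, a, r, s', done) samples."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + GAMMA * q_net(next_states).max(dim=1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Synthetic transitions standing in for real UAV interaction data.
batch = (
    torch.randn(32, STATE_DIM),             # states
    torch.randint(0, N_ACTIONS, (32,)),     # actions
    torch.randn(32),                        # rewards
    torch.randn(32, STATE_DIM),             # next states
    torch.zeros(32),                        # done flags
)
print(td_update(batch))
```

Because the update relies only on observed transitions, the same training loop can be reused as the environment changes, which is the adaptability that motivates the model-free choice for security and response applications.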
Figure 1: The agent (UAV)–environment interaction in a Markov decision process
Figure 2: Multi-UAV applications in different sectors