What Matters in Learning A Zero-Shot Sim-to-Real RL Policy for Quadrotor Control? A Comprehensive Study

Jiayu Chen*, Chao Yu*+, Yuqing Xie, Feng Gao, Yinuo Chen, Shu’ang Yu,

Wenhao Tang, Shilong Ji, Mo Mu, Yi Wu, Huazhong Yang, Yu Wang+

*Equal Contribution

+ Corresponding Authors

Tsinghua University

Precise and agile flight maneuvers are essential for quadrotor applications, yet traditional control methods are limited by their reliance on flat trajectories or computationally intensive optimization. Reinforcement learning (RL)-based policies offer a promising alternative by directly mapping observations to actions, reducing dependency on system knowledge and actuation constraints. However, the sim-to-real gap remains a significant challenge, often causing instability in real-world deployments.

In this work, we identify five key factors for learning robust RL-based control policies capable of zero-shot real-world deployment: (1) integrating the velocity and rotation matrix into the actor inputs, (2) incorporating a time vector into the critic inputs, (3) regularizing action differences for smoothness, (4) applying system identification with selective randomization, and (5) using large batch sizes during training. Based on these insights, we develop SimpleFlight, a PPO-based framework that integrates these techniques. Extensive experiments on the Crazyflie quadrotor demonstrate that SimpleFlight reduces trajectory tracking error by over 50% compared to state-of-the-art RL baselines. It excels on both smooth polynomial trajectories and challenging infeasible zigzag trajectories, particularly on quadrotors with a low thrust-to-weight ratio, where baseline methods often fail. To enhance reproducibility and further research, we integrate SimpleFlight into the GPU-based Omnidrones simulator and provide open-source code and model checkpoints.

Paper (arxiv)

Code

Video

Real-world Experiments

Demo

Crazyflie 2.1

Air Quadrotor with 250 mm arm length

Performance Comparison

For baseline comparison, we reproduce two SOTA RL-based quadrotor control policies deployed on Crazyflie and conduct a stress test comparing SimpleFlight with a well-tuned MPC method, PAMPC.

DATT is a feedforward-feedback-adaptive policy for CTBR command-based trajectory tracking, achieving SOTA performance over PID and non-linear MPC.
Fly proposes a high-speed simulator and RL-based framework for direct RPM control, enabling superior sim-to-real transfer.
PAMPC is a non-linear MPC method that jointly optimizes perception and action objectives.

Table 1 Real-world trajectory tracking performance of all methods across benchmark trajectories on the Crazyflie 2.1

We report the trajectory tracking performance of SimpleFlight and the baseline methods across all benchmark trajectories in Tab. 1. On the Crazyflie, SimpleFlight achieves significantly better results than all baseline methods, reducing MED by over 50% on average. On the Air, SimpleFlight achieves performance comparable to that on the Crazyflie and outperforms the finely tuned PAMPC, highlighting the ability of SimpleFlight to generalize across quadrotor models and sizes. * indicates that for DATT in the zigzag trajectory trials, 4 out of 10 attempts failed; the reported MED is computed over the successful trials only.
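The comparison above rests on a single scalar per flight. The following is a minimal sketch of how such a metric could be computed, assuming MED denotes the mean Euclidean distance between flown and reference positions sampled at the same timestamps (the page does not spell out the acronym). The function names are illustrative and are not taken from the released code.

```python
import numpy as np

def mean_euclidean_distance(actual, reference):
    """Average per-timestep Euclidean distance between two (T, 3) position tracks.

    Assumption: both tracks are sampled at identical timestamps.
    """
    actual = np.asarray(actual, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if actual.shape != reference.shape:
        raise ValueError("tracks must have the same shape (T, 3)")
    return float(np.linalg.norm(actual - reference, axis=1).mean())

def relative_reduction(baseline_med, new_med):
    """Fractional error reduction of a new method relative to a baseline."""
    return (baseline_med - new_med) / baseline_med
```

Under this definition, a statement such as "reducing MED by over 50%" corresponds to `relative_reduction(...) > 0.5` for the averaged errors.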

Fly reliably tracks smooth trajectories at varying velocities but struggles with infeasible paths (e.g., fast pentagram and zigzag) due to limited long-horizon reasoning. DATT handles infeasible trajectories aggressively but fails in high-velocity tracking on quadrotors with a low thrust-to-weight ratio. SimpleFlight excels in actuation constraint awareness, long-horizon reasoning, and optimization, particularly for sharp turns and complex maneuvers.

Introduction

Precise and agile flight maneuvers are essential for UAVs, especially quadrotors, in a variety of applications. A significant challenge in RL-based quadrotor control is the sim-to-real gap, where policies trained in simulation often exhibit instability when deployed in the real world without additional fine-tuning. While various RL-based approaches have been proposed, there is no unified consensus on the key factors that contribute to training robust, zero-shot deployable control policies.

In this work, we investigate key factors essential for learning robust RL-based control policies capable of zero-shot deployment in the real world. We identify and summarize five critical factors across the entire training pipeline, covering input space design, reward design, and training techniques, and combine them into a training framework, SimpleFlight. We conduct extensive real-world experiments on the open-source, open-hardware nano quadrotor Crazyflie 2.1 to validate the effectiveness of SimpleFlight. Furthermore, we integrate SimpleFlight into the highly parallel GPU-based simulator Omnidrones, and we open-source the code, model checkpoints, and benchmark tasks to ensure reproducibility. Our contributions can be summarized as follows:

We investigate several key learning factors and develop a PPO-based training framework, SimpleFlight, for learning RL-based control policies with zero-shot deployment capability.
We conduct extensive real-world experiments on the Crazyflie to demonstrate the effectiveness of SimpleFlight. The policy derived by SimpleFlight is the only one capable of successfully completing all benchmarking trajectories, including both smooth and infeasible trajectories.
SimpleFlight reduces trajectory tracking error by over 50% compared to SOTA RL baselines, despite not employing any tailored algorithmic or network architecture design.
We integrate SimpleFlight into the highly parallel GPU-based simulator Omnidrones and open-source checkpoints to ensure reproducibility.

Methodology

Fig. 1 Overview of SimpleFlight. We begin with SysID and selective DR for quadrotor dynamics and low-level control. Next, an RL policy is trained in simulation to output CTBR for tracking arbitrary trajectories and zero-shot deployed directly on a real quadrotor. The training framework focuses on three key aspects, i.e., input space design, reward design, and training techniques, identifying five critical factors to enhance zero-shot deployment.

As shown in Fig. 1, we identify five critical factors to enhance zero-shot deployment:

Factor 1: Utilizing the rotation matrix instead of a quaternion and incorporating velocity into the actor’s input.
Factor 2: Adding a time vector to the critic’s input to enhance its temporal awareness.
Factor 3: Incorporating regularization of the differential action as the smoothness reward to penalize abrupt command changes.
Factor 4: Applying SysID to calibrate key dynamic parameters, and applying DR selectively. DR improves performance for sensitive parameters such as the thrust coefficient but is detrimental for insensitive or accurately measurable parameters.
Factor 5: Leveraging larger batch sizes during training.
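Factor 1 can be illustrated with a small sketch of an actor observation builder. One general reason to prefer a rotation matrix is that a quaternion q and its negation -q describe the same attitude, whereas the rotation matrix is unique. The observation layout, the function names, and the (w, x, y, z) quaternion ordering below are assumptions for illustration and are not taken from the SimpleFlight code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def actor_observation(pos, vel, quat, ref_positions):
    """Hypothetical actor input: relative reference positions, linear
    velocity, and the flattened rotation matrix (9 values)."""
    rel = (np.asarray(ref_positions, dtype=float) - np.asarray(pos, dtype=float)).ravel()
    rot = quat_to_rotmat(quat).ravel()
    return np.concatenate([rel, np.asarray(vel, dtype=float), rot])
```

With two future reference points, this layout yields a vector of 6 + 3 + 9 = 18 values. The number of reference points is a free choice in this sketch.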

Experiments

We train on a diverse set of reference trajectories, including smooth randomized fifth-degree polynomials and infeasible zigzag trajectories, whose accelerations are either zero or undefined. Training runs for 15,000 epochs and takes about 6 hours on an NVIDIA RTX 4090 GPU. Note that we train a single policy for all trajectories.

We deploy the derived policy on the open-source, open-hardware nano quadrotor Crazyflie 2.1. The position, velocity, and orientation are provided by an OptiTrack motion capture system at 100 Hz to an offboard computer that executes the policy. The CTBR control commands are transmitted to the quadrotor at 100 Hz via a 2.4 GHz radio.

Fig. 2 Visualization of benchmark trajectories and corresponding trajectories followed using SimpleFlight. The reference trajectories are shown in black.

We adopt two types of trajectories as benchmark trajectories: smooth trajectories (figure-eight and polynomial) and infeasible trajectories (pentagram and zigzag). Among them, the figure-eight and pentagram trajectories are deterministic, while the polynomial and zigzag trajectories are randomly generated for each trial. All trajectories start from the origin (0, 0) with a fixed z-axis height. Examples of benchmark trajectories are shown in Fig. 2.
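To make the two trajectory families concrete, the sketch below generates one smooth figure-eight and one piecewise-linear zigzag, both starting at (0, 0) with a fixed height. The parametrization, periods, segment times, and step sizes are hypothetical choices and not the benchmark definitions. Along each straight zigzag segment the acceleration is zero, and at each corner the velocity jumps, so the acceleration is undefined there, which is what makes such a path infeasible to track exactly.

```python
import numpy as np

def figure_eight(t, period=5.0, size=1.0, height=1.0):
    """Smooth figure-eight in the xy-plane that passes through (0, 0) at t = 0."""
    t = np.asarray(t, dtype=float)
    w = 2 * np.pi / period
    x = size * np.sin(w * t)
    y = size * np.sin(w * t) * np.cos(w * t)
    return np.stack([x, y, np.full_like(t, height)], axis=-1)

def zigzag(t, rng, segment_time=1.0, max_step=1.0, height=1.0):
    """Piecewise-linear path through random waypoints, starting at (0, 0)."""
    t = np.asarray(t, dtype=float)
    n_seg = int(np.ceil(t.max() / segment_time)) + 1
    # Random xy displacement per segment; cumulative sums give the waypoints.
    steps = rng.uniform(-max_step, max_step, size=(n_seg, 2))
    waypoints = np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])
    knots = np.arange(n_seg + 1) * segment_time
    x = np.interp(t, knots, waypoints[:, 0])
    y = np.interp(t, knots, waypoints[:, 1])
    return np.stack([x, y, np.full_like(t, height)], axis=-1)
```

For example, `zigzag(np.arange(0, 5, 0.01), np.random.default_rng(0))` samples a new random zigzag at 100 Hz, mirroring the per-trial randomization described above.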

Real-world experiment results are shown in the Real-world Experiments section above.

In-depth Analysis on Key Factors

Input Space Design

Fig. 3 Training performance of input space designs

We evaluate three configurations:

time vector in both actor and critic (AC w/ t),
time vector only in critic (C w/ t, A w/o t),
no time vector (AC w/o t).

Results show that incorporating the time vector significantly improves tracking accuracy (AC w/ t and C w/ t, A w/o t), as it enhances the critic's ability to capture temporal information and estimate state values.

However, including the time vector in the actor (AC w/ t) can cause out-of-distribution (OOD) issues during long-duration flights, as the reference trajectory's timesteps may exceed the training trajectory's maximum length. Thus, we include the time vector only in the critic to balance accurate value estimation with robust performance.
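The critic-only configuration (C w/ t, A w/o t) can be sketched as an asymmetric input builder. Since the critic is used only during training, the deployed actor never sees a time value outside the training range. The normalization by episode length and the repetition of the scalar into a short vector are assumptions made for illustration.

```python
import numpy as np

def time_vector(step, max_steps, dim=4):
    """Normalized episode progress step / max_steps, repeated `dim` times.

    Assumption: the exact encoding of the time vector is not given on this page.
    """
    return np.full(dim, step / max_steps, dtype=float)

def actor_critic_inputs(state_obs, step, max_steps):
    """Return (actor_input, critic_input); only the critic receives time."""
    actor_in = np.asarray(state_obs, dtype=float)
    critic_in = np.concatenate([actor_in, time_vector(step, max_steps)])
    return actor_in, critic_in
```

The actor input is independent of `step`, so flights longer than any training episode leave its input distribution unchanged.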

Smoothness Reward

We evaluate various smoothness components commonly used in existing studies, with the real-world tracking performance summarized in Table 2. Here, $\mathrm{acc}_t$, $\mathrm{jerk}_t$, and $\mathrm{snap}_t$ denote the second, third, and fourth derivatives of position at timestep $t$, respectively, and $u_t$ denotes the policy’s CTBR output at timestep $t$. Note that $\|u_t\|_2$ penalizes the desired angular velocity and thrust, indirectly constraining the third derivative of position, while $\|u_t - u_{t-1}\|_2$ penalizes angular acceleration and differential thrust, indirectly targeting the fourth derivative.

We perform a grid search over the hyperparameters for each component and report the best results. The term $\|u_t - u_{t-1}\|_2$ achieves the best tracking performance among all designs.
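The best-performing term can be written as a few lines of code. This is a sketch under stated assumptions: the action is a 4-vector of CTBR commands (collective thrust plus three body rates), the penalty uses the plain Euclidean norm, and the weight value is a placeholder rather than the tuned hyperparameter from the grid search.

```python
import numpy as np

def smoothness_penalty(u_t, u_prev, weight=0.1):
    """Negative reward proportional to the change between consecutive actions."""
    diff = np.asarray(u_t, dtype=float) - np.asarray(u_prev, dtype=float)
    return -weight * float(np.linalg.norm(diff))

def episode_smoothness(actions, weight=0.1):
    """Sum of the penalty over a (T, 4) action sequence; the first step is
    compared against itself and therefore contributes zero."""
    actions = np.asarray(actions, dtype=float)
    prev = np.vstack([actions[:1], actions[:-1]])
    return sum(smoothness_penalty(u, p, weight) for u, p in zip(actions, prev))
```

A constant command sequence incurs no penalty, while abrupt command changes are penalized in proportion to their size, which is the behavior the smoothness reward is meant to encourage.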

Table 2 Real-world tracking performance of different smoothness components

SysID and DR

Table 3 Real-world tracking performance of SysID and DR on the figure-eight trajectory at normal velocity

We acknowledge that precise SysID is inherently challenging. Based on the results presented in Table 3 (consistent with TABLE I in our paper), our key finding is that the need for SysID and DR varies significantly depending on the dynamic parameters.

Specifically, for parameters such as mass $m$ and inertia $I$, which can be measured through conventional methods, accurate SysID is essential, and domain randomization (DR) should not be applied. Introducing DR for these parameters increases training complexity, often causing the policy to converge to suboptimal solutions.

The motor time constant $T_m$ can be measured experimentally for larger quadrotors. However, for smaller quadrotors like the Crazyflie 2.1, direct measurement is impractical. Instead, we adopt typical values from DATT, an RL-based method that achieves SOTA tracking performance on the Crazyflie 2.1. The experiments in Table 3 demonstrate that sim2real performance is insensitive to $T_m$, rendering DR unnecessary. DR primarily increases the learning difficulty without providing significant performance benefits.

Regarding the thrust coefficient $k_f$, we estimate it using the force balance equation during stable hovering. At the hover point, each of the four rotors carries a quarter of the weight, so the equation $\frac{1}{4} m g = r \, k_f \, \Omega_{\max}^2$ holds, where $r$ is the throttle percentage and $\Omega_{\max}$ is the maximum rotor speed. Thus, $k_f$ can be derived as $k_f = \frac{m g}{4 r \, \Omega_{\max}^2}$. For the Crazyflie 2.1, we obtain $\Omega_{\max}$ from the official documentation and measure the throttle percentage during hovering to estimate $k_f$. The results in Table 3 show that sim2real performance is highly sensitive to $k_f$, with parameter deviations leading to significant performance degradation. However, introducing DR substantially improves performance. Based on these observations, we recommend initially obtaining an approximate value of $k_f$ through SysID and avoiding DR in the early stages. If the real-world performance is unsatisfactory, DR can be introduced as a potential method for improvement.
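The hover-based estimate and the selective use of DR can be sketched as follows. The first function is a direct transcription of the force balance above; the second randomizes only the thrust coefficient and leaves mass, inertia, and the motor time constant at their identified values. The dictionary keys, the uniform sampling, and all numeric values in the usage note are hypothetical placeholders, not measured Crazyflie parameters.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2

def estimate_kf(mass_kg, hover_throttle, omega_max):
    """Thrust coefficient from hover: (1/4) m g = r * k_f * omega_max^2.

    `hover_throttle` is the throttle fraction r in (0, 1]; k_f comes out in
    units consistent with whatever unit `omega_max` uses.
    """
    if not 0.0 < hover_throttle <= 1.0:
        raise ValueError("hover_throttle must be in (0, 1]")
    return mass_kg * G / (4.0 * hover_throttle * omega_max ** 2)

def sample_dynamics(nominal, rng, kf_range=0.0):
    """Selective DR sketch: scale only k_f by a factor in [1 - kf_range, 1 + kf_range].

    With kf_range = 0 this reduces to pure SysID with no randomization.
    """
    params = dict(nominal)
    params["kf"] = nominal["kf"] * rng.uniform(1.0 - kf_range, 1.0 + kf_range)
    return params
```

As a usage example with placeholder numbers, `estimate_kf(0.03, 0.6, 2500.0)` gives a nominal value, and `sample_dynamics({"mass": 0.03, "kf": kf}, np.random.default_rng(0), kf_range=0.2)` draws a per-episode variant in which only `kf` changes.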

Effect of Batch Sizes

Fig. 4 The tracking performance of policies trained using different numbers of parallel environments.

To evaluate the impact of batch size, we measure simulation and real-world performance on figure-eight trajectories (slow, normal, and fast) while varying the number of parallel environments. As shown in Fig. 4, real-world performance keeps improving with batch size even as simulation performance converges, and it stabilizes as the batch size grows further. Based on this finding, we recommend using larger batch sizes during training to enhance sim-to-real transfer.
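The link between parallel environments and batch size is simple arithmetic in vectorized on-policy training: each PPO update consumes one transition per environment per rollout step. The sketch below makes that explicit. The rollout length and environment counts are hypothetical values, not the settings used for Fig. 4.

```python
def rollout_batch_size(num_envs, rollout_steps):
    """Transitions collected per PPO update in a vectorized simulator."""
    if num_envs <= 0 or rollout_steps <= 0:
        raise ValueError("num_envs and rollout_steps must be positive")
    return num_envs * rollout_steps

def batch_sizes(env_counts, rollout_steps):
    """Map each candidate number of parallel environments to its batch size."""
    return {n: rollout_batch_size(n, rollout_steps) for n in env_counts}
```

For instance, `batch_sizes([1024, 4096, 16384], rollout_steps=32)` shows how the batch grows linearly with the number of environments when the rollout length is held fixed.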
