FlyTrap: Physical Distance-Pulling Attack Towards Camera-based Autonomous Target Tracking Systems
The Network and Distributed System Security (NDSS) Symposium 2026
* For all the demonstrations, the video quality can be adjusted for better visualization.
Autonomous Target Tracking (ATT), often referred to as Active Track, Motion Track, or Dynamic Track, enables autonomous systems, such as drones, to follow selected targets while maintaining a stable distance. Drones have become a prominent platform for ATT due to their versatility, supporting applications like security surveillance, border control, law enforcement, and entertainment. Real-world examples include the U.S. Customs and Border Protection’s use of drones for border surveillance. However, this technology also introduces significant security, privacy, and safety risks, particularly when exploited for criminal purposes, such as stalking or deploying explosives.
Given these risks, the security of ATT systems is critical. Our work found that ATT systems can be fundamentally vulnerable to a newly-discovered “Distance-Pulling Attack” (DPA), where drones running the ATT feature can be manipulated to dangerously shorten the distance to their tracked targets. DPA can lead to severe consequences, including collisions, physical capture of drones, and susceptibility to a broader range of sensor attacks, as illustrated in Figure 1. Unlike attacks that cause tracking errors, DPA enables attackers to more easily crash/eliminate drones or physically capture them (e.g., for personal/business gains such as by reverse-engineering their functions), which can have severe impacts on critical real-world ATT applications such as security surveillance, border control, and law enforcement. Addressing these vulnerabilities and understanding the implications of DPAs is both imperative and urgent to ensure the security and safety of ATT systems.
The dataset we collected includes 4 different individuals and 4 distinct background locations (e.g., drivable road, bare ground, grass field, parking lot), yielding 16 combinations in total. For each individual in each environment, we recorded separate training and testing videos at 24 frames per second with a resolution of 1920 x 1080 pixels. Each video ranges from 11 to 37 seconds in duration. The dataset includes 23 training videos comprising 11,898 frames and 25 evaluation videos comprising 13,594 frames.
We present qualitative results by showing the FlyTrap attack video demonstrations in addition to Table IV. With our attack target generation (ATG) design, we can manipulate the bounding box aspect ratio to bypass PercepGuard.
Demo 1: We evaluate FlyTrap w/ ATG against the PercepGuard defense. The LSTM predicts "person" across frames, so no alarm is triggered.
Demo 2: We evaluate FlyTrap w/o ATG against the PercepGuard defense. The LSTM predicts "car" across frames, triggering the alarm.
We evaluate FlyTrap with attack target generation (ATG) against PercepGuard, a spatial-temporal defense for autonomous vehicles that was originally designed to detect object-detection misclassification attacks. We adapt it to person tracking: if the LSTM predicts "person", we regard it as no alarm; if it predicts any other class, we regard it as an alarm.
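The adapted alarm rule can be sketched as follows. This is a minimal illustration of our adaptation, not PercepGuard's actual implementation: `boxes_to_features` is a hypothetical helper showing one reasonable parameterization of the per-frame box sequence (center, size, aspect ratio) that a spatial-temporal LSTM could consume, and the trained LSTM itself is assumed to exist elsewhere.

```python
import numpy as np

def boxes_to_features(boxes):
    """Convert per-frame boxes (x1, y1, x2, y2) into a per-frame
    feature sequence for a spatial-temporal classifier: center
    position, width, height, and aspect ratio. (An assumed
    parameterization for illustration, not PercepGuard's exact one;
    note the aspect ratio w/h is exactly what ATG manipulates.)"""
    boxes = np.asarray(boxes, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + w / 2
    cy = boxes[:, 1] + h / 2
    return np.stack([cx, cy, w, h, w / h], axis=1)

def percepguard_alarm(predicted_class, expected="person"):
    """Our single-class adaptation: alarm iff the LSTM's class
    prediction for the tracked box sequence is not 'person'."""
    return predicted_class != expected
```

With ATG, the manipulated trajectory still classifies as "person" (`percepguard_alarm("person")` is `False`); without ATG, the distorted aspect ratio leads to a "car" prediction and triggers the alarm.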
We present qualitative results by showing the FlyTrap attack video demonstrations in addition to Table V. With our attack target generation (ATG) design, we can achieve spatial consistency (i.e., overlapping tracking and detection prediction) and temporal consistency (i.e., consistent human pose).
Demo 1: We evaluate FlyTrap w/ ATG against the VOGUES defense. Our attack remains consistent across the single-object tracker (shown in the red box), object detector (shown in the blue box), and pose estimator (shown as human joints).
Demo 2: We evaluate FlyTrap w/o ATG against the VOGUES defense. Without spatial-temporal constraints, no human is detected in the single-object tracker prediction area (shown in the red box), thus triggering the alarm.
We evaluate FlyTrap with attack target generation (ATG) against VOGUES, a spatial-temporal defense for autonomous vehicles that was originally designed for multiple-object trackers. We follow their setup while making necessary adaptations for single-object trackers: we compute the highest IoU between the single-object tracking prediction and the object detections, and trigger the alarm if this IoU falls below a preset threshold (e.g., 0.5). Taking the highest IoU prevents a high false-alarm rate in the single-object tracking task when other passersby appear.
We also reproduce an LSTM to inspect the consistency of the human pose over time. The IoU and LSTM score are shown at the top right of the demo video.
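The spatial-consistency check of our single-object-tracker adaptation can be sketched as below. This is an illustrative sketch of the alarm rule described above, not the VOGUES codebase; the pose-consistency LSTM is omitted.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def vogues_alarm(track_box, det_boxes, thresh=0.5):
    """Alarm iff no detection overlaps the single-object-tracker
    prediction above the IoU threshold. Taking the *highest* IoU over
    all detections avoids false alarms from unrelated passersby."""
    best = max((iou(track_box, d) for d in det_boxes), default=0.0)
    return best < thresh
```

For example, a tracker box that exactly matches one detection yields IoU 1.0 and no alarm, while a tracker box overlapping no detection (or an empty detection set) triggers the alarm.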
We present the video demonstration in addition to Figures 9 and 10. Without our progressive distance-pulling (PDP) design, the attacked bounding box locks onto a fixed human-shaped area on the umbrella across different distances. With our PDP design, the shrink rate is much smaller; at closer distances, the shrink rate is even smaller than at longer distances.
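For clarity, the shrink-rate metric compared in these demos can be sketched as below. This is an assumed definition for illustration (box-area shrinkage relative to the initialization box), not necessarily the exact formula behind Figures 9 and 10.

```python
def shrink_rate(init_box, attacked_box):
    """Fraction by which the tracked box area has shrunk relative to the
    initialization box, with boxes given as (x1, y1, x2, y2). An ATT
    controller interprets a smaller box as 'target is farther away', so
    a larger shrink rate pulls the drone closer to the target."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - area(attacked_box) / area(init_box)
```

For instance, an attacked box at half the initial area gives a shrink rate of 0.5, while an unchanged box gives 0.0.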
MixFormer w/o PDP at distance of 8m
MixFormer w/ PDP at distance of 8m
MixFormer w/o PDP at distance of 2m
MixFormer w/ PDP at distance of 2m
SiamRPN w/o PDP at distance of 8m
SiamRPN w/ PDP at distance of 8m
SiamRPN w/o PDP at distance of 2m
SiamRPN w/ PDP at distance of 2m
To evaluate our attack in physical, closed-loop setups, we built a drone with full-stack ATT capabilities.
Physical evaluation setups. We use a MacBook Pro as the ground control station to select targets through a web portal (shown on the right). For safety reasons, we keep the drone stationary and have the experimenter move until the bounding box matches its initialization size, which approximates autonomous tracking behavior.
Demo: Our implemented full-stack autonomous target tracking drone.
We provide the following video demonstrations in addition to Figures 14 and 15 and Table VIII. The DJI Mini 4 Pro might implement ActiveTrackTrackingState checks (e.g., AIRCRAFT_TOO_LOW) to prevent the drone from crashing directly into the object, which causes the final untracking and hovering behavior. Nonetheless, the distance between the drone and the attacker is substantially shortened, which still demonstrates the DPA consequences.
Demo 1: first-person view (FPV) from remote controller screenshot.
Demo 4: drone capturing attack using net gun.
Demo 2: first-person view (FPV) from remote controller screenshot.
Demo 5: drone with a broken arm after netgun shooting.
Demo 3: third-person view (TPV) from the observer.
Demo 6: maximum tracking distance. The drone automatically moves closer if the initial distance is too far.
Demo 7: FlyTrap attack against the DJI Mini 4 Pro under its maximum tracking distance (we empirically find this distance shown in Demo 6).
We also study the maximum tracking distance of the DJI Mini 4 Pro and the maximum FlyTrap working distance. We empirically find that, to improve tracking quality, the DJI drone internally enforces a maximum tracking distance so that it maintains a high-resolution image of the target object. In our environment, this maximum distance is around 20 meters: the drone automatically moves closer whenever the distance exceeds that range. Based on this observation, we test FlyTrap against the DJI Mini 4 Pro at a distance of 20 meters and find that it still works.
We provide the following video demonstrations in addition to Figure 16 and Table VIII.
Demo 1: FlyTrap attack against HoverAir (side view).
Demo 2: FlyTrap attack against HoverAir (first-person view).
Demo 3: FlyTrap attack against HoverAir.
Demo 4: Normal umbrella for comparison.
We provide the following video demonstrations in addition to Table VIII.
Demo 1: FlyTrap attack against the DJI NEO.
Demo 2: Normal umbrella for comparison.
We conduct a user study to investigate the stealthiness of FlyTrap. The survey is available at: PDF, and the results are shown below.