ControlLoc is a physical-world adversarial patch attack that hijacks camera-based perception in autonomous driving (AD) by manipulating both object detection and multiple-object tracking (MOT). Unlike prior patch attacks that only suppress detections or require persistent per-frame success, ControlLoc adopts a two-stage pipeline—(1) patch-location preselection and (2) targeted patch generation—that produces fabricated bounding boxes and erases original ones to force trackers to associate with attacker-controlled detections. ControlLoc is effective across multiple object detectors and MOT algorithms, achieves an average digital attack success rate of ≈98.1%, and demonstrates robust real-world performance (≈79% ASR) using a monitor as the attack vector. System-level evaluation in a production AD simulator shows severe downstream consequences (high collision and unnecessary emergency-stop rates), highlighting a practical vulnerability in camera-based AD perception.
As shown in the figure above, ControlLoc aims to induce dangerous driving outcomes by hijacking object trackers rather than merely causing transient detection errors. The attack supports two high-impact goals:
Move-in (false intrusion): fabricate a plausible object moving into the vehicle’s path to trigger unnecessary emergency braking.
Move-out (displacement): shift the perceived location of a real obstacle toward the roadside so the vehicle overlooks it, potentially causing a collision.
Attack flow: Initially, the attacker's vehicle is correctly detected and tracked by the AD vehicle. The attacker first shows an adversarial patch that erases the vehicle's original bounding box and fabricates a new, offset one, hijacking the tracker. The attacker then shows a second adversarial patch that only erases the vehicle's bounding box, preventing the hijacked tracker from re-associating with the original BBOX. As a result, the tracker remains hijacked for R frames, and the original object is not tracked again until H frames later.
Key characteristics of the attack:
Short-term injection, long-term effect. The attacker only needs to manipulate a few consecutive frames; MOT’s tracker management retains the hijacked association for multiple frames, causing a sustained effect even after detections return to normal.
Monitor as attack vector. Using a display (monitor) to present dynamic adversarial frames enables the attack to generate the required appearance and motion cues that static printed patches cannot. Embedded adversarial content can be hidden in benign advertisement videos to improve stealth.
Tracking-agnostic hijacking. ControlLoc produces fabricated bounding boxes that maintain enough overlap and class consistency to be associated with the original tracker across typical MOT data-association thresholds, making the attack transferable across MOT implementations.
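To illustrate why such overlap suffices, the sketch below (a simplified illustration, not any specific tracker's implementation) shows a generic IOU-gated association step accepting a fabricated, offset box as a match for the existing tracker; the gate value, box coordinates, and names are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Hypothetical tracker prediction and the attacker's fabricated detection.
track_pred = np.array([100, 120, 220, 260], dtype=float)  # where the tracker expects the car
fabricated = np.array([140, 120, 260, 260], dtype=float)  # shifted box induced by the patch

IOU_GATE = 0.3  # illustrative association threshold; real MOT stacks differ
if iou(track_pred, fabricated) >= IOU_GATE:
    # The tracker accepts the fabricated box and its state drifts toward it.
    print("hijacked: tracker associates with the attacker-controlled box")
```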
ControlLoc's attack methodology, as shown in the figure above, comprises two stages.
Stage I — Patch location preselection
We optimize a soft mask over candidate placement regions (e.g., vehicle rear) to identify locations that maximize attack efficacy. The optimization balances an adversarial objective with a mask clustering loss so the resulting high-sensitivity pixels form a contiguous patch region (suitable for a physical patch). A sliding-window aggregation over the sensitivity mask selects the final rectangle matching the physical patch size. This preselection boosts real-world success compared to random placement and keeps computational overhead low.
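As a rough illustration of the final aggregation step, the sketch below assumes a per-pixel sensitivity mask has already been optimized; it simply slides a window of the physical patch size over the mask and keeps the highest-scoring rectangle. The function name, stride, and toy mask are assumptions, not the paper's code.

```python
import numpy as np

def select_patch_region(sensitivity, patch_h, patch_w, stride=4):
    """Pick the patch_h x patch_w window with the largest total sensitivity.

    `sensitivity` is the optimized soft mask over the candidate placement
    region (e.g., the vehicle rear); the stride is an illustrative choice.
    """
    H, W = sensitivity.shape
    best_score, best_xy = -np.inf, (0, 0)
    for y in range(0, H - patch_h + 1, stride):
        for x in range(0, W - patch_w + 1, stride):
            score = sensitivity[y:y + patch_h, x:x + patch_w].sum()
            if score > best_score:
                best_score, best_xy = score, (y, x)
    return best_xy, best_score

# Toy example: a 96x160 sensitivity map with one contiguous high-sensitivity cluster.
mask = np.random.rand(96, 160) * 0.1
mask[40:72, 60:110] += 1.0
(top, left), _ = select_patch_region(mask, patch_h=32, patch_w=48)
print("patch placed at", (top, left))
```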
Stage II — Targeted patch generation (iterative process)
Given a selected patch location, we generate the patch with three coordinated components:
Finding the target fabricated bounding box (B_t). Iteratively shift the original bounding box along the attacker’s desired direction until it reaches the largest displacement that still satisfies the MOT IOU association constraint (see the first sketch after this list). B_t is the target shape/location the patch should induce.
Bounding-box filtering (C-BBOX). Leverage the grid-based structure of modern detectors to select which detector proposals should be fabricated (B_f) and which should be erased (B_e). For anchor-based detectors this picks the best anchor in the grid cell; for anchor-free detectors a corrective offset is applied to find the candidate cell. This precise filtering focuses optimization on the proposals that actually control NMS and tracker association.
Loss design and optimization strategy. Two primary losses are used:
Score loss (L_s): raises confidence of B_f (so it survives NMS) and suppresses B_e (so original boxes disappear).
Regression loss (L_r): aligns the fabricated box’s shape and center to B_t (IOU and centroid distance terms).
Rather than naively combining these losses with fixed weights (which leads to gradient imbalance and poor results), ControlLoc uses a conditional optimization: when B_f already appears in post-NMS outputs and B_e is absent, we switch to refining L_r; otherwise we prioritize L_s. Total-variation regularization and Expectation-over-Transformation (EoT) augment robustness across lighting and viewpoint changes.
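The following sketch illustrates the B_t search from the first component above: it repeatedly shifts the original box along the attacker's chosen direction and stops one step before the overlap with the original box drops below the association gate. The gate value, step size, and box format are illustrative assumptions rather than the paper's exact parameters.

```python
import numpy as np

def find_target_bbox(orig_box, direction, iou_gate=0.3, step=2.0, max_iters=200):
    """Shift `orig_box` along `direction` (a unit vector, in pixels) as far as
    possible while the shifted box still overlaps the original enough to be
    associated by the tracker. Boxes use (x1, y1, x2, y2) format."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    target = np.asarray(orig_box, dtype=float)
    shift = np.array([direction[0], direction[1], direction[0], direction[1]]) * step
    for _ in range(max_iters):
        candidate = target + shift
        if iou(np.asarray(orig_box, dtype=float), candidate) < iou_gate:
            break  # one more step would break the association
        target = candidate
    return target

# Move-out example: push the perceived box toward the right road edge.
b_t = find_target_bbox([100, 120, 220, 260], direction=(1.0, 0.0))
print("target fabricated box B_t:", b_t)
```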
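A minimal, self-contained sketch of the conditional optimization is shown below. A toy differentiable head stands in for the real detector, and a simple score margin approximates the post-NMS check; everything here (ToyHead, thresholds, loss weights) is an illustrative assumption, not ControlLoc's implementation.

```python
import torch

class ToyHead(torch.nn.Module):
    """Toy stand-in for a detector head: maps the patch to (s_f, s_e, box_f)."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 16 * 16, 6)   # 2 scores + 4 box coords
    def forward(self, patch):
        out = self.proj(patch.flatten())
        return out[0], out[1], out[2:]                # score of B_f, score of B_e, box of B_f

def iou_loss(box, target):
    x1 = torch.max(box[0], target[0]); y1 = torch.max(box[1], target[1])
    x2 = torch.min(box[2], target[2]); y2 = torch.min(box[3], target[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda b: ((b[2] - b[0]) * (b[3] - b[1])).clamp(min=0)
    return 1.0 - inter / (area(box) + area(target) - inter + 1e-9)

def total_variation(patch):
    return (patch[:, 1:, :] - patch[:, :-1, :]).abs().mean() + \
           (patch[:, :, 1:] - patch[:, :, :-1]).abs().mean()

head = ToyHead()
b_t = torch.tensor([164., 120., 284., 260.])          # target fabricated box from Stage II
patch = torch.rand(3, 16, 16, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.01)

for _ in range(300):
    s_f, s_e, box_f = head(patch)
    # Conditional switch: once B_f would survive NMS and B_e would be suppressed
    # (approximated here by a score margin), refine the regression loss L_r;
    # otherwise keep pushing the score loss L_s.
    if s_f > 0.5 and s_e < 0.5:
        center_f = (box_f[:2] + box_f[2:]) / 2
        center_t = (b_t[:2] + b_t[2:]) / 2
        loss = iou_loss(box_f, b_t) + 0.01 * (center_f - center_t).norm()
    else:
        loss = -s_f + s_e
    loss = loss + 0.1 * total_variation(patch)         # smoothness for printability/display
    opt.zero_grad(); loss.backward(); opt.step()
    patch.data.clamp_(0, 1)
```

In a real pipeline, the score margin would be replaced by actually running NMS on the detector outputs, and EoT transforms (scale, rotation, brightness) would be applied to the patch before each forward pass.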
Physical realization. The final patch is rendered into short segments of a benign video shown on a monitor (e.g., 1–3 frames inserted), enabling realistic dynamic cues while remaining visually inconspicuous. Physical testing uses a compact monitor and realistic camera parameters to ensure transferability.
We evaluated ControlLoc across digital benchmarks, controlled physical tests, and system-level simulation.
Digital evaluation
Tested combinations of four object detectors (including YOLO variants and Baidu Apollo OD) and four MOT algorithms (ApoT, BoT-SORT, ByteTrack, StrongSORT).
Datasets: BDD and KITTI clips selected for move-in and move-out scenarios.
Results: average Attack Success Rate (ASR) ≈ 98.1%, and successful hijacking typically requires only ~2.5–3.6 frames on average.
Comparison with prior hijacking baseline
Reproduction of a prior digital hijacking method shows poor effectiveness (low ASR), especially in the physical world (0% ASR in our experiments).
ControlLoc outperforms baseline methods by large margins due to precise BBOX filtering, mask preselection, and the conditional optimization approach.
Physical-world evaluation
Setup: generated patches were displayed on a 32-inch monitor placed as the attack vector in realistic outdoor driving scenarios. We varied camera angles and lighting conditions across six background scenes.
Results: average physical ASR ≈ 79% across scenarios. Best performance observed under cloudy conditions (≈87% ASR); sunny and night conditions are slightly lower due to brightness and glare effects.
The baseline method remains ineffective in the physical tests (≈0% ASR).
System-level impact
Setup: Using Baidu Apollo perception stack and the LGSVL simulator, we translated perception hijacks into driving outcomes.
Findings: ControlLoc causes severe downstream effects — high vehicle collision rates and unnecessary emergency-stop rates — demonstrating that tracker hijacking can lead to concrete safety hazards in planning/control modules.
[CCS'25] ControlLoc: Physical-World Hijacking Attack on Camera-based Perception in Autonomous Driving
Chen Ma, Ningfei Wang, Zhengyu Zhao, Qian Wang, Qi Alfred Chen, Chao Shen
ACM SIGSAC Conference on Computer and Communications Security (CCS), 2025. (Acceptance rate TBA)
BibTex for citation:
@inproceedings{ma2025controlloc,
title={{ControlLoc: Physical-World Hijacking Attack on Camera-based Perception in Autonomous Driving}},
author={Ma, Chen and Wang, Ningfei and Zhao, Zhengyu and Wang, Qian and Chen, Qi Alfred and Shen, Chao},
booktitle={Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS’25)},
year={2025}
}
Chen Ma, Ph.D. student, Xi'an Jiaotong University
Ningfei Wang, Ph.D. student, University of California, Irvine
Zhengyu Zhao, Professor, Xi'an Jiaotong University
Qian Wang, Professor, Wuhan University
Qi Alfred Chen, Assistant Professor, University of California, Irvine
Chao Shen, Professor, Xi'an Jiaotong University
This research was supported by
National Key Research and Development Program of China (2023YFB3107400);
National Natural Science Foundation of China (U24B20185, T2442014, 62161160337, 62132011, U21B2018);
Shaanxi Province Key Industry Innovation Program (2023-ZDLGY-38, 2021ZDLGY01-02).