LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation

Amirreza Shaban*, JoonHo Lee*, Sanghun Jung*, Xiangyun Meng, Byron Boots

University of Washington

*Equal Contribution

Code release is in progress.


Abstract

We introduce LiDAR-UDA, a novel two-stage self-training-based Unsupervised Domain Adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo labels for target data and refine the predictions by fine-tuning the network on the pseudo labels. These methods suffer from domain shifts caused by different LiDAR sensor configurations in the source and target domains. We propose two techniques to reduce sensor discrepancy and improve pseudo label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits the temporal consistency of consecutive frames to generate more reliable pseudo labels. Our method is simple, generalizable, and does not incur any extra inference cost. We evaluate our method on several public LiDAR datasets and show that it outperforms state-of-the-art methods by more than 3.9% mIoU on average across all scenarios. Code will be available at this https URL.

Method

Overview of the domain adaptation process. We first apply within-frame ensembling with the source model (source network) to generate the pseudo labels. Subsequently, we apply cross-frame ensembling with the LAM module to refine the initially generated pseudo labels. Then, we adapt the student network to the target domain by training it with the refined pseudo labels for a certain number of epochs, and finally, re-generate the pseudo labels from the trained student network. The cross-frame ensembling and adaptation steps are iterated multiple times. 
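The overall procedure is an iterative self-training loop. Below is a minimal sketch, assuming the within-frame ensembling, LAM-based cross-frame refinement, and student fine-tuning are provided as callables; the function names and signatures are illustrative placeholders, not the released code.

```python
from typing import Callable, Sequence

def self_training_loop(
    source_model,
    target_frames: Sequence,
    within_frame_ensemble: Callable,   # step a): pseudo labels from beam-subsampled passes
    cross_frame_ensemble: Callable,    # step b): LAM-based temporal refinement
    train_student: Callable,           # fine-tune the student on refined pseudo labels
    num_rounds: int = 3,
):
    """Iterate pseudo-label generation, refinement, and adaptation on the target domain."""
    teacher = source_model
    for _ in range(num_rounds):
        # 1) Generate initial pseudo labels with within-frame ensembling.
        pseudo_labels = [within_frame_ensemble(teacher, frame) for frame in target_frames]
        # 2) Refine them with cross-frame ensembling (LAM) over consecutive frames.
        pseudo_labels = cross_frame_ensemble(target_frames, pseudo_labels)
        # 3) Adapt the student to the target domain; it becomes the teacher for the next round.
        teacher = train_student(teacher, target_frames, pseudo_labels)
    return teacher
```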

Illustration of our within-frame and cross-frame ensembling modules. All the predictions in the figure are obtained from our nuScenes to SemanticKITTI experiment. 

a) Within-frame ensembling: we randomly select horizontal rows (beams) with probability 1 - min(1, 1/r) and drop all points in those rows to simulate the different beam patterns of the target domain. To obtain more robust predictions, we apply this subsampling several times and average the resulting predictions (see the first sketch below).

b) Cross-frame ensembling: using the predictions obtained in step a), we temporally aggregate the point clouds and their predictions. We then find each point's nearest neighbors within an ϵ-ball and predict their aggregation weights with LAM. Finally, we obtain the refined pseudo labels by taking a weighted average of the neighboring points' pseudo labels (see the second sketch below).
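As a concrete illustration of part a), the sketch below drops random beams and averages per-point class probabilities over several subsampled passes. It assumes `model` returns an (N, num_classes) probability array, `beam_ids` gives each point's horizontal row index, and `r` is the parameter appearing in the drop probability above; these names are illustrative, not the paper's code.

```python
import numpy as np

def within_frame_ensemble(model, points, beam_ids, r, num_passes=4, num_classes=20):
    """Randomly drop LiDAR beams (rows) and average per-point class probabilities
    over several subsampled passes. Points never kept in any pass retain zero
    probabilities and could fall back to a single full-scan prediction if desired."""
    keep_prob = min(1.0, 1.0 / r)                     # a beam survives with prob min(1, 1/r)
    prob_sum = np.zeros((points.shape[0], num_classes))
    counts = np.zeros(points.shape[0])

    for _ in range(num_passes):
        kept = np.random.rand(beam_ids.max() + 1) < keep_prob   # per-beam keep/drop draw
        mask = kept[beam_ids]                                    # points on surviving beams
        if not mask.any():
            continue
        prob_sum[mask] += model(points[mask])                    # (M, num_classes) probabilities
        counts[mask] += 1

    counts = np.maximum(counts, 1)                               # avoid divide-by-zero
    return prob_sum / counts[:, None]                            # averaged soft pseudo labels
```

And for part b), a sketch of the ϵ-ball neighbor search and weighted label averaging. Here uniform weights stand in for LAM's learned aggregation weights, the aggregated points are assumed to already be registered into a common coordinate frame (e.g., via ego-motion), and all names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def cross_frame_ensemble(agg_points, agg_probs, query_points, query_probs,
                         epsilon=0.5, weight_fn=None):
    """Refine each query point's soft pseudo label by a weighted average over
    temporally aggregated neighbors inside an epsilon-ball. `weight_fn` is a
    stand-in for LAM; uniform weights are used when it is not provided."""
    tree = cKDTree(agg_points)
    refined = query_probs.copy()
    for i, p in enumerate(query_points):
        idx = tree.query_ball_point(p, r=epsilon)      # neighbors within the epsilon-ball
        if not idx:
            continue                                   # no neighbors: keep the original label
        neighbor_probs = agg_probs[idx]
        if weight_fn is None:
            weights = np.full(len(idx), 1.0 / len(idx))         # uniform stand-in for LAM
        else:
            weights = np.asarray(weight_fn(p, agg_points[idx]))
            weights = weights / weights.sum()                    # normalize learned weights
        refined[i] = weights @ neighbor_probs                    # weighted average of labels
    return refined
```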
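In practice, replacing the uniform weights with LAM's learned weights is what distinguishes the full method from the uniform-weight baseline compared in the qualitative results below.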

Quantitative Results

Qualitative Results

Visualization of two example frames from the held-out target domain data for the SemanticKITTI-to-nuScenes adaptation scenario. We compare the ground truth against pseudo labels from the base model (single scan), cross-frame ensembling with uniform weights, and LAM. We circle specific points of interest where LAM yields a noticeable improvement over the other methods in segmenting small objects or sparse parts of the scene. Note that unlabeled points are colored black in the ground truth.