Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation

*Paper Accepted by ICCV 2023*

Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Shenghai Yuan, Lihua Xie

[Paper] [Arxiv] [Poster]

In this paper, we explore Multi-Modal Continual Test-Time Adaptation (MM-CTTA) as a new extension of CTTA for 3D semantic segmentation. The key to MM-CTTA is to adaptively attend to the reliable modality while avoiding catastrophic forgetting during continual domain shifts, which is beyond the capability of previous TTA and CTTA methods. To fill this gap, we propose an MM-CTTA method called Continual Cross-Modal Adaptive Clustering (CoMAC) that addresses this task from two perspectives. On one hand, we propose an adaptive dual-stage mechanism that generates reliable cross-modal predictions by attending to the more reliable modality based on the class-wise feature-centroid distance in the latent space. On the other hand, to perform test-time adaptation without catastrophic forgetting, we design class-wise momentum queues that capture confident target features for adaptation while stochastically restoring pseudo-source features to revisit source knowledge. We further introduce two new benchmarks to facilitate future exploration of MM-CTTA. Experimental results show that our method achieves state-of-the-art performance on both benchmarks.
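Conceptually, the two components can be summarized in the heavily simplified PyTorch-style sketch below. All names, shapes, and hyper-parameters (queue length, confidence threshold, restoration probability) are illustrative assumptions rather than the paper's exact implementation, and the queue update here is a plain FIFO instead of the exact momentum update:

```python
import torch

class ClassWiseQueues:
    """Per-class queues of confident target features. Stored pseudo-source
    features are stochastically restored so that source knowledge is revisited
    during continual adaptation (sketch, not the exact implementation)."""

    def __init__(self, num_classes, feat_dim, queue_len=64, p_restore=0.01):
        self.queues = torch.zeros(num_classes, queue_len, feat_dim)
        self.ptr = torch.zeros(num_classes, dtype=torch.long)
        # Pseudo-source features: in practice a snapshot collected with the
        # source-pretrained model before adaptation starts.
        self.pseudo_source = self.queues.clone()
        self.queue_len, self.p_restore = queue_len, p_restore

    @torch.no_grad()
    def update(self, feats, pseudo_labels, confidences, conf_thresh=0.9):
        """Push confident target features into their class-wise queues."""
        keep = confidences > conf_thresh
        for f, c in zip(feats[keep], pseudo_labels[keep]):
            c = int(c)
            self.queues[c, self.ptr[c]] = f
            self.ptr[c] = (self.ptr[c] + 1) % self.queue_len

    @torch.no_grad()
    def stochastic_restore(self):
        """Randomly overwrite a small fraction of slots with pseudo-source features."""
        mask = torch.rand(self.queues.shape[:2]) < self.p_restore
        self.queues[mask] = self.pseudo_source[mask]

    def centroids(self):
        """Class centroids of the queued features, used for distance-based fusion."""
        return self.queues.mean(dim=1)                        # (num_classes, feat_dim)


def modality_weights(feat_2d, feat_3d, centroids, pred_class):
    """Attend to the more reliable modality per point: a smaller feature-to-centroid
    distance in the latent space yields a larger cross-modal fusion weight."""
    d_2d = torch.norm(feat_2d - centroids[pred_class], dim=-1)        # (N,)
    d_3d = torch.norm(feat_3d - centroids[pred_class], dim=-1)        # (N,)
    return torch.softmax(torch.stack([-d_2d, -d_3d], dim=0), dim=0)   # (2, N)
```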

In this paper, we propose two new benchmarks for MM-CTTA: SemanticKITTI-to-Synthia (S-to-S) and SemanticKITTI-to-Waymo (S-to-W). Specifically, S-to-S is a more challenging benchmark with a large domain discrepancy and more diverse environmental conditions, while S-to-W is closer to real-world applications. The data preparation for each benchmark is described below:

1. SemanticKITTI-to-Synthia

a) Before Preprocessing

In S-to-S, we leverage sequences 01, 02, 04, and 05 from Synthia (the CVPR16 SYNTHIA VIDEO SEQUENCES version, which can be downloaded here). To generate the data for S-to-S, first download the raw Synthia data and organize it as shown in Fig. 1.

b) Preprocessing

Subsequently, you can use this preprocessing script to generate the simulated point clouds used in our work and their point-wise GT labels. Simply change "/path/to/Synthia" in the script and run:

python synthia_preprocess.py

The resulting Synthia folder should be organized as shown in Fig. 2.
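For intuition, the simulated point clouds are obtained by back-projecting Synthia's depth maps into 3D with the camera intrinsics and attaching the pixel-wise GT labels. A minimal sketch of that step follows; the intrinsics, depth units, and validity check are assumptions, and the provided script remains the authoritative reference:

```python
import numpy as np

def depth_to_labeled_points(depth_m, sem_label, fx, fy, cx, cy):
    """Back-project a per-pixel depth map (assumed in metres) to a camera-frame
    point cloud and attach the pixel-wise semantic label to every point.
    fx, fy, cx, cy: pinhole intrinsics of the Synthia camera (assumed known)."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    labels = sem_label.reshape(-1)
    valid = points[:, 2] > 0                          # drop pixels without depth
    return points[valid], labels[valid]
```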


Fig. 1 Synthia data folder before preprocessing.

Fig. 2 Synthia data folder after preprocessing.

2. SemanticKITTI-to-Waymo

a) Before Preprocessing

In S-to-W, we leverage both the location and weather descriptions from the Waymo dataset to divide the raw data into different sequences. To generate the data for S-to-W, first download the whole Waymo dataset (you can skip the testing split since it is not utilized in our work) and decompress all downloaded files. Once this is done, your folder organization should match Fig. 3.

b) Split Generation

The next step is to extract the environmental context from the segment descriptions. You can either use our provided script to generate the sequence splits or use our pre-generated sequence text files directly. The generated sequence split files should be stored under the Waymo root directory as shown in Fig. 4.
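If you prefer to regenerate the splits yourself, the environmental context of each segment is stored in its `context.stats` proto (time of day, location, weather), so grouping segments is roughly as follows. This is only a sketch: the split-file naming below does not match our released text files, and the folder layout is an assumption:

```python
import glob
import os
from collections import defaultdict

import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

waymo_root = "/path/to/Waymo"
groups = defaultdict(list)

for record in glob.glob(os.path.join(waymo_root, "*", "*.tfrecord")):
    # Context stats are shared by the whole segment, so one frame is enough.
    data = next(iter(tf.data.TFRecordDataset(record, compression_type="")))
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    stats = frame.context.stats
    key = f"{stats.time_of_day}_{stats.location}_{stats.weather}"
    groups[key].append(frame.context.name)

# Write one split file per (time of day, location, weather) combination.
for key, segments in groups.items():
    with open(os.path.join(waymo_root, f"{key}.txt"), "w") as f:
        f.write("\n".join(segments) + "\n")
```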

c) Preprocessing

Finally, we extract the images and point clouds needed from the original *.tfrecord files for faster I/O. Simply download the preprocessing script, change "/path/to/Waymo" in the script, and run:

python waymo_preprocess.py
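Under the hood, this extraction roughly follows the official Waymo Open Dataset tutorial. A simplified sketch of the per-frame decoding is shown below; it assumes a waymo-open-dataset version whose `parse_range_image_and_camera_projection` returns four values, and the actual script may differ:

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset
from waymo_open_dataset.utils import frame_utils

def extract_frame(data):
    """Decode one tfrecord entry into camera images, LiDAR points, and
    per-point camera-projection coordinates (the basis of "cp_lidar")."""
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))

    # One JPEG image per camera (FRONT, FRONT_LEFT, ...).
    images = {open_dataset.CameraName.Name.Name(img.name):
              tf.image.decode_jpeg(img.image).numpy()
              for img in frame.images}

    # LiDAR range images -> Cartesian points plus camera projections.
    range_images, camera_projections, _, range_image_top_pose = (
        frame_utils.parse_range_image_and_camera_projection(frame))
    points, cp_points = frame_utils.convert_range_image_to_point_cloud(
        frame, range_images, camera_projections, range_image_top_pose)

    return images, points, cp_points
```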

After preprocessing, you should get the resulting data folder shown in Fig. 5. Note that the "cp_lidar" folder contains the point-wise image coordinates in the images captured by the different cameras, which can be directly leveraged for point-pixel correspondences. During data loading, you can easily retrieve all the segment names from each sequence split file given a specific environmental keyword (e.g., "Dawn-Dusk_all-1.txt" for Dawn/Dusk).
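For example, a minimal loading helper along these lines could look as follows; the array layout of "cp_lidar" (an (N, 2) array of (u, v) pixel coordinates) is an assumption here, so adapt the indexing to the actual files:

```python
import os

def load_split(waymo_root, split_file):
    """All segment names for one condition, e.g. "Dawn-Dusk_all-1.txt" for Dawn/Dusk."""
    with open(os.path.join(waymo_root, split_file)) as f:
        return [line.strip() for line in f if line.strip()]

def gather_pixel_predictions(image_logits, cp_lidar):
    """Fetch a 2D prediction for every LiDAR point via the stored point-wise
    image coordinates, assumed here as an (N, 2) array of (u, v) pixels."""
    u, v = cp_lidar[:, 0].astype(int), cp_lidar[:, 1].astype(int)
    return image_logits[v, u]        # (N, num_classes)
```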


Fig. 3 Waymo data folder before preprocessing.

Fig. 4 Waymo data folder after split generation.

Fig. 5 Waymo data folder after preprocessing.