ClickFormer
Interactive Point Cloud Segmentation
Q: Why do we need interactive point cloud segmentation?
A: To segment point clouds of diverse scenarios and novel categories.
Q: Can we focus solely on thing categories and disregard stuff categories?
A: Absolutely not; stuff categories are essential components of 3D scenes.
Q: Can existing methods segment stuff categories as well?
A: Hardly.
Q: Why?
A: Because of the scale disparity of instances. The network can't simultaneously handle instances as small as a person and as large as an entire road.
Q: Then how do we address the scale disparity?
A: We propose ClickFormer.
ClickFormer consists of three components: 1) a feature encoder that encodes the input point cloud into voxel features, 2) a query augmentation module that encodes user clicks and adopts a global sampling strategy, and 3) a mask decoder composed of a query-voxel transformer, which allows bidirectional updates between query and voxel embeddings, and a mask segmentation module, which generates the desired mask from the query and voxel embeddings.
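The sketch below renders this three-component pipeline in PyTorch. It is a minimal, illustrative reading of the description above, not the released implementation: the MLP encoder standing in for a sparse-voxel backbone, the module names and dimensions, the (xyz + positive/negative flag) click encoding, and the max-fusion of per-query mask logits are all assumptions.

```python
import torch
import torch.nn as nn

class QueryVoxelLayer(nn.Module):
    """One decoder layer: queries attend to voxel features, then voxel
    features attend back to the queries (the bidirectional update)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, voxels):              # (1, Q, D), (1, V, D)
        queries = queries + self.q2v(queries, voxels, voxels)[0]
        voxels = voxels + self.v2q(voxels, queries, queries)[0]
        return queries, voxels

class ClickFormerSketch(nn.Module):
    def __init__(self, dim: int = 128, num_layers: int = 3):
        super().__init__()
        # 1) feature encoder (an MLP stands in for a sparse-voxel backbone)
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # 2) query augmentation: click queries (xyz + positive/negative flag)
        #    plus augmentation queries sampled globally from the scene
        self.click_embed = nn.Linear(4, dim)
        self.point_embed = nn.Linear(3, dim)
        # 3) mask decoder: query-voxel transformer + mask segmentation head
        self.layers = nn.ModuleList([QueryVoxelLayer(dim) for _ in range(num_layers)])
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, voxel_xyz, click_xyz, click_sign, aug_xyz):
        voxels = self.encoder(voxel_xyz)[None]       # (1, V, D)
        clicks = self.click_embed(torch.cat([click_xyz, click_sign[:, None]], dim=-1))
        queries = torch.cat([clicks, self.point_embed(aug_xyz)])[None]  # (1, Q, D)
        for layer in self.layers:
            queries, voxels = layer(queries, voxels)
        # per-voxel mask logits from query-voxel dot products; fusing them
        # with a max over queries is an assumption of this sketch
        return (self.mask_head(queries[0]) @ voxels[0].T).max(dim=0).values  # (V,)
```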
We sample additional points evenly across the point cloud as augmentation queries. Because these queries are uniformly distributed in space, larger instances involve more queries in generating the desired mask. This keeps the attention scope of each query independent of the scale of the instance to be segmented, eliminating the impact of scale disparity on the interactive segmentation model.
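One simple way to realize such spatially uniform sampling is a voxel-grid subsampling that keeps one representative point per occupied cell, so query density is roughly constant in space. The sketch below takes this reading; the cell size is an illustrative assumption, not a value from the paper.

```python
import torch

def sample_augmentation_queries(points: torch.Tensor, cell: float = 2.0) -> torch.Tensor:
    """Pick one point per occupied cell of a regular 3D grid.

    points: (N, 3) point cloud; returns (M, 3) spatially uniform query positions.
    """
    idx = torch.floor(points / cell).long()          # (N, 3) integer cell indices
    idx = idx - idx.min(dim=0).values                # shift indices to be non-negative
    dims = idx.max(dim=0).values + 1
    key = (idx[:, 0] * dims[1] + idx[:, 1]) * dims[2] + idx[:, 2]  # unique cell id
    order = torch.argsort(key)
    first = torch.ones_like(key, dtype=torch.bool)   # mark first point of each cell
    first[1:] = key[order][1:] != key[order][:-1]
    return points[order[first]]
```

In the architecture sketch above, the returned positions would be passed through point_embed to become augmentation queries alongside the click queries.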
From the attention maps of the queries, it can be observed that although the positive click query focuses on only a small portion of the road, the augmentation queries effectively supplement attention to the remaining parts, specifically the road edges, the middle area, and the inner circle. With the assistance of the query augmentation module, the segmentation model completes the segmentation of this large-scale instance, the road, with only 2 user clicks, reaching 94.57% IoU.
With more queries participating in segmentation, a query that focuses only on its local neighborhood may mistakenly include instances far from the user's desired mask in the foreground, producing false positives. For example, in the bottom row of attention maps, which uses local attention, an augmentation query mistakenly identifies another car as the desired instance, yielding a large number of false positives.
To mitigate this adverse effect, we replace the commonly used local attention with global attention in all attention layers, promoting global information exchange and suppressing false positives. As shown in the top row of attention maps, distant augmentation queries can still correctly focus on the desired instance through global attention, nearly doubling the segmentation accuracy, while also recognizing other instances of the same category.
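The contrast between the two attention modes can be made concrete with a standard attention layer: local attention amounts to global attention plus a mask that blocks voxels outside a radius of each query, and dropping that mask yields global attention. This is a toy sketch; the radius, feature sizes, and random inputs are assumptions.

```python
import torch
import torch.nn as nn

dim, heads, Q, V = 128, 8, 32, 4096
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

q_xyz = torch.rand(Q, 3) * 50                      # toy query coordinates
v_xyz = torch.rand(V, 3) * 50                      # toy voxel coordinates
q_feat = torch.randn(1, Q, dim)                    # query embeddings
v_feat = torch.randn(1, V, dim)                    # voxel embeddings

# Local attention: block voxels farther than `radius` from each query.
radius = 5.0
dist = torch.cdist(q_xyz, v_xyz)                   # (Q, V) pairwise distances
too_far = dist > radius                            # True = not attended
too_far[torch.arange(Q), dist.argmin(dim=1)] = False  # keep >= 1 voxel per query
local_out, _ = attn(q_feat, v_feat, v_feat, attn_mask=too_far)

# Global attention: no mask, so a distant query can still focus on the
# desired instance rather than a nearby look-alike (e.g. the other car).
global_out, _ = attn(q_feat, v_feat, v_feat)
```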
ClickFormer outperforms existing interactive segmentation methods on both indoor and outdoor point cloud datasets.
To demonstrate the application of our method, we annotated a ScanNet scene containing 16 instances (including stuff categories). With an average of only 1.88 clicks and 0.09 s of inference time per instance, ClickFormer achieved 76.08% mIoU.