The use of 3D vision algorithms, such as shape reconstruction, remains limited because they require inputs to be at a fixed canonical rotation. Recently, a simple equivariant network, Vector Neurons (VN) (Deng et al., 2021), was proposed that can be easily combined with state-of-the-art 3D neural network (NN) architectures. However, its performance is limited because it is designed to use only three-dimensional features, which is insufficient to capture the details present in 3D data. In this paper, we introduce an equivariant feature representation that maps a 3D point to a high-dimensional feature space. Our feature can discern multiple frequencies present in 3D data, which, as shown by Tancik et al. (2020), is the key to designing an expressive feature for 3D vision tasks. Our representation can be used as an input to VNs, and the results demonstrate that with our feature representation, VN captures more details, overcoming the limitation raised in its original paper.
EGAD meshes reconstructed from the embeddings produced by different OccNet-based models at canonical poses.
We call our representation Frequency-based Equivariant feature Representation (FER). We use it as an input to VN in place of raw 3D points, integrate VN into standard point processing architectures, PointNet and DGCNN, and show that FER-VN-PointNet and FER-VN-DGCNN achieve state-of-the-art performance among equivariant networks in various tasks, including shape classification, part segmentation, normal estimation, point completion, and shape compression. Notably, unlike the standard VN, which performs worse than its non-equivariant counterpart in the point completion and compression tasks at a canonical pose, FER-VN outperforms both by capturing the high frequencies present in the 3D shape, as illustrated in the figures below.
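As a minimal sketch of the underlying principle (not the paper's exact construction): any feature of the form g(||x||)·x rotates with the input, because the norm ||x|| is invariant under rotation. Scaling a point by sinusoids of its norm at several frequencies therefore yields a multi-channel, SO(3)-equivariant feature that a VN-style network can consume. The frequencies and the feature form below are illustrative assumptions:

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix gives an orthogonal matrix.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # ensure det = +1 (a proper rotation)
    return q

def multi_freq_feature(x, freqs=(1.0, 2.0, 4.0, 8.0)):
    """Map a 3D point to one 3D vector per frequency (shape (K, 3)).

    Each channel sin(w * ||x||) * x depends on x only through the
    rotation-invariant norm, so the whole stack is SO(3)-equivariant.
    """
    r = np.linalg.norm(x)
    return np.stack([np.sin(w * r) * x for w in freqs])

rng = np.random.default_rng(0)
x = rng.normal(size=3)
R = random_rotation(rng)

# Equivariance check: rotating the input rotates every channel.
lhs = multi_freq_feature(R @ x)
rhs = multi_freq_feature(x) @ R.T
assert np.allclose(lhs, rhs)
```

The check passes for any rotation R, since ||Rx|| = ||x|| leaves each per-channel scalar unchanged while the vector part rotates.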
Reconstructions of meshes from point cloud inputs across three models: the original OccNet (bottom), VN-OccNet (middle), and our proposed model (top). FER-VN-OccNet outperforms both approaches by utilizing frequency-based features. Unlike Vector Neurons, our method captures details present in cars (wheels, side mirrors) and a chair's legs.
The graph shows the volumetric IoU of OccNet, VN-OccNet, and FER-VN-OccNet across complexity levels in the EGAD training set. We apply rotational augmentation at both training and test time. The right graph shows FER-VN-OccNet's IoU improvement over VN-OccNet.
Qualitative results of point cloud registration. Red is P1 and blue is P2 in the "Distinct sample" setting. VN-EquivReg struggles to distinguish between similar features, such as an airplane's tail and head, or the shade and base of a lamp. In contrast, our method successfully captures finer details, such as the airplane's tailplane and the lamp's cone-shaped shade.
Toy experiment showing the effectiveness of controlling the frequency of the augmented features. A single shape from EGAD is regressed by networks using controlled features with different frequencies. As shown by the reconstructed shape labeled 3 in the figure, the detailed features are smoothed out. When we increase the dimension of the representation by repeating the point coordinates twice, making the dimension 6 (3+3), the reconstruction quality remains poor. This implies that merely making the feature high-dimensional is not useful; additional factors are needed to increase expressiveness.
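The observation above can be illustrated numerically (a toy sketch, not the paper's experiment): duplicating coordinates leaves the feature matrix of a point cloud at rank 3, while scaling each copy by a sinusoid of the norm at a distinct, hypothetical frequency makes the copies vary independently across points:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 3))  # a toy point cloud

# Repeating the coordinates (3+3 = 6 dims) adds no new information:
# the duplicated columns are linearly dependent on the originals.
repeated = np.concatenate([pts, pts], axis=1)
assert np.linalg.matrix_rank(repeated) == 3

# Scaling each copy by a different frequency of the rotation-invariant
# norm makes the two copies vary independently across points.
r = np.linalg.norm(pts, axis=1, keepdims=True)
freq_feat = np.concatenate([np.sin(1.0 * r) * pts,
                            np.sin(4.0 * r) * pts], axis=1)
assert np.linalg.matrix_rank(freq_feat) == 6
```

Because the per-point scaling depends only on ||x||, both variants remain equivariant; only the frequency-scaled one gains expressive power.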
@inproceedings{
son2024an,
title={An Intuitive Multi-Frequency Feature Representation for {SO}(3)-Equivariant Networks},
author={Dongwon Son and Jaehyung Kim and Sanghyeon Son and Beomjoon Kim},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=5JWAOLBxwp}
}
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), (No. 2022-0-00311, Development of Goal-Oriented Reinforcement Learning Techniques for Contact-Rich Robotic Manipulation of Everyday Objects), (No. 2022-0-00612, Geometric and Physical Commonsense Reasoning based Behavior Intelligence for Embodied AI).