Kai-Hsiang Hsieh and Jui-Chiu Chiang
To encode point clouds containing both geometry and attributes, most learning-based compression schemes treat geometry and attribute coding separately, employing distinct encoders and decoders. This not only increases computational complexity but also fails to fully exploit shared features between geometry and attributes. To address this limitation, we propose SEDD-PCC, an end-to-end learning-based framework for lossy point cloud compression that jointly compresses geometry and attributes. SEDD-PCC employs a single encoder to extract shared geometric and attribute features into a unified latent space, followed by dual specialized decoders that sequentially reconstruct geometry and attributes. Additionally, we incorporate knowledge distillation to enhance feature representation learning from a teacher model, further improving coding efficiency. With its simple yet effective design, SEDD-PCC provides an efficient and practical solution for point cloud compression. Comparative evaluations against both rule-based and learning-based methods demonstrate its competitive performance, highlighting SEDD-PCC as a promising AI-driven compression approach.
The proposed SEDD-PCC architecture, illustrated in Fig. 1, consists of a single encoder and dual decoders designed for joint compression of geometry and attributes. Since point cloud attributes are inherently tied to their corresponding geometry, our shared encoder processes the point cloud input as a three-channel 3D voxel grid. The input point cloud is represented as a sparse tensor with coordinates C = {(x_i, y_i, z_i) | i ∈ [0, N−1]} and associated color attributes F = {(R_i, G_i, B_i) | i ∈ [0, N−1]}, where N is the total number of points.
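The sparse-tensor input described above can be illustrated with a minimal NumPy sketch. Note that the actual model builds MinkowskiEngine sparse tensors; the `voxelize` helper below and its averaging of colors that fall into the same voxel are simplifying assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def voxelize(points, colors, voxel_size=1.0):
    """Quantize raw (x, y, z) points to integer voxel coordinates C and
    attach per-voxel RGB features F, averaging the colors of points that
    land in the same voxel. Hypothetical stand-in for the sparse-tensor
    construction done with MinkowskiEngine in the actual pipeline."""
    coords = np.floor(points / voxel_size).astype(np.int32)
    # Deduplicate occupied voxels; `inv` maps each input point to its voxel.
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    inv = inv.ravel()
    counts = np.bincount(inv, minlength=len(uniq)).astype(np.float64)
    # Average the colors of all points mapped to each occupied voxel.
    feats = np.stack(
        [np.bincount(inv, weights=colors[:, c], minlength=len(uniq)) / counts
         for c in range(3)], axis=1)
    return uniq, feats  # C: (M, 3) int coords, F: (M, 3) RGB features
```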
A. Encoding and Decoding
The encoding process begins by voxelizing the input point cloud into a structured sparse tensor with three channels. This voxelized data is then passed through the shared encoder, where sparse convolutions progressively transform it into a compact latent representation z. The thumbnail point cloud geometry Cz is losslessly encoded using the G-PCC octree codec, while the corresponding feature Fz is quantized and entropy encoded. During decoding, the quantized feature F̂z is first processed by the transform module before being passed to the geometry decoder, which reconstructs the point cloud structure. The decoder follows a similar hierarchical approach as the encoder, progressively upscaling and classifying points using the Top-k+1 mechanism. Once the geometry is reconstructed, attribute decoding follows the Sparse-PCAC process to complete point cloud reconstruction.
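The point-classification step in the geometry decoder can be sketched as follows. This keeps the k highest-probability candidate voxels after each upscaling stage; the function name and the way k is obtained are illustrative assumptions and simplify the paper's Top-k+1 mechanism:

```python
import numpy as np

def topk_prune(candidate_coords, occupancy_logits, k):
    """Keep the k candidate voxels with the highest predicted occupancy.
    Simplified stand-in for the Top-k+1 pruning in the geometry decoder;
    in practice k is derived from the transmitted point count."""
    order = np.argsort(-occupancy_logits)  # descending by occupancy score
    return candidate_coords[order[:k]]
```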
B. Training Protocol
Fig. 2: The three-stage training strategy. Stage 1: attribute compression. Stage 2: geometry compression, with a teacher model for knowledge distillation. Stage 3: fine-tuning of all components, initializing EncoderU and DecoderA from Stage 1 and DecoderG from Stage 2.
We adopt a three-stage training approach, as illustrated in Fig. 2. This method consists of an attribute coding stage, a geometry coding stage, and a joint coding stage, each with a dedicated Lagrangian loss L = R + λD, where R denotes the bit rate, D represents the distortion, and λ serves as the trade-off parameter balancing the two terms.
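A minimal sketch of the Lagrangian objective, assuming rate is measured in bits per input point and distortion is any scalar metric (e.g. a binary cross-entropy on occupancy for geometry, or an MSE on color for attributes); the per-stage weighting details are not spelled out here:

```python
def rd_loss(bits, num_points, distortion, lam):
    """Rate-distortion objective L = R + lambda * D.
    `bits`/`num_points` gives the rate R in bits per point; `distortion`
    is a scalar D; `lam` trades off the two terms. Illustrative only."""
    rate = bits / num_points
    return rate + lam * distortion
```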
A. Dataset
For our training dataset, we select ScanNet, as shown in Figure 3, which contains over 1,500 highly detailed 3D indoor scenes. To handle GPU memory constraints during training, we divide the original point cloud data into non-overlapping cubes of 6-bit size (2^6 = 64 voxels) in each dimension. From these partitions, 50,000 cubes are used for training. The implementation is carried out using PyTorch and MinkowskiEngine.
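The cube partitioning above can be sketched as a simple coordinate split; the helper name and the dictionary-based grouping are illustrative assumptions:

```python
import numpy as np

def partition_cubes(coords, bit_depth=6):
    """Split integer voxel coordinates into non-overlapping cubes of side
    2**bit_depth (64 for 6 bits), returning local coordinates per cube,
    keyed by the cube's origin. Sketch of the training-time partitioning."""
    side = 1 << bit_depth
    cube_idx = coords // side
    keys, inv = np.unique(cube_idx, axis=0, return_inverse=True)
    inv = inv.ravel()
    return {tuple((k * side).tolist()): coords[inv == i] - k * side
            for i, k in enumerate(keys)}
```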
For testing, we used two datasets, as shown in Figure 4. The first is 8i Voxelized Full Bodies (8iVFB), including four 10-bit sequences: longdress, loot, redandblack, and soldier. The second is Owlii dynamic human mesh (Owlii), including two 11-bit sequences: basketball_player and dancer.
Fig. 3: Training dataset: ScanNet.
Fig. 4: Testing datasets: 8iVFB and Owlii.
B. Performance Comparison
To comprehensively evaluate the performance of our SEDD-PCC, we compared it against standard MPEG benchmarks, including G-PCC TMC13 v23 Octree-RAHT and V-PCC TMC2 v22, following the MPEG Common Test Condition (CTC). Additionally, we compare our approach with several learning-based techniques for joint point cloud compression, specifically YOGA, DeepPCC, Unicorn, and JPEG Pleno. For a more comprehensive comparison, Table 1 presents the BD-BR (%) of our method and various benchmarks, using G-PCC as the anchor. Notably, our approach achieves substantial bitrate reductions compared to G-PCC, with average savings of 75.0% in D1-PSNR, 32.6% in Y-PSNR, and 33.2% in 1-PCQM. Furthermore, SEDD-PCC demonstrates superior coding performance when compared with JPEG Pleno for the Soldier sequence. Our method also shows a higher BD-rate saving in terms of 1-PCQM when compared to YOGA and DeepPCC, for the 11-bit sequences.
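BD-BR figures such as those in Table 1 follow the standard Bjøntegaard delta-rate calculation. The sketch below uses a single cubic fit of log-rate versus quality over the overlapping quality range; the MPEG CTC tooling uses piecewise-cubic interpolation, so exact values may differ slightly:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%): average bitrate change of the test
    codec versus the anchor at equal quality. Negative values mean the
    test codec saves rate. Simplified single-cubic-fit variant."""
    # Fit log10(rate) as a cubic polynomial of quality for each codec.
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```

For example, a codec whose rate is exactly half the anchor's at every quality point yields a BD-rate of −50%.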
Fig. 5: R-D curves for longdress, loot, redandblack, soldier, basketball player and dancer.
Table 1: BD-rate (%) against G-PCCv23 for various schemes
C. Complexity Evaluation
Table 2 presents the complexity analysis in terms of model size and encoding/decoding time. Due to availability constraints, only a subset of learning-based joint coding schemes is included. Our SEDD-PCC has a model size of 32.6 MB, making it significantly lighter than the other methods, and its encoding and decoding times are among the fastest of the compared schemes. Additionally, it eliminates the need for per-point-cloud bit allocation between geometry and attribute coding, as well as recoloring, thereby reducing overall processing time.
Table 2: Complexity analysis for encoding and decoding sequence "Soldier"