Kai-Hsiang Hsieh and Jui-Chiu Chiang
Fig. 1: The proposed MEGA-PCC architecture.
Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute bitstreams in inference, which hinders end-to-end optimization and increases system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint compression. The main compression model employs a shared encoder that encodes both geometry and attribute information into a unified latent representation, followed by dual decoders that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud compression.
Geometry and attributes in point clouds are inherently interdependent—attributes are defined at specific 3D coordinates, and their accurate representation relies on the underlying spatial structure. This intrinsic coupling motivates unified compression frameworks that model both modalities jointly, rather than in separate stages.
To address this, we propose MEGA-PCC, a unified and end-to-end trainable framework that simultaneously compresses geometry and attributes, as illustrated in Fig. 1. MEGA-PCC extends Mamba-PCGC, originally designed for geometry-only compression, by incorporating attribute representation into a shared latent space. The input point cloud is voxelized into a sparse 3D grid, where each point is represented by geometric coordinates C = {(xi, yi, zi) | i ∈ [0, N-1]} and attributes F = {(Ri, Gi, Bi) | i ∈ [0, N-1]}, where N is the total number of points. The coordinates and attributes are jointly encoded into a three-channel volumetric input. The unified encoder combines sparse convolutions for local structure modeling with multi-directional SSMs to capture long-range dependencies. Unlike the unidirectional Mamba block of Mamba-PCGC, MEGA-PCC adopts three directional SSM modules (Forward SSM, Backward SSM, and Channel SSM) to model spatial context more comprehensively along different axes.
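To make the voxelized input concrete, the following sketch quantizes floating-point coordinates onto a 2^bit_depth sparse grid and merges points that fall into the same voxel by averaging their colors. This is only an illustration of the preprocessing step; the function names and the averaging rule are our own assumptions, not the paper's implementation.

```python
import numpy as np

def voxelize(points, colors, bit_depth=10):
    """Quantize float coordinates to a 2^bit_depth sparse grid and
    average the colors of points falling into the same voxel."""
    # Normalize coordinates into [0, 2^bit_depth - 1] and round to integers.
    lo, hi = points.min(0), points.max(0)
    scale = (2 ** bit_depth - 1) / (hi - lo).max()
    coords = np.round((points - lo) * scale).astype(np.int64)

    # Deduplicate voxels; points sharing one voxel have their colors averaged.
    keys, inverse = np.unique(coords, axis=0, return_inverse=True)
    counts = np.bincount(inverse, minlength=len(keys))
    feats = np.zeros((len(keys), 3))
    for c in range(3):
        feats[:, c] = np.bincount(inverse, weights=colors[:, c],
                                  minlength=len(keys)) / counts
    return keys, feats
```

The returned integer coordinates correspond to C and the per-voxel RGB features to F in the formulation above.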
During encoding, the input is transformed into a compact latent representation z, consisting of a geometric skeleton Cz and a feature tensor Fz. The structural skeleton is losslessly encoded using G-PCC to preserve occupancy information, while Fz is quantized and entropy encoded using the proposed Mamba-based Entropy Model (MEM). In decoding, the quantized features F̂z are first used by the geometry decoder to reconstruct the point coordinates ĈG, following a coarse-to-fine strategy that includes a Top-k+1 selection mechanism. Subsequently, the reconstructed geometric coordinates ĈG guide the attribute decoder to reconstruct the attributes. This tightly coupled decoding process ensures that geometry and attributes are jointly optimized, resulting in coherent and efficient compression.
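The coarse-to-fine geometry decoding keeps only the most probable candidate voxels at each upsampling stage. The paper's exact Top-k+1 rule is not spelled out here, so the snippet below sketches only the plain top-k part: given per-candidate occupancy scores, retain the k highest-scoring voxels.

```python
import numpy as np

def topk_occupancy(logits, k):
    """Keep the k candidate voxels with the highest occupancy scores
    (a simplified stand-in for the paper's Top-k+1 selection)."""
    idx = np.argsort(logits)[::-1][:k]   # indices of the k largest scores
    mask = np.zeros_like(logits, dtype=bool)
    mask[idx] = True
    return mask
```

At each stage the decoder would upsample the surviving voxels and apply this selection again, so the reconstructed skeleton ĈG sharpens progressively.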
A. Mamba Block in Encoder and Decoder
To enhance spatial feature modeling in the proposed scheme, we introduce the Mamba block, used throughout the encoder and decoder of MEGA-PCC. Its architecture is illustrated in Fig. 2. Since Mamba operates on 1D sequences, we serialize the 3D voxel grid using a Morton scan, as illustrated in Fig. 2 (a). This scan preserves spatial locality during flattening, converting the 3D sparse tensor into a 1D sequence while maintaining the neighborhood structure to the extent possible. The resulting latents are then divided into groups and processed independently by dedicated Mamba modules, enabling parallel computation across groups while allowing autoregressive modeling within each group. However, relying solely on Morton-based serialization may not fully capture complex spatial dependencies.
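The Morton (Z-order) scan interleaves the bits of the x, y, and z coordinates so that voxels close in 3D tend to receive nearby positions in the 1D sequence. A minimal implementation of this serialization, using the standard bit-spreading trick for 10-bit coordinates, could look as follows:

```python
def part1by2(v):
    """Spread the bits of a 10-bit integer so there are two zero bits
    between consecutive bits (helper for 3D Morton codes)."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3d(x, y, z):
    """Interleave x/y/z bits: nearby voxels get nearby codes, so
    sorting by code yields a locality-preserving 1D order."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Sort occupied voxels by Morton code before feeding them to the Mamba blocks.
voxels = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
order = sorted(voxels, key=lambda p: morton3d(*p))
```

Sorting by these codes gives the serialized sequence that is subsequently split into groups for the Mamba modules.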
Inspired by Vision Mamba and Mamba3D, we propose Tri-Mamba in a Mamba block, as illustrated in Fig. 2 (b), which integrates three complementary scanning directions: Forward SSM processes natural sequence order, Backward SSM handles reverse order for comprehensive context, and Channel SSM explores inter-channel relationships. By aggregating the outputs from all three directions into unified representations, Tri-Mamba effectively captures both fine-grained spatial details and global context, which is critical for accurate joint compression. The Mamba block is deployed across three downsampling stages in the encoder, each stage progressively reducing point cloud resolution. To maintain efficiency, we adopt adaptive group sizing: larger blocks are used at higher-density levels, and smaller blocks are used as the resolution decreases. We denote the group sizes as M = [M1, M2, M3] for the three stages, respectively.
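To convey the data flow of Tri-Mamba without reproducing the full selective SSM, the sketch below abstracts each SSM as a simple causal linear recurrence and shows how the forward, backward, and channel-wise scans of a serialized (T, C) feature sequence are fused. The fixed decay and the additive fusion are illustrative assumptions only.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Minimal linear state-space scan: h[t] = decay*h[t-1] + x[t].
    A stand-in for a selective SSM, kept linear for clarity."""
    h = np.zeros(x.shape[-1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def tri_mamba(x):
    """Aggregate forward, backward, and channel-wise scans of a
    serialized feature sequence x of shape (T, C)."""
    fwd = causal_scan(x)               # natural Morton order
    bwd = causal_scan(x[::-1])[::-1]   # reversed order
    chn = causal_scan(x.T).T           # scan along the channel axis
    return fwd + bwd + chn             # simple additive fusion
```

Each output position thus sees context from earlier points, later points, and other feature channels, mirroring the three scanning directions in Fig. 2 (b).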
Fig. 2: (a) Serialization of point clouds into a sequence while preserving spatial proximity between consecutive elements. (b) Tri-Mamba, used in both encoder and decoder for feature extraction, combines forward, backward, and feature-channel scanning to comprehensively capture spatial information and leverage channel-wise information to enrich the feature representation.
Fig. 3: Mamba-based Entropy Model (MEM).
Fig. 4: Training dataset: ScanNet.
Fig. 5: Training dataset: RWTT.
Fig. 6: Testing datasets: 8iVFB and Owlii.
B. Mamba-based Entropy Model
Accurate entropy modeling in point cloud compression remains a major challenge due to the irregular and sparse structure of point clouds. A common baseline is the factorized entropy model, which assumes independence across spatial positions or feature channels. However, such simplification fails to account for the rich spatial and semantic correlations inherent in point cloud data. Recent approaches, such as ANF-PCGC++, improve over the factorized model by incorporating spatial context to capture local dependencies. Nonetheless, they still overlook channel-wise correlations, which are critical for modeling interactions across feature dimensions. To overcome these limitations, we propose the Mamba-based Entropy Model (MEM), shown in Fig. 3. MEM introduces a unified context model that captures both spatial and channel-wise dependencies in the latent features, enhancing the accuracy of probability estimation for entropy coding.
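The role of the context model is to predict, per latent symbol, a distribution that assigns high probability to the value actually coded. Assuming for illustration a conditional Gaussian likelihood (the paper does not spell out MEM's parametric form), the coding cost of a quantized symbol follows from integrating the density over the quantization bin:

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def estimated_bits(y_hat, mu, sigma):
    """Bits needed to code the quantized symbol y_hat, where (mu, sigma)
    would come from the spatial/channel context model."""
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))
```

The better the context model exploits spatial and channel-wise correlations, the closer mu is to the true symbol and the smaller sigma becomes, which directly reduces the estimated (and actual) bitrate.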
A. Dataset
We train our model on two datasets, as shown in Figs. 4 and 5: ScanNet, which contains over 1,500 richly annotated indoor 3D scenes, and RWTT, known for its detailed color and texture information. To accommodate GPU memory constraints, the original point clouds are partitioned into non-overlapping 7-bit cubes (128^3 voxels), from which 15,000 cubes are sampled for training.
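The cube partitioning amounts to splitting the high bits of each voxel coordinate (the cube index) from the low 7 bits (the position inside the cube). A minimal sketch of this preprocessing step, with names of our own choosing:

```python
import numpy as np

CUBE_BITS = 7  # 2^7 = 128 voxels per cube edge

def partition_into_cubes(coords):
    """Group integer voxel coordinates into non-overlapping 128^3 cubes,
    returning each cube's local coordinates keyed by its cube index."""
    cube_ids = coords >> CUBE_BITS            # which cube each point falls in
    local = coords & ((1 << CUBE_BITS) - 1)   # offset inside that cube
    cubes = {}
    for cid, pt in zip(map(tuple, cube_ids), local):
        cubes.setdefault(cid, []).append(pt)
    return {k: np.array(v) for k, v in cubes.items()}
```

Each resulting cube is a self-contained 7-bit point cloud suitable as a training sample.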
For evaluation, we follow the G-PCC Common Test Conditions and feed the entire point cloud without partitioning, as shown in Fig. 6. Testing is conducted on two benchmark datasets: 8i Voxelized Full Bodies (8iVFB), which includes four 10-bit sequences (longdress, loot, redandblack, and soldier), and Owlii dynamic human mesh (Owlii), which contains two 11-bit sequences (basketball_player and dancer).
B. Performance Comparison
To comprehensively assess the effectiveness of the proposed MEGA-PCC, we compare it against the classical standards G-PCCv23 and V-PCCv22, as well as recent learning-based joint compression methods, including YOGA, DeepPCC, Unicorn, and JPEG Pleno VM4.1. The evaluation leverages multiple quality metrics for a comprehensive performance assessment. Table 1 summarizes Bjøntegaard Delta Rate (BD-Rate) results for D1-PSNR, Y-PSNR, and 1-PCQM, with all values reported relative to G-PCCv23. Fig. 7 shows the corresponding rate-distortion (R-D) curves evaluated with 1-PCQM. Experimental results demonstrate that MEGA-PCC substantially outperforms G-PCC across all metrics. Specifically, it achieves average BD-Rate savings of 80.4% in D1-PSNR, 43.6% in Y-PSNR, and 49.6% in 1-PCQM over the G-PCC baseline. Compared to V-PCC, MEGA-PCC delivers a clear advantage in high-bitrate scenarios, as shown in Fig. 7.
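For reference, BD-Rate is conventionally computed by fitting cubic polynomials to the R-D points in the log-rate domain and integrating over the overlapping quality range; a negative value means the test codec needs fewer bits at equal quality. A compact sketch of this standard procedure (not code from the paper):

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjøntegaard Delta Rate: average % bitrate change of the test
    codec vs. the reference at equal quality (negative = savings)."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    # Fit cubic polynomials of log-rate as a function of quality.
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # Integrate both fits over the overlapping quality interval.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```

For example, a test codec that spends exactly half the bits of the reference at every quality point yields a BD-Rate of about -50%.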
Moreover, it maintains competitive performance with JPEG Pleno, YOGA, and DeepPCC, while operating with significantly lower computational cost. As shown in Table 2, these baseline methods involve a computationally expensive recoloring step. In particular, for 11-bit sequences, MEGA-PCC outperforms both YOGA and DeepPCC, which perform worse than G-PCC in this setting. This result highlights MEGA-PCC's stronger generalization capability across bit-depths. Although MEGA-PCC does not surpass Unicorn in rate-distortion performance, its design is far more efficient. Unicorn leverages a complex multi-scale entropy model and a larger network architecture, resulting in higher computational demands and longer runtime. In contrast, MEGA-PCC achieves strong compression performance with a simpler and faster model.
Fig. 7: R-D performance of the proposed scheme in terms of 1-PCQM.
Table. 1: BD-Rate (%) comparison for geometry and attribute distortion relative to G-PCCv23
C. Complexity Evaluation
Table 2 presents the complexity analysis for the sequence soldier, including model size and runtime. Among the learning-based joint compression methods, only JPEG Pleno VM 4.1 provides publicly available inference code, making it the sole baseline for a complete runtime comparison.
Compared to MEGA-PCC, JPEG Pleno VM 4.1 not only has a larger model size but also a significantly longer processing time, partly because certain modules are implemented on CPU. In particular, the recoloring step alone takes 2.18 seconds and is included in the reported encoding time. In contrast, MEGA-PCC features a lightweight design with substantially lower encoding and decoding times. In addition, MEGA-PCC eliminates the recoloring and geometry-attribute model-matching steps during inference, both of which are typically required in other pipelines.
Table. 2: Complexity analysis of model size and runtime on the sequence soldier