LiDAR technology has become indispensable for generating high-fidelity Digital Twins and enabling autonomous robot navigation in dynamic 3D environments. However, the raw point clouds these systems produce are unstructured, noisy, and unordered, and they lack the semantic richness required for high-level decision-making. To transform this raw data into actionable spatial intelligence, Panoptic Segmentation is essential: it uniquely combines the ability to classify background semantics (stuff) with the precision to distinguish individual object instances (things), a challenge that often hinders scalable automation in complex scenes. This project proposes a comprehensive framework for 3D panoptic segmentation, building up from fine-grained geometric understanding to complex scene-level instance parsing.
To validate our geometric deep learning approach before scaling to full scene analysis, we first deployed the Dynamic Graph CNN (DGCNN) architecture on a furniture part decomposition task. Central to this architecture is the EdgeConv operation, which overcomes the limitations of processing points in isolation by dynamically constructing graphs in feature space. For each point, the model identifies its k = 20 nearest neighbors and computes edge features, specifically the relative difference (x_j - x_i) concatenated with the center point x_i, to explicitly encode local geometric structure. This dynamic graph construction is repeated across four hierarchical layers, allowing the model to group points based on learned semantic similarities rather than mere physical proximity. Crucially, the architecture employs a "jumping knowledge" strategy in which the features from all four layers are concatenated and fused with a globally max-pooled signature of the entire shape, ensuring the final segmentation head has simultaneous access to both fine-grained local details and global object context.
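To make the EdgeConv step concrete, the sketch below shows one possible PyTorch implementation of the operation described above. The knn helper, tensor shapes, and MLP widths are illustrative assumptions rather than the project's exact configuration; only the choice of k = 20 and the [x_i, x_j - x_i] edge feature follow the text.

```python
# Minimal EdgeConv sketch (assumed PyTorch implementation, not the project's exact code).
import torch
import torch.nn as nn

def knn(x, k=20):
    # x: (B, N, F) point features; return indices of the k nearest neighbors
    # per point, computed from pairwise distances in the current feature space.
    dist = torch.cdist(x, x)                                   # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[:, :, 1:]   # drop self

class EdgeConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        # Shared MLP applied to every edge feature [x_i, x_j - x_i].
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        # x: (B, N, F). The graph is rebuilt in this feature space at every
        # layer, which is what makes the graph "dynamic".
        B, N, F = x.shape
        idx = knn(x, self.k)                                         # (B, N, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(B, N, N, F), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, F))               # (B, N, k, F)
        center = x.unsqueeze(2).expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)       # (B, N, k, 2F)
        edge = self.mlp(edge.reshape(-1, 2 * F)).reshape(B, N, self.k, -1)
        # Max over the k neighbors yields the new per-point feature.
        return edge.max(dim=2).values
```

Stacking four such layers and concatenating their outputs, together with a global max-pooled vector, reproduces the "jumping knowledge" fusion described above.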
Following the validation of geometric features on object parts, we scaled our approach to full 3D scenes using a query-based 3D segmentation model designed to handle the complexity and scale of scene-level data. Unlike the point-wise classification employed in our preliminary experiments, this model predicts masks over "superpoints"—precomputed over-segments of the scene—which significantly reduces computational overhead while preserving boundary adherence. The workflow begins with Voxelization and Feature Extraction, where the scene is discretized into sparse voxels carrying 6D features (XYZ + RGB). A lightweight backbone, such as a Minkowski sparse-convolution stack or an MLP, processes these voxels into high-dimensional embeddings, which are then pooled via scatter-mean operations into compact superpoint features.
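The voxel-to-superpoint pooling can be written compactly. The following is a minimal sketch, assuming PyTorch; the names voxel_feats and superpoint_ids and the toy linear backbone are hypothetical stand-ins for the sparse-convolution embeddings and the precomputed over-segmentation.

```python
# Scatter-mean pooling of voxel embeddings into superpoint features (illustrative sketch).
import torch

def scatter_mean_pool(voxel_feats, superpoint_ids, num_superpoints):
    # voxel_feats:    (V, D) backbone embeddings, one row per occupied voxel
    # superpoint_ids: (V,)   index of the superpoint each voxel belongs to
    # returns         (S, D) mean-pooled feature per superpoint
    D = voxel_feats.shape[1]
    sums = torch.zeros(num_superpoints, D, device=voxel_feats.device)
    counts = torch.zeros(num_superpoints, 1, device=voxel_feats.device)
    sums.index_add_(0, superpoint_ids, voxel_feats)
    counts.index_add_(0, superpoint_ids,
                      torch.ones(len(superpoint_ids), 1, device=voxel_feats.device))
    return sums / counts.clamp(min=1)

# Example: 6D voxel inputs (XYZ + RGB) lifted to 32-D by a toy stand-in backbone,
# then pooled into 4 superpoints.
voxels = torch.rand(100, 6)                        # XYZ + RGB per voxel
backbone = torch.nn.Linear(6, 32)                  # stand-in for the sparse backbone
superpoint_ids = torch.randint(0, 4, (100,))
superpoint_feats = scatter_mean_pool(backbone(voxels), superpoint_ids, 4)
print(superpoint_feats.shape)                      # torch.Size([4, 32])
```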
The core of the architecture lies in its Transformer Decoder, which treats segmentation as a direct set prediction problem. The model initializes a combined set of learnable queries: K Instance Queries, sampled from superpoint features with learned biases to target potential objects, and C Semantic Queries, representing learned class embeddings. These concatenated queries pass through a 6-layer decoder where they first interact globally via self-attention to resolve context and occlusions, and then attend to the superpoint tokens via cross-attention to extract localized mask information. Finally, the decoded queries branch into two prediction heads: a linear classifier that assigns a semantic class (or "no-object") to each query, and a mask head that generates segmentation masks via a dot product between the query vectors and the superpoint features. This architecture allows the system to effectively "ask questions" about the scene, robustly identifying and isolating individual object instances even in cluttered environments.
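The decoder and its two prediction heads map onto standard Transformer components. The sketch below, assuming PyTorch, mirrors the query structure and six-layer depth from the text; as a simplification, the instance queries are plain learned embeddings rather than being sampled from superpoint features, and the embedding width, head count, and class count are illustrative.

```python
# Query-based decoder with class and mask heads (simplified, assumed PyTorch sketch).
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    def __init__(self, d_model=256, num_classes=20, k_instances=100, n_layers=6):
        super().__init__()
        self.instance_queries = nn.Embedding(k_instances, d_model)   # K instance queries
        self.semantic_queries = nn.Embedding(num_classes, d_model)   # C semantic queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Each layer: self-attention among queries, then cross-attention from
        # queries to the superpoint tokens (the decoder "memory").
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)        # +1 for "no-object"
        self.mask_proj = nn.Linear(d_model, d_model)

    def forward(self, superpoint_feats):
        # superpoint_feats: (B, S, d_model) pooled superpoint tokens
        B = superpoint_feats.shape[0]
        queries = torch.cat([self.instance_queries.weight,
                             self.semantic_queries.weight], dim=0)   # (K+C, D)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, superpoint_feats)            # (B, K+C, D)
        class_logits = self.class_head(decoded)                      # (B, K+C, classes+1)
        # Mask logits: dot product between each decoded query and every superpoint.
        mask_logits = self.mask_proj(decoded) @ superpoint_feats.transpose(1, 2)  # (B, K+C, S)
        return class_logits, mask_logits
```

The dot-product mask head is what lets each query "ask a question" of the scene: a query's mask logits over the S superpoints directly define which parts of the scene it claims.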
The ultimate goal of this research is to unify the fine-grained geometric understanding from Trial 1 with the scalable, query-based instance reasoning from Trial 2 into a holistic Panoptic Segmentation framework. Future work will focus on fusing these architectures to simultaneously deliver semantic labels for background elements (walls, floors) and precise instance masks for foreground objects (machinery, furniture). This will involve refining the query mechanisms to handle "stuff" and "things" jointly and exploring domain-adaptive normalization techniques to ensure the model generalizes robustly across diverse indoor and construction site environments.