Generative AI models are increasingly embedded in professional CAD workflows — planning, reasoning, executing geometry on behalf of the designer. Yet across the current landscape of tools, a consistent failure persists: agents that can talk about structure without understanding it. They place geometry without reasoning about load. They produce outputs that look right but could not stand. The designer is left to compensate — injecting the physical intuition the system lacks, correcting hallucinated cantilevers, manually repairing geometry that passed no physical check before it was delivered. The pipeline is fast. The reasoning is absent.
This is not an edge case. It is the default condition of generative CAD today.
Brick by Brick was built to close that gap from the inside.
The central hypothesis driving this work is precise: anchoring a generative agent in real-world physical logic teaches it to produce feasible geometries with fewer hallucinations — and simultaneously makes its reasoning legible enough to support designer reflection rather than displace it. These two outcomes are not separate. A model that can justify why it places a brick where it does produces outputs a designer can interrogate, challenge, and redirect. Transparency and structural integrity are the same problem approached from two directions.
To test this, the project deliberately constrains the agent's action space to a single operation: placing bricks of varying dimensions in three-dimensional space. This is not a limitation — it is a precision instrument. By narrowing the output space to one action, the model can devote its full token budget to reasoning about that action: the center of mass shift it causes, the support it draws from the layer below, the cascade risk it introduces to everything built above. The structure that emerges is not a demonstration of what the agent can draw. It is a record of what it understands about physical reality.
The dataset underlying this system was adapted from the BrickGPT research corpus — 42,604 samples spanning 21 distinct structural object categories, from tables and chairs to sofas, cars, and birdhouses, all constrained within a 20 × 20 × 20 cm bounding volume. The original dataset was built around plastic interlocking LEGO bricks, with structural stability verified by the Gurobi Optimizer as a single instantaneous force-balance check. Gurobi treats friction as a user-supplied coefficient within static contact constraints — it has no notion of mass, density, or time. Combined with the assumption of secure studded connections, this permitted lightweight geometries to hold physically unrealistic cantilevers. That assumption had to go.
Each structure in the dataset was re-simulated from scratch in PyBullet using time-stepped rigid-body dynamics, replacing static force-balance with an evolving physical sequence. Bricks were reassigned the properties of solid red clay — density 1,900 kg/m³ — replacing snap-fit plastic with heavy, non-interlocking masonry that relies only on gravity, mass distribution, and surface friction. Construction is now tracked as an unfolding sequence rather than a finished state: unsupported overhangs fall, unstable placements cascade, and the model must account for how each brick shifts the center of mass and redistributes load as the structure grows. For every placement, the simulation logged center of mass, support polygon overlaps, overhang percentages, and tipping torques over time. As the dataset distribution shows, the largest categories — tables, chairs, and sofas — average only around 40% stability under these conditions, with only a small fraction of structures exceeding 80%.
From this physics-validated foundation, three distinct dataset variants were constructed to support the curriculum training approach.
Dataset for Stage 1 comprised 50,000 samples pairing abstract text prompts with brick placement sequences. No reasoning was included — the data simply mapped a prompt to a structure, establishing baseline spatial assembly without any physics annotation.
Dataset for Stage 2 reduced to 28,000 samples and expanded the chat template substantially. Each sample now included six structured reasoning sections alongside the brick placements: INFERENCE, which trained the model to translate abstract prompts into concrete geometric goals; BRICK SELECTION, which forced the agent to justify its choices — heavy bricks for load-bearing foundations, small bricks for detailing; PHYSICS CONSIDERATIONS, which required the model to identify potential tipping points and precarious overhangs before execution; AMBIGUITIES, which taught the model to explicitly declare missing information and state working assumptions rather than hallucinate; ALTERNATIVES CONSIDERED, which prompted the agent to weigh different structural strategies and defend its final choice; and REVIEW, a self-evaluation mechanism summarizing overall stability and brick count after the build. The reasoning sections were originally generated using the Qwen 2.5 32B LLM, which interpreted raw coordinate and simulation data into natural language. Generating this dataset required approximately 2.5 days of compute time across a cluster of 8 × NVIDIA H100 GPUs.
Dataset for Stage 3 was further refined to 11,000 samples and introduced two significant upgrades. First, abstract prompts were replaced with rich descriptive inputs — comprehensive shape descriptions from all angles, total layer count, exact brick count, and specific brick composition — generated by the Qwen 3.5 27B VLM interpreting plotted structural images alongside numerical data. Second, an object-part identifier script annotated which bricks belonged to which functional component of the structure, voxelizing the geometry to isolate spatial regions and map individual bricks to their structural roles. The dataset was optimized to fit a 2-day generation window while maintaining full coverage across all object categories.
The model was fine-tuned using a curriculum training approach, incrementally teaching the agent new capacities across three stages. Qwen 2.5 32B Instruct was selected as the execution agent for its ability to generate outputs in strict, predefined formats — emitting both geometry and reasoning in a schema Rhino could parse directly — while fitting within a single H100 GPU's memory. All fine-tuning used 4-bit QLoRA combined with Flash Attention 2 for memory efficiency. Brick placements were encoded in a consistent format — such as <Place 2×6 (0,10,4)> — enabling direct parsing and visualization within the Rhino environment. Training data was converted into strings, tokenized, and concatenated into the Qwen chat-style format pairing user messages with structured output responses.
Stage 1 Training taught the model the foundational spatial grammar of brick assembly using 50,000 samples. No reasoning was present — the model learned exclusively to arrange bricks without overlaps across a wide range of structural typologies. By the end of Stage 1, the agent reliably produced collision-free arrangements and could generalize to prompts outside the training distribution, though with no structural justification for any placement.
Stage 2 Training introduced physics-based reasoning using 28,000 samples with the full six-section structured template. The model was trained to present mechanical challenges before execution, track center-of-mass shifts brick by brick, calculate the exact percentage of load-bearing support from the layer beneath, evaluate cascade risks, and summarize structural stability after the build. This enriched, context-heavy data was tokenized and concatenated in the same fashion as Stage 1, bridging the gap between blind spatial drafting and physical engineering. Because the agent could now differentiate between stable and unstable placements, outputs were mapped to a color gradient in the Rhino interface — white for stable bricks through yellow and orange to red for unsupported placements — making structural integrity immediately legible in the viewport.
Stage 3 Training extended the model's capacity to geometric reasoning using 11,000 richly annotated samples. The agent was trained to identify which specific part of the overall object it was currently constructing — legs, backrest, tabletop, foundation — connecting local brick placements to global form. Descriptive multimodal prompts replaced abstract inputs, and the combination of physics and geometric reasoning enabled the agent to plan accurately against a stated brick budget, negotiate between structural feasibility and prompt fidelity, and generalize to out-of-distribution shapes by constructing from physical principles rather than pattern recall.
Testing across 144 samples — 3 per prompt across 16 prompts per stage — confirmed that physical and geometric reasoning produces measurably different outcomes at every level of analysis.
On structural performance, Stage 1 compensated for the absence of reasoning by over-building, using an average of 177 bricks per structure to maintain stability. Stage 2 broke this pattern: introducing physics reasoning produced structures that were genuinely sound rather than brute-forced, achieving mean physics stability scores of 0.73 at an average of 116 bricks — the same stability, at two-thirds the material cost. Stage 3 pushed further on efficiency, reaching the lowest mean brick counts in the dataset (93 bricks per structure) while maintaining structural integrity, with its score variance reflecting a deliberate commitment to prompt-specific features rather than generic stable masses.
On planning fidelity, Stage 3 won 62% of strategy comparisons against Stage 2 (38%), with the gap widest on harder prompts where Stage 2 tended to overshoot or abandon its declared plan. On visual coherence, Stage 3 led all three stages with a 40% win rate, followed by Stage 1 at 29% and Stage 2 at 10% — a counterintuitive finding that reveals Stage 1's strong visual results as a side effect of over-building, hitting recognizable shapes by stacking enough material rather than by understanding the form. The more important signal is Stage 3's generalization performance: its visual score dropped by only 0.26 between in-distribution and out-of-distribution prompts, against 0.42 for Stage 1, despite being trained on less than a quarter of Stage 1's data. Reasoning, not data volume, drives generalization.
The user study with six expert architectural designers — each with approximately ten years of CAD experience and four to five years with generative tools — surfaced equally clear patterns in how reasoning affected cognitive engagement. With Stage 1, designers spent 54.7% of session time passively waiting, 33.7% inspecting geometry, and 11.6% in active negotiation. No time was spent in attentive evaluation because the model produced no reasoning to read. Stage 2 produced the most active engagement of the three conditions: passive time dropped to 18.9%, inspection rose to 49.7%, and a new mode appeared — attentive evaluation of the model's structural rationale, which accounted for 25.8% of session time. Designers actively read the physics reasoning and verified it against the geometry on screen. Stage 3 partially reversed these gains: passive time climbed back to 43.3% and task completion rates fell from 70% under Stage 1 to 10% under Stage 3. The cause was not mistrust — 95% of users cited generation speed as the primary barrier to adoption, not confidence in the outputs. Stage 3 runs 7.8 times slower than Stage 1, devoting 95% of its output tokens to reasoning rather than placements, and takes 30 to 40 minutes to complete a 100–200 brick structure.
Qualitatively, the moments that mattered most were precisely the moments the system was designed to produce. One designer raised a brick budget after reading the model's structural justification for a hexagonal table. Another reduced structure height to recover slender chair legs after understanding why the model had thickened them. A third paused generation mid-build and pivoted a sofa design after the per-brick geometric reasoning revealed which blocks constituted the backrest. One participant modeled a slanted chair frame directly in Rhino and then asked the model to rebuild it in brick — the closest any session came to genuine fluid co-creation, where the designer delegated part of the work while actively building another part independently.
Brick by Brick demonstrated that physics-informed reasoning changes the nature of the agent's output in ways that go beyond structural metrics. Stage 3 does not just build more stable structures — it builds fewer bricks to do so, follows its own plans more faithfully, and generalizes to shapes it has never seen by reasoning from physical first principles rather than recalling memorized patterns. The hypothesis holds: grounding the model in physical reality produces feasible geometries with fewer hallucinations, and that same grounding makes the agent's logic visible enough for a designer to challenge and redirect it.
What the results also clarify, with equal precision, is where the current limits lie. The same reasoning that makes structures sound and outputs legible is what consumes 95% of the agent's token budget, introduces generation latency of up to 40 minutes per structure, and — beyond a threshold — pushes designers back into passive disengagement not from mistrust but from exhaustion. Moderate reasoning increases engagement. Excessive reasoning reduces it. The system resolved the physics gap and opened a latency gap in its place.
The path forward from here is not to reduce reasoning but to reorganize how it is delivered and when. The per-brick rationale is most valuable not as dense continuous text but as structured, on-demand information surfaced at the moment the designer needs it — which is precisely what the per-brick reasoning panel in the interface began to demonstrate. Combining inference-time optimization with a more legible reasoning interface is what separates a promising research prototype from a practical collaborative tool. The physics is now inside the model. The next problem is time.