Digital design tools are shifting rapidly from transactional CAD systems to conversational interfaces, culminating in the era of Agentic AI. While these advancements promise reduced friction and high-fidelity generation, they introduce a significant risk: "cognitive offloading". As automation increases, there is a danger that the designer is demoted from a reflective practitioner to a mere orchestrator, disengaging from critical decision-making. This research proposes a fundamental shift in the Human-AI relationship: moving from automated apprenticeship to a "Critical Companion". The goal is to develop an agent with contextual awareness and grounded rationale that engages designers in critical thinking, reconciling the efficiency of Agentic AI with the necessity of human agency.
The research is situated at the intersection of Generative AI and professional design workflows. Current studies reveal that while Agentic AI reduces workflow friction, it often creates a "Black Box" of collaboration where designers struggle to interpret "hallucinated" outputs or translate abstract intent into rigid parameters. This friction leads to frustration and a reliance on default settings rather than active co-creation.
The opportunity lies in defining a new interaction model that supports Metacognitive Engagement. By integrating Socratic behaviors—such as an agent that asks "Why?" or prompts "How does this impact assembly?"—we can force the designer to remain a reflective practitioner. The "Critical Companion" aims to use conversation not just to take orders, but to negotiate design intent, verify engineering plausibility, and build a shared mental model with the user.
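One way to picture this behavior is a "critique before execute" turn, in which the agent pairs each requested operation with a reflection question before committing it. The sketch below is purely illustrative; the prompts, operation names, and function are hypothetical and not a specified implementation.

```python
# Hypothetical sketch of a Socratic "critique before execute" turn.
# Prompt wording and operation keys are illustrative assumptions.
REFLECTION_PROMPTS = {
    "hole": "Why this diameter? How does it impact downstream assembly and fasteners?",
    "fillet": "Is this fillet cosmetic or load-bearing? What radius does the tooling allow?",
}

def socratic_turn(operation: str, parameters: dict) -> str:
    """Return a reflection question to pose before committing a CAD operation."""
    question = REFLECTION_PROMPTS.get(
        operation, "What design intent does this operation serve?"
    )
    return f"Before I apply {operation}({parameters}), consider: {question}"
```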
To address the "knowledge gap" where text-based models fail to grasp geometric nuance, the first major technical prototype involved fine-tuning Vision-Language Models (VLMs). This iteration functioned as a "Visual Translator," designed to map 2D sketches or images directly into probabilistic sequences of CAD operations. The architecture used a LLaVA VLM fine-tuned via Low-Rank Adaptation (LoRA). The model ingested a single-view image (e.g., a plate with holes) alongside a text prompt (e.g., "I need a geometry like this") to generate a predicted CAD sequence, which was then processed by a geometric solver to create the final 3D shape.

To ensure both code accuracy and geometric fidelity, the training process employed a composite loss function: Cross-Entropy Loss to optimize the syntax of the generated code against the ground-truth sequence, and Chamfer Distance to minimize the spatial disparity between the predicted and ground-truth 3D models. While this dual-loss approach allowed the model to generate simple primitive shapes, its reasoning remained shallow. Because the model treated geometry largely as a visual pattern, it lacked deep topological understanding and often failed to produce watertight, manifold geometry for complex objects.
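As a concrete illustration, a minimal PyTorch sketch of such a composite objective is shown below. The weighting factor, tensor shapes, and function names are assumptions for illustration, not the thesis implementation.

```python
# Minimal sketch of a dual-objective loss: cross-entropy on CAD-operation tokens
# plus Chamfer distance between points sampled from the solver output and the
# ground-truth model. Shapes and the lambda_geo weight are assumed values.
import torch
import torch.nn.functional as F

def chamfer_distance(pred_pts: torch.Tensor, gt_pts: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets of shape (B, N, 3) and (B, M, 3)."""
    dists = torch.cdist(pred_pts, gt_pts)                     # (B, N, M) pairwise distances
    return dists.min(dim=2).values.mean() + dists.min(dim=1).values.mean()

def composite_loss(token_logits: torch.Tensor,    # (B, T, vocab) decoder logits
                   gt_tokens: torch.Tensor,       # (B, T) ground-truth token ids
                   pred_pts: torch.Tensor,        # (B, N, 3) points from predicted geometry
                   gt_pts: torch.Tensor,          # (B, M, 3) points from ground-truth geometry
                   lambda_geo: float = 0.5) -> torch.Tensor:
    # Syntax term: how well the generated CAD sequence matches the ground-truth code
    ce = F.cross_entropy(token_logits.transpose(1, 2), gt_tokens, ignore_index=-100)
    # Geometry term: how close the reconstructed shape is to the ground-truth shape
    cd = chamfer_distance(pred_pts, gt_pts)
    return ce + lambda_geo * cd
```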
Building on the limitations of the Visual Translator, the next iteration, "The Contextual Integrator," adopted a multimodal approach to provide a more holistic understanding of 3D space. This model was fed a richer set of inputs: text descriptions, 2D images, and, crucially, 3D point clouds. The point-cloud data was processed through a pre-trained Michelangelo encoder, producing dense 3D embeddings that were fused into the LoRA-fine-tuned LLaVA architecture. This allowed the model to correlate visual features (from the image) with spatial depth data (from the point cloud) before generating the CAD sequence. Like the previous iteration, this model was trained with a combination of Cross-Entropy and Chamfer Distance losses to align the solver-generated output with the ground truth. The Michelangelo encoder significantly enriched the model's topological understanding, enabling superior spatial reasoning. However, despite the enriched input, the model still struggled with complexity, particularly when assembling large, multi-part systems or reasoning about tolerances and load-bearing connections. This highlighted the need for a system that moves beyond probabilistic guessing toward active planning.
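A minimal sketch of how such a fusion step might look is given below, assuming the Michelangelo encoder emits a fixed number of shape embeddings that are projected into the language model's token space and concatenated with the visual tokens. Module names and dimensions are illustrative assumptions, not the actual architecture.

```python
# Sketch of fusing point-cloud embeddings into the VLM token stream, assuming the
# 3D encoder outputs (B, K, d_pc) embeddings and the vision tower outputs (B, V, d_lm)
# tokens. Dimensions and module structure are assumptions for illustration.
import torch
import torch.nn as nn

class PointCloudFusion(nn.Module):
    def __init__(self, d_pc: int = 768, d_lm: int = 4096):
        super().__init__()
        # Small projector mapping 3D shape embeddings into the language-model space,
        # analogous in spirit to LLaVA's vision projector.
        self.proj = nn.Sequential(nn.Linear(d_pc, d_lm), nn.GELU(), nn.Linear(d_lm, d_lm))

    def forward(self, image_tokens: torch.Tensor, pc_embeddings: torch.Tensor) -> torch.Tensor:
        pc_tokens = self.proj(pc_embeddings)                  # (B, K, d_lm)
        # Concatenate image and point-cloud tokens; the fused sequence is prepended
        # to the text prompt before the LoRA-fine-tuned decoder generates the CAD sequence.
        return torch.cat([image_tokens, pc_tokens], dim=1)
```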
The iterative development of these prototypes clarified the architecture required for a true Critical Companion: the VLM must be transformed from a passive imitator into an active planner. The proposed path forward involves integrating Chain-of-Thought (CoT) reasoning so that the model plans its moves explicitly, and using Reinforcement Learning (RL) to provide critical execution feedback. By targeting executable code (CadQuery) rather than raw geometry as the output, the system can ensure high-quality, parametric results. Ultimately, this thesis aims to develop an Orchestrator Layer that binds these elements together, transforming AI from a tool that "looks right" into a collaborative partner that "works right".
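To make the target concrete, the kind of executable, parametric output envisioned here might resemble the following CadQuery script for the earlier "plate with holes" example; all dimensions are assumed values, not results from the system.

```python
# Illustrative CadQuery script for a rectangular plate with four corner holes.
# Dimensions are assumed example values.
import cadquery as cq

length, width, thickness = 80.0, 60.0, 8.0    # plate dimensions (mm)
hole_dia, margin = 6.0, 10.0                  # hole diameter and corner offset (mm)

plate = (
    cq.Workplane("XY")
    .box(length, width, thickness)
    .faces(">Z").workplane()                  # work on the top face
    .rect(length - 2 * margin, width - 2 * margin, forConstruction=True)
    .vertices()                               # the rectangle's corners locate the holes
    .hole(hole_dia)                           # drill through-holes at each corner
)
```

Because the output is a script rather than raw geometry, every dimension remains an editable parameter that the designer, or the Orchestrator Layer, can renegotiate, and the CAD kernel rather than the model is responsible for producing a valid solid.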