Targeted control of fast prototyping through domain-specific interface
Abstract
Reviews and Discussion
This paper proposes an LLM-based approach to translating designers' language into CAD language. The authors first translate the designer's instructions into an intermediate modeling language, and then translate the modeling-language instructions into CAD commands. To achieve this, they propose various sampling methods to generate several samples from the LLM and then pick the sample with the highest score, based on three criteria: (1) point-to-point implementation, (2) hierarchical decomposition, and (3) incompatibility pruning. Finally, the authors show the superiority of their method over one-shot LLM-generation baselines via a user study.
Questions for Authors
- Is there any way to automatically benchmark the effectiveness of your method? I am not familiar with CAD design, but I saw a related work [1] evaluating their method using some metrics, such as point cloud distance to a ground-truth label.
- Can you elaborate on how the objective in Eq. 6 is optimized? Do the authors use heuristics?
[1] Alrashedy et al., Generating CAD Code with Vision-Language Models for 3D Designs 2024
Claims and Evidence
- There are various sampling techniques used in the paper that are not ablated, such as first using MCMC, and then using expectation maximization. These two methods are not completely ablated, and their benefit does not seem very clear to me. To me, it seems that the benefit is coming from taking multiple samples from the LLM, and then ranking the samples, but the baselines are one-shot, and without chain-of-thought. I would suggest the authors consider the following baselines: (1) Taking multiple samples from the LLM, and ranking them based on the criteria above Eq. 6, (2) Few-shot prompting the model to generate the intermediate language and then reranking the responses based on criteria above Eq. 6. This way, the contribution of each part of the method becomes more clear.
Methods and Evaluation Criteria
- The evaluated dataset is more of a user study than a reproducible quantitative benchmark. This makes future comparisons harder, especially given that the authors did not disclose what prompts the users input to the LLM and what responses they got.
Theoretical Claims
NA
Experimental Design and Analysis
- See Claims And Evidence for the missing experiments.
Supplementary Material
I reviewed the appendix.
Relation to Broader Scientific Literature
Facilitating CAD design using LLMs could be an important problem, and the proposed methods, if effective, could potentially be used in other domains.
Missing Essential References
The related work discussion is very concise, and the authors only cite papers rather than positioning their work in comparison to them. Two cited works that seem very relevant to this work but are not discussed are (Wu et al., 2023) and (Yuan et al., 2024).
Other Strengths and Weaknesses
Overall, I find the reranking and sampling method in the paper interesting, but (1) very few baselines are considered, and (2) the evaluation is only based on an unreproducible user study.
Other Comments or Suggestions
- I suggest the authors provide more documentation of their user study, such as the individual prompts and the final figures generated for each method.
Is there any way to automatically benchmark the effectiveness of your method?
Thanks for the comment. We would like to clarify that we aim to directly assess the targetedness of each individual design instruction, e.g., "make the spout narrower". This measure is somewhat subjective, as designers must evaluate whether desired changes were implemented while undesired changes were suppressed.
Our work requires step-by-step assessment of each instruction within a sequence leading to a final product, whereas existing datasets typically evaluate only the final result after multiple instructions. This fundamental difference means that while conventional methods can create ground-truth references in advance, our approach relies on designers' on-the-fly instructions, making it impractical to prepare ground-truth data (e.g., point clouds) beforehand. This is now discussed in the revised manuscript.
Can you elaborate on how the objective in Eq. 6 is optimized?
Thanks for the question. The optimization of Eq. 6 is achieved through iterations alternating construct expansion and feasibility validation. During the construct expansion phase, the heuristics adjust exploration strategies based on the interface’s state and feedback from prior iterations. When diversity is insufficient (e.g., limited variation in designs), the heuristics broaden exploration breadth by prompting the LLM with directives like “generate diverse handle configurations for teapots”. Conversely, if diversity is high, the heuristics narrow exploration breadth by prompting with constraints. When constructs are overly abstract (e.g., “refine shape”), heuristics increase exploration depth by decomposing the constructs into atomic operations. If constructs are excessively granular, the heuristics reduce exploration depth by encapsulating low-level commands into functions (e.g., “smooth contour”). In the feasibility validation phase, heuristics enforce alignment with the modeling engine’s capabilities by pruning constructs incompatible with the CAD engine's constraints. This is now discussed in the revised manuscript.
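To make the alternation concrete, here is a minimal, hedged sketch of such a loop. All helpers (`llm_expand`, `cad_supported`, `estimate_diversity`) and the diversity threshold are hypothetical stand-ins rather than the authors' implementation; the LLM call is replaced by a canned expansion so the sketch runs on its own.

```python
def llm_expand(prompt, constructs):
    # stand-in for an LLM sampling call that proposes new constructs
    return [f"{c} (variant)" for c in constructs]

def cad_supported(construct):
    # stand-in for a CAD engine capability check (incompatibility pruning)
    return "unsupported" not in construct

def estimate_diversity(constructs):
    # crude diversity proxy: ratio of distinct tokens across all constructs
    tokens = [t for c in constructs for t in c.split()]
    return len(set(tokens)) / max(len(tokens), 1)

def optimize_interface(seed_constructs, n_iters=5, low_diversity=0.8):
    constructs = set(seed_constructs)
    for _ in range(n_iters):
        # construct expansion: broaden exploration when diversity is insufficient
        if estimate_diversity(constructs) < low_diversity:
            constructs |= set(llm_expand("generate diverse configurations", constructs))
        # feasibility validation: prune constructs the modeling engine cannot realize
        constructs = {c for c in constructs if cad_supported(c)}
    return constructs

print(optimize_interface({"teapot spout", "teapot handle"}))
```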
There are various sampling techniques used in the paper that are not ablated. I would suggest the authors consider the following baselines:...
Thanks for the suggestion. In selecting our interfaces, we explored several alternative methods and evaluated them using intermediate metrics (soundness, completeness, and granularity alignment) that contribute to final performance. The methods we examined are framed as Single-MCMC (single-scale sampling with ranking) and Multi-MCMC (multi-scale sampling with reranking). Specifically:
(i) Single-MCMC aligns with the first suggested baseline for sampling and ranking constructs within a single chain but lacks multi-chain diversity.
(ii) Multi-MCMC mirrors the second suggested baseline for exploring diverse constructs via parallel chains but omits iterative optimization.
(iii) Our full method (Converged MCMC) extends these by integrating multi-scale sampling and iterative refinement.
Results are available here, validating our design choices. This analysis is now included in the revised manuscript.
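For readers unfamiliar with the three configurations, a hedged sketch of how they differ is given below; `run_chain` and `score` are hypothetical stand-ins (a toy chain and a toy ranking criterion), not the actual Eq. 6 implementation.

```python
import random

def run_chain(seed, steps=3):
    # toy stand-in for one MCMC chain sampling constructs from the LLM
    return [f"{seed}-step{i}-{random.randint(0, 9)}" for i in range(steps)]

def score(candidate):
    # toy stand-in for the Eq. 6 ranking criteria
    return len(candidate)

def single_mcmc(seed):
    # (i) one chain at a single scale, ranked within the chain
    return max(run_chain(seed), key=score)

def multi_mcmc(seeds):
    # (ii) parallel chains at multiple scales, reranked across chains (no iteration)
    return max((c for s in seeds for c in run_chain(s)), key=score)

def converged_mcmc(seeds, rounds=3):
    # (iii) full method: multi-scale sampling plus iterative refinement
    best = None
    for _ in range(rounds):
        best = multi_mcmc(seeds if best is None else [best, *seeds])
    return best

print(single_mcmc("teapot"), multi_mcmc(["teapot", "spout"]), converged_mcmc(["teapot"]))
```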
The related work discussion is very concise.
Due to the space limit, please refer to the response to the first question by reviewer h3kU.
I find the reranking and sampling method in the paper interesting, but (1) very few baselines are considered, and (2) the evaluation is only based on an unreproducible user study.
Thanks for the comment. Regarding the concerns about baselines, we'd like to clarify that our work is not proposing an alternative to LLM-based CAD generators, but rather an interface to improve the performance of those generators. This functions as the first stage of an explicit two-stage approach, with LLM-based CAD generators serving as the second stage. To the best of our knowledge, there are no established baselines for this complete two-stage pipeline. We have incorporated state-of-the-art methods for the second stage.
Concerning reproducibility, we acknowledge that different prompts can yield different results. To address this, we input identical design instructions to all compared pipelines in each design step. While this cannot completely eliminate prompt-dependent variations, it represents our best effort to standardize the experimental protocol for measuring the inherently subjective concept of targetedness, providing a reasonable foundation for comparison. This is now discussed in the revised manuscript.
I suggest the authors provide more documents on their user study, and the final figures generated for each method.
Thanks for the suggestion. The records are available here, which are now included in the revised manuscript.
Thank you for the rebuttal.
Thanks for the suggestion. The records are available here, which are now included in the revised manuscript.
The pdf link is broken.
Optimization of Eq. 6 is achieved through iterations alternating construct expansion and feasibility validation. During the construct expansion phase, the heuristics adjust exploration strategies based on the interface’s state and feedback from prior iterations.
I do not understand how the feedback is generated. Is there a human in the loop? (i.e., how do you measure metrics such as "limited variation in designs", "constructs are overly abstract", etc).
In selecting our interfaces, we explored several alternative methods and evaluated them using intermediate metrics (soundness, completeness, and granularity alignment) that contribute to final performance.
I do not find any formulation of these metrics in the paper. Did you define these metrics somewhere in the paper? Are they automatic, or human-evaluated?
methods we examined are framed as Single-MCMC (single-scale sampling with ranking) and Multi-MCMC (multi-scale sampling with reranking). Specifically:
Why is MCMC necessary in the first place? To me, the improvements seem to arise from "taking multiple samples from the LLM" (i.e., a tree-of-thought-like methodology, see Figure 1 of [1]), rather than MCMC or other sampling methods.
[1] Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Overall, as someone without any CAD knowledge, I find the paper details and rebuttal responses difficult to follow. One suggestion that can make the paper more accessible to the general audience is to add one big figure and show concrete examples (e.g., numeric, or code) of all the variables discussed on page 6.
The pdf link is broken.
Thanks for the comment. We've tested the PDF link across multiple browsers but did not encounter any failures. In case you continue to experience difficulties, we've provided an alternative link that contains identical content.
I do not understand how the feedback is generated.
Thanks for the question. The feedback mechanism operates programmatically during feasibility validation, evaluating constructs through both LLM-as-a-judge analysis and CAD engine constraints [1]. It assesses three feasibility aspects and provides feedback to refine heuristics for subsequent iterations, all with minimal human intervention.
Criteria:
(i) Designer language constructs (e.g., a “ring-shaped teapot body”) cannot be translated into modeling operations. If a construct is unsupported, heuristics such as pruning incompatible geometric primitives are applied and the LLM is prompted to propose alternative base shapes (e.g., torus segments, since CAD engines do not directly support a 'ring'). A high frequency of such cases indicates that the constructs are overly abstract.
(ii) Modeling constructs (e.g., sofa cushions formed by fusing two cylinders) lack equivalent high-level design terms. For missing constructs, heuristics like generating composite-shape directives (e.g., “create cushion from combined cylinders”) are triggered in the next sampling iteration. A high occurrence of these cases suggests insufficient diversity in the designer’s language.
(iii) No overlap between the finest designer constructs and the coarsest modeling operations. For mismatches (e.g., unsupported “material texture” operations), heuristics such as incompatibility pruning are employed, permanently removing non-viable constructs.
[1] Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS, 2023.
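A minimal, runnable sketch of the three feasibility checks above is given below. The helpers `engine_supports` and `has_design_term` stand in for the CAD engine capability check and the LLM-as-a-judge call; the vocabulary and keyword tests are illustrative assumptions only.

```python
SUPPORTED_PRIMITIVES = {"cylinder", "torus segment", "box"}

def engine_supports(construct):
    # stand-in for querying the CAD engine's supported operations
    return any(p in construct for p in SUPPORTED_PRIMITIVES)

def has_design_term(modeling_construct, design_vocab):
    # stand-in for an LLM-as-a-judge check for an equivalent high-level design term
    return any(term in modeling_construct for term in design_vocab)

def validate(designer_constructs, modeling_constructs):
    feedback = {"too_abstract": [], "missing_design_term": [], "pruned": []}
    for c in designer_constructs:
        if not engine_supports(c):                     # case (i): no modeling counterpart
            feedback["too_abstract"].append(c)
    for m in modeling_constructs:
        if not has_design_term(m, designer_constructs):
            feedback["missing_design_term"].append(m)  # case (ii): design-language gap
    # case (iii): granularity mismatches (e.g., "material texture") are pruned permanently
    feedback["pruned"] = [c for c in designer_constructs if "texture" in c]
    return feedback

print(validate(["ring-shaped teapot body", "torus segment spout", "material texture of lid"],
               ["fuse two cylinder solids"]))
```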
I do not find any formulation of these metrics in the paper.
Thanks for the question. These metrics served as intermediate evaluation criteria during our interface development process. While they guided our design decisions, we initially omitted them from the paper due to space constraints.
Definitions:
(i) Soundness: ensuring all language constructs used by designers can be implemented in the modeling process;
(ii) Completeness: ensuring all modeling process constructs are represented in designers' language;
(iii) Granularity alignment: ensuring proper overlap between the finest-grained constructs in designers' language and the coarsest-grained constructs in the modeling process.
These metrics are automatically calculated using the LLM feedback generation methods described in our previous response. Implementation details are available in the codebase in the supplementary materials.
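As one illustration of how such ratios could be computed, the sketch below defines the three metrics over sets of constructs; `implementable`, `expressible`, and `granularity_overlap` are hypothetical predicates that would be backed by the LLM-as-a-judge and CAD engine checks described above.

```python
def interface_metrics(designer_constructs, modeling_constructs,
                      implementable, expressible, granularity_overlap):
    # soundness: fraction of designer constructs realizable in the modeling process
    soundness = sum(map(implementable, designer_constructs)) / len(designer_constructs)
    # completeness: fraction of modeling constructs expressible in designers' language
    completeness = sum(map(expressible, modeling_constructs)) / len(modeling_constructs)
    # granularity alignment: fraction of designer constructs overlapping coarse modeling ops
    alignment = sum(granularity_overlap(c, modeling_constructs)
                    for c in designer_constructs) / len(designer_constructs)
    return {"soundness": soundness, "completeness": completeness,
            "granularity_alignment": alignment}

# toy usage with trivial stand-in predicates
print(interface_metrics(
    ["narrow spout", "curved handle"], ["sweep profile", "fillet edge"],
    implementable=lambda c: True,
    expressible=lambda m: m != "fillet edge",
    granularity_overlap=lambda c, ms: c == "narrow spout"))
```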
Since multiple reviewers have raised this question, we have included descriptions of these metrics in the revised manuscript. We appreciate this question, which has helped us improve our paper.
Why is MCMC necessary in the first place?
Thanks for the question. The fundamental challenge of targeted control in fast prototyping is that while modeling languages are structurally documented, designers' language is unstandardized in the wild. This necessitates a systematic representation of concepts described in designers' language, which aligns with word learning, where humans learn systems of interrelated concepts rather than isolated terms.
Inspired by cognitive development research, we mirror how people learn concept networks through sampling and hierarchical organization via spectral clustering [1]. We chose MCMC as our sampling foundation because it allows adaptive control of both exploration scale and exploitation granularity through its parameters. In our implementation, we substitute environmental sampling with LLM-generated samples, leveraging LLMs as repositories of commonsense knowledge [2]. Given these methodological requirements, MCMC is an intuitive choice, as it is a fundamental formulation for modeling guided stochastic sampling processes in cognitively inspired research.
We acknowledge the conceptual parallels in multi-path exploration between Tree-of-Thoughts (ToT) and MCMC, and agree that integrating ToT-like branching specially designed for LLMs while maintaining MCMC's cognitive plausibility could be a promising future direction. We appreciate the suggestion and will explore such synergies in future work.
[1] Tenenbaum et al. How to grow a mind: Statistics, structure, and abstraction. Science, 2011.
[2] Yildirim et al. From task structures to world models: what do LLMs know? Trends in Cognitive Sciences, 2024.
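The following is an illustrative Metropolis-Hastings loop in which LLM-generated proposals replace environmental sampling, as described above. `llm_propose` and `score` are hypothetical stand-ins (a simple string edit and a length-based objective) so the sketch runs on its own; the temperature plays the role of the exploration/exploitation control.

```python
import math
import random

def llm_propose(construct):
    # stand-in for an LLM proposal: append a plausible design modification
    edits = [" with wider base", " with narrower spout", " with curved handle"]
    return construct + random.choice(edits)

def score(construct):
    # stand-in for the Eq. 6 objective; here it prefers a target description length
    return -abs(len(construct) - 40)

def mcmc_sample(init, temperature=1.0, n_steps=50):
    current, samples = init, []
    for _ in range(n_steps):
        proposal = llm_propose(current)
        # Metropolis acceptance: always accept improvements, sometimes accept worse moves
        accept = min(1.0, math.exp((score(proposal) - score(current)) / temperature))
        if random.random() < accept:
            current = proposal
        samples.append(current)
    return samples

print(mcmc_sample("teapot body")[-1])
```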
One suggestion that can make the paper more accessible to the general audience is to add one big figure and show concrete examples.
Thanks for the comment. The figure is revised following the suggestions.
This paper proposes a systematic procedure that maps human designers' high-level modeling requirements to a domain-specific language that can be executed by software to render modeling prototypes more aligned with human intentions. By recognizing and mitigating the gaps between designers' language and the modeling programming language, the authors propose an LLM-based targeted control method for fast prototyping. Through extensive experiments on fast prototyping involving human subjects, the authors justify their methodology and present a key finding: their pipeline enables precise and effective targeted control of prototype models.
update after rebuttal
The paper is now clearer, as the authors provide a detailed literature review that helps the reviewer better understand the studied topic, and they provide a qualitative analysis of the method's limitations. Therefore, the score is raised to 4.
I have read the comment by Reviewer tc7S and believe the experiments should also provide more reproducible results, apart from the user studies conducted in the paper. Nevertheless, this concern is not raised in my initial comment and will not affect my assessment.
Questions for Authors
None
Claims and Evidence
Yes, the claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria make sense for the problem or application at hand.
Theoretical Claims
Yes. The formulas in Section 3 are correct.
Experimental Design and Analysis
Yes. The experimental details and analysis in Section 4 are solid.
Supplementary Material
Yes. The experimental details in Section B and Limitations in Section D were reviewed.
Relation to Broader Scientific Literature
The key contribution of this paper is a systematic way of transforming human designers' language into a domain-specific programming language for fast prototyping. As I'm not adept at computer-assisted design, I'm not sure whether this paper is related to broader scientific literature.
Missing Essential References
No.
Other Strengths and Weaknesses
Strengths:
- This paper is targeted at the application of large language models in computer-assisted design. This topic is interesting and valuable for the industry.
- The paper proposes an LLM-based fast prototyping method, which is novel and justified by the experiments.
- The paper provides extensive qualitative and quantitative experiments, which are very solid with detailed statistical analysis.
Weaknesses:
- The authors have not provided a Related Work section in the main paper to present an overview of the studied topic.
- Failure case analysis is also not presented to help readers understand the limitations of the proposed method.
Other Comments or Suggestions
The rating can be raised if the concerns are to be addressed.
The authors have not provided a Related Work section in the main paper to present an overview of the studied topic.
Thanks for pointing this out. Now we incorporate a discussion contextualizing our work within the domain of targeted control in fast prototyping.
Fast prototyping is a key process in industrial design, enabling rapid iteration and tangible feedback without the constraints of production-level precision [1,2]. Unlike production-ready modeling carried out by engineers [3], fast prototyping prioritizes exploratory design adjustments, making intuitive and high-level control crucial [4]. Inspired by the concept of targeted therapy in medicine, targeted control in fast prototyping aims to maximize desired modifications while minimizing undesired distortions. This requires an interface that aligns with designers' high-level thinking, focusing on components, structures, and relationships rather than low-level modeling commands [5].
Existing LLM-based CAD generators primarily assist modeling processes rather than fast prototyping, often requiring engineering-level language instructions or sketch instructions. For instance, Query2CAD refines CAD models through natural language queries [6], while Free2CAD translates sketches into CAD commands via sequence-to-sequence learning [7]. Recent advances leverage multi-modal capabilities, such as OpenECAD, which translates 3D design images into structured 2D and 3D commands [8], and CAD-Assistant, which employs tool-augmented vision-language models for general CAD tasks [9]. While these approaches enhance CAD generation and editing, they do not directly address the needs of fast prototyping.
Our proposed interface complements existing LLM-based CAD generators by bridging this gap. Rather than directly altering CAD generation models, it serves as an auxiliary module, enabling designers to exert precise, high-level control in fast prototyping workflows. This discussion is now included in the revised manuscript.
[1] Burns, M. Automated fabrication: improving productivity in manufacturing. Prentice-Hall, Inc., 1993.
[2] Hallgrimsson, B. Prototyping and modelmaking for product design. Laurenceking, 2012.
[3] Barnhill, R. E. Geometry processing for design and manufacturing. SIAM, 1992.
[4] Uusitalo, S. et al. "clay to play with": Generative ai tools in ux and industrial design practice. In ACM DIS, 2024.
[5] Hannah, G. G. Elements of design: Rowena Reed Kostellow and the structure of visual relationships. Princeton Architectural Press, 2002.
[6] Badagabettu A. et al. Query2cad: Generating cad models using natural language queries. arXiv, 2024.
[7] Li C. et al. Free2cad: Parsing freehand drawings into cad commands. ACM Transactions on Graphics, 2022.
[8] Yuan Z. et al. OpenECAD: An efficient visual language model for editable 3D-CAD design. Computers & Graphics, 2024.
[9] Mallis D. et al. CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers? arXiv, 2024.
Failure case analysis is also not presented to help the readers to understand the limitations of the proposed method.
Thanks for pointing this out. We have qualitative results as follows, presented through links: failure cases, which delineate the boundary of our method, and success cases.
Failure cases:
(i) For qualitative spatial constraints like "vertical" or "parallel", LLMs sometimes fail to map them correctly to precise positions and orientations due to their weak spatial reasoning.
(ii) For certain abstract and complex instructions---such as operations involving Bezier curves---current methods sometimes fail to capture the correct approach.
Success cases:
(i) Our-Int effectively maintains and guides the transformation of basic shapes and their modifications, mapping abstract designer instructions to concrete shape changes.
(ii) Our-Int can automatically infer a commonsense spatial distribution of components when designer instructions lack explicit spatial constraints.
(iii) Our-Int translates modifications at a finer granularity, identifying which component and which attributes are affected.
These analyses are now included in the revised manuscript.
This paper addresses the challenge of creating intuitive interfaces for industrial designers to control 3D prototype models using natural language rather than complex modeling commands. The paper seems to identify several gaps between "designers' language" and "modeling languages". For instance, designers may use high-level, semantic language (e.g., "make the spout curve gently"), while modeling languages use low-level, geometric commands. They propose an interface that serves as an intermediate DSL between designers' language and modeling commands. Their evaluation across eight product design domains shows their approach outperforms alternatives in rendering consistency and information clarity.
Questions for Authors
--
Claims and Evidence
- The authors' claim that their interface improves targeted control is supported by comparative evaluations against alternative methods.
- The human study provides reasonable evidence for the practical utility of the interface.
- However, claims about the interface reducing cognitive load for designers aren't directly measured - this is inferred rather than demonstrated.
- The paper claims their approach is generalizable across domains, but only tests on eight product categories which seem relatively similar (all being physical consumer products). I am not sure how this translates outside of this.
Methods and Evaluation Criteria
- The experimental setup using real design tasks with 50 participants is appropriate and relatively robust.
- Using both rendering consistency and information clarity as metrics makes sense for evaluating the interface's practical utility.
- However, something I'd like to see is an evaluation of how the interface affects design iteration speed or the quality of final designs - important metrics for any practical design tool (if that's the true goal, since this is an application paper after all).
Theoretical Claims
The paper doesn't present any formal theoretical proofs as such. The probabilistic hierarchical model appears valid, though I didn't verify all mathematical details.
Experimental Design and Analysis
- The evaluation methodology comparing ranked preferences is reasonable, though somewhat subjective.
- The human study design with 50 participants is adequate, but I question whether 10 iterations per task is sufficient to capture the full design process.
- Would really like to see more qualitative examples to concretize a lot of the failures
Supplementary Material
--
Relation to Broader Scientific Literature
I am not too familiar with the literature close to this space, however, I am reminded of one approach from traditional program synthesis that seems relevant to consider. DreamCoder (Ellis et al., 2021) tackles learning program libraries via hierarchical representations and domain-specific languages. DreamCoder's approach of learning domain-specific languages through Bayesian program induction shares conceptual similarities with how this paper constructs their interface DSL. The key parallels I see:
- Both systems use hierarchical abstraction to bridge between high-level intentions and low-level execution
- Both leverage iterative refinement of domain-specific languages
This paper could benefit from discussing and/or connecting to such work!
Missing Essential References
--
Other Strengths and Weaknesses
Strengths:
- The paper tackles a practical and interesting problem with clear real-world applications
- The approach and experiments are sound and the human evaluation is good to see!
Weaknesses:
- The benchmarking metrics seemed like proxies for what would really matter, e.g., design iteration speed or quality of final designs
- The work is heavily focused on a specific application domain with unclear generalizability to other spaces.
- Limited discussion of computational efficiency and scalability; and lack of qualitative examples / case studies which I think are useful in application papers like these.
Other Comments or Suggestions
--
The benchmarking metrics seemed like proxies for what would really matter?
Thanks for the question. Fast prototyping allows designers to explore brainstormed ideas without elaborating their instructions into modeling engineers' language. Our approach is explicitly two-stage, with our proposed interface serving as the first stage and LLM-based CAD generators as the second. Our metrics were selected to measure specific aspects of the process. Specifically, we use rendering consistency to measure how well the interface captures designers' intentions, essentially evaluating if desired elements appear and undesired ones are suppressed. Meanwhile, information clarity metrics quantify how effectively information transfers between designers' high-level language and modeling engineers' fine-grained requirements. We have clarified the rationale for these measurement approaches in the revised manuscript.
The work is heavily focused on a specific application domain with unclear generalizability to other spaces.
Thanks for the comment. Our main motivation is that fast prototyping allows designers to explore brainstormed ideas without elaborating their instructions into modeling engineers' language; in general, this calls for constructing an interface that bridges designers' language with modeling engineers' language. In a broader context, our work provides a proof-of-concept for improving communication between domain-specific executors (Part-B) and stakeholders (Part-A). We fully agree that further investigation of generalization would be a promising direction. This is now discussed in the revised manuscript.
Limited discussion of computational efficiency and scalability; and lack of case studies useful in application papers.
Would like to see more qualitative examples to concretize a lot of the failures
Thanks for pointing these out. We have validated the computational efficiency by calculating the cost of targeted control across the eight domains.
We use OpenAI’s GPT-4o API for domain adaptation and real-time execution. Designing a domain-specific interface costs about $0.10 for translation and $0.30 per ten refinement iterations for CAD generation.
In the revised manuscript, we add qualitative results on failure cases and success cases. Due to the space limit, please refer to the response to the second question by reviewer h3kU.
However, claims about the interface reducing cognitive load for designers aren't directly measured.
The paper claims their approach is generalizable across domains, but only tests on eight product categories which seem relatively similar.
Thanks for pointing out these confusions. We've revised our wording to emphasize the concrete benefit of "reducing designers' manual efforts" in fast prototyping. Regarding generalizability, we acknowledge our testing was limited to physical consumer products, a key reference for industrial design. To better reflect our scope, we've adjusted "across domains" to "across categories" in the revised manuscript.
The human study design with 50 participants is adequate, but I question whether 10 iterations per task is sufficient to capture the full design process.
Thanks for the question. Fast prototyping differs from full design. While full design requires master models for mass production, fast prototyping serves as an exploration for primary design ideas.
Design studies often use two setups: (i) counting iterations needed to reach certain results, or (ii) evaluating improvements within a fixed iteration count. We adopt the latter to assess the improvement of each iteration with the same instruction, which is therefore relatively invariant to the number of iterations.
The current choice of 10 is informed by discussions with professional industrial designers to capture the typical lifecycle of fast prototyping, and will be investigated further in future work. This is now discussed in the revised manuscript.
This paper could benefit from discussing and/or connecting to such work!
Thanks for the insightful comment. Our automated interface design shares conceptual similarities with program synthesis in achieving hierarchical abstraction through iterative sampling and refinement. However, two key distinctions exist:
(i) Knowledge representation: program synthesis relies on structured knowledge, using exemplar DSL programs to generate higher-level libraries. In contrast, our approach samples from unstructured commonsense knowledge bases (e.g., LLMs) and organizes knowledge into a DSL;
(ii) Machine learning paradigm: program synthesis follows a supervised approach, leveraging I/O pairs with task specifications, whereas our method is unsupervised, emerging from commonsense knowledge bases.
This is now discussed in the revised manuscript.
This paper aims to address the problem of bridging industrial designers’ intuitive language and the precise modeling language of CAD modeling engines for fast prototyping. The authors introduce an interface (a domain-specific intermediate language) that translates designers’ natural-language instructions into modeling commands with sufficient granularity to capture design intent.
The authors propose a Domain Specific Language (DSL) approach that balances abstraction by mapping semantic parts and operations onto lower-level primitives. Results suggest the proposed interface yields better alignment between design intents and the actual modeling outcomes. The authors evaluate on two metrics, consistency and clarity, and show that the DSL outperforms other baselines (e.g., directly prompting LLMs without the interface).
Questions for Authors
I'd be happy to raise my score if the authors could expand on what makes the proposed method work better than the other baselines (beyond anecdotal observations), especially why it's able to beat the LLM-based approach by a significant margin.
Claims and Evidence
The main claim is that there's a gap between the way industrial designers communicate design intent (high-level, domain-specific) and the low-level geometry-driven commands used in modeling engines and that the proposed method can bridge this gap. Experiments show that the proposed method outperforms other baselines.
Methods and Evaluation Criteria
The proposed method automatically generates a DSL via iterative sampling from an LLM (treating the LLM as a store of commonsense domain knowledge). A hierarchical approach maps domain concepts and permissible operations onto the low-level geometry commands of modeling engines. Finally, an MCMC-style procedure refines and validates these DSL constructs against the actual CAD function calls. The method makes intuitive sense but has many moving parts and hyperparameters. It's not immediately clear to me why this method is the obvious approach over all other possibilities (e.g., the use of DPMM).
Evaluation criteria are Rendering Consistency (evaluated by professional or trained designers) and Information Clarity. The evaluation protocol seems quite reasonable to me.
Theoretical Claims
The authors do not propose theoretical claims in this paper.
Experimental Design and Analysis
Strengths
- The conducted experiments are reasonable and the results are significant.
Weaknesses
- Lacks some analysis (ideally quantitative) and ablations on what advantage the proposed method has over the other baselines other than the final performance.
- The proposed method only compared to two other baselines, which seems a little slim to me. For example, the LLM prompting baselines could have very different performance if you prompt it differently.
Supplementary Material
I checked section B.
Relation to Broader Scientific Literature
I find the authors' discussion on Relation to Broader Scientific Literature adequate and comprehensive.
Missing Essential References
N/A
Other Strengths and Weaknesses
See Experimental Designs Or Analyses.
Other Comments or Suggestions
N/A
I'd be happy to raise my score if the authors could expand on what makes the proposed method work better than the other baselines, especially why it's able to beat the LLM-based approach by a significant margin.
It's not immediately clear to me why this method is the obvious approach over all other possibilities.
Thanks for the comment. Our work is not an alternative to LLM-based CAD generators, but rather an interface to improve the performance of LLM-based CAD generators by bridging designers' language with modeling engineers' language. The fundamental challenge is that while modeling languages are hierarchically documented, designers' language is unstandardized in the wild. This necessitates a systematic representation of concepts described in designers' language---from product categories to structures, components, attributes, and operations. This requirement aligns with word learning in cognitive science, where humans learn systems of interrelated concepts rather than isolated terms.
Drawing inspiration from cognitive development, we mirror how people learn concept networks by sampling from the environment and organizing these samples into hierarchical structures through DPMM. This spectral clustering approach captures multi-level attributes rather than clustering based on overall similarity [1]. In our approach, we substitute environmental sampling with LLM-generated samples, as LLMs are recognized repositories of commonsense knowledge [2]. We then apply DPMM to cluster these samples into a hierarchy of structures, components, attributes, and operations. This systematic representation allows natural instructions from designers to be decomposed into fine-grained elements that align with modeling language constructs, enabling targeted control in fast-prototyping.
Therefore, adding our interface to alternative approaches---such as forcing LLMs to directly translate between languages or feeding designers' instructions directly into LLM-based CAD generators---should intuitively improve their performance.
[1] Tenenbaum et al. How to grow a mind: Statistics, structure, and abstraction. Science, 2011.
[2] Yildirim et al. From task structures to world models: what do LLMs know? Trends in Cognitive Sciences, 2024.
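As a hedged illustration of the DPMM clustering step, the sketch below embeds LLM-sampled design terms (random vectors stand in for real text embeddings) and lets a Dirichlet-process mixture decide how many concept clusters are retained. This is one plausible realization under stated assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

terms = ["spout", "narrow spout", "handle", "curved handle",
         "lid", "hinged lid", "body", "rounded body"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(terms), 2))        # placeholder for real text embeddings

# Dirichlet-process mixture: the effective number of clusters is inferred from the data
dpmm = BayesianGaussianMixture(
    n_components=4,                                   # upper bound on active clusters
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(embeddings)

for term, label in zip(terms, dpmm.predict(embeddings)):
    print(f"{term:>14s} -> cluster {label}")
```

In the full pipeline, the inferred clusters would then be organized into the hierarchy of structures, components, attributes, and operations described above.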
Lacks some analysis (ideally quantitative) and ablations on what advantage the proposed method has over the other baselines other than the final performance.
Thanks for the comment. In the revised manuscript, we establish a more comprehensive evaluation beyond the previous final performance, to guarantee the three major qualities of the LLM interface:
(i) Soundness: ensuring all language constructs used by designers can be implemented in the modeling process;
(ii) Completeness: ensuring all modeling process constructs are represented in designers' language;
(iii) Granularity alignment: ensuring proper overlap between the finest-grained constructs in designers' language and the coarsest-grained constructs in the modeling process.
Quantitative results for these metrics are available here, demonstrating the advancement of our interface. This superior systematic representation directly contributes to the improved final performance when integrated with an LLM for CAD. This analysis is now included in the revised manuscript.
The proposed method only compared to two other baselines, which seems a little slim to me. For example, the LLM prompting baselines could have very different performance if you prompt it differently.
Thanks for the comment. We would like to first clarify that our work is not proposing an alternative to LLM-based CAD generators, but rather an interface to improve the performance of LLM generators.
Our approach is explicitly two-stage, with our interface serving as the first stage and LLMs for CAD as the second. To our knowledge, there are no established state-of-the-art baselines for direct comparison of the entire two-stage pipeline. Comprehensive baselines exist for the second stage, so we have adopted the current SOTA there to evaluate the first stage, i.e., our proposed interface.
We fully acknowledge the reviewer's point about prompt sensitivity. Actually, this is also one of our major concerns. To address it, we designed our evaluation protocol to minimize the influence of varying prompts by inputting the same prompt to all compared pipelines for each design step. While this approach cannot completely eliminate prompt variability, it offers a straightforward comparison by controlling this variable as much as possible. Investigating the influence of different prompts represents a promising direction and is now discussed in the revised manuscript.
This paper proposes a novel interface architecture using a Domain-Specific Language (DSL) to bridge the gap between high-level natural language instructions from industrial designers and low-level commands for CAD modeling engines during fast prototyping. The core idea involves using LLMs to generate and refine this intermediate language. The paper received a mixed reception from reviewers, with scores of 4 (Accept), 3 (Weak Accept), 3 (Weak Accept), and 2 (Weak Reject).
Strengths:
Reviewers generally agreed that the paper tackles an important and practical problem in CAD/design (Reviewers TWHa, h3kU). The proposed approach of using an intermediate DSL generated via LLMs was considered novel (h3kU). The inclusion of human studies demonstrating potential utility was also seen positively (ugjZ, TWHa).
Weaknesses & Concerns:
Initial concerns included a lack of clarity on why the method outperforms baselines (ugjZ), limited baseline comparisons and potential prompt sensitivity (ugjZ, tc7S), evaluation metrics being proxies for real-world design speed/quality (TWHa), limited generalizability testing (TWHa), lack of qualitative failure analysis and related work (h3kU), and concerns about the evaluation methodology's reproducibility (user study focus) and ablation of sampling techniques (tc7S). Reviewer tc7S, lacking CAD expertise, also found the paper difficult to follow and questioned the novelty beyond multi-sampling/reranking techniques common in LLMs.
Post Rebuttal & Discussions:
The authors provided detailed rebuttals, clarifying the interface's role (auxiliary module, not replacement), adding related work, qualitative examples (success/failure), details on efficiency, new intermediate evaluation metrics (soundness, completeness, granularity) with ablation results for their MCMC-based approach, and justifying the user study design for fast prototyping. This satisfied Reviewer h3kU (score maintained at 4) but did not convince Reviewer tc7S (score remained 2), who still found the explanations unclear, questioned the MCMC necessity, and remained concerned about reproducibility and the assessment of CAD-specific contributions. Reviewers ugjZ and TWHa acknowledged the rebuttals but did not change their scores (3).
While the paper addresses a relevant application and presents a novel approach with promising initial results (including a positive review from h3kU post-rebuttal), significant concerns remain. Crucially, Reviewer tc7S raises persistent issues regarding methodological clarity, the rigor and reproducibility of the evaluation (heavy reliance on user studies and LLM-as-judge), the impact of including or omitting CoT in the baselines (which seems unfair), and the clear delineation of novelty beyond established LLM sampling techniques. Although the authors provided extensive rebuttals and added content, these did not resolve the core concerns for this reviewer. The other two reviewers remained weakly positive (3). Given the unresolved methodological questions and the split opinions, particularly the strong concerns from an LLM perspective (tc7S), the paper currently sits on the borderline.