PaperHub
6.0/10
Poster · 4 reviewers
Ratings: 3, 4, 5, 3 (min 3, max 5, std 0.8)
Confidence: 3.8
Novelty: 2.5
Quality: 2.5
Clarity: 2.3
Significance: 2.3
NeurIPS 2025

Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

OpenReview · PDF
Submitted: 2025-04-08 · Updated: 2025-10-29
TL;DR

Tri-MARF, a novel tri-modal multi-agent framework, integrates 2D images, text, and 3D point clouds with specialized agents to enhance 3D object annotation, achieving superior accuracy, retrieval, and throughput.

Abstract

Keywords
3D object annotation, Multi-agent systems, Cross-view consistency

Reviews and Discussion

Official Review
Rating: 3

The authors propose a multi-agent framework which takes tri-modal inputs to enhance the 3D annotation process. Extensive experiments on several datasets demonstrate that the proposed method provides annotations with higher quality and faster throughput.

Strengths and Weaknesses

Strengths:

  1. The proposed method is reasonable and addresses the limitations of current 3D object annotations. It may benefit other researchers in the community who need more comprehensive 3D object annotations.
  2. The authors provide thorough experiments on the design choices.

Weakness:

  1. The submission is poorly written, with numerous typos in equations and bad paragraph formatting. For example, in Section 3.2.2, the inline equations in lines 227, 231, and 235 include additional semicolons and commas. The multiple bolded sub-titles in the paragraph starting from line 238 should be divided into multiple paragraphs. The quality of the performance-metric figures in the Appendix could be significantly improved.
  2. The proposed method seems to achieve better results by offering a novel combination of existing techniques with additional inputs. The reviewer acknowledges the better performance but considers the technical contribution limited.
  3. The experiment settings are questionable, as the previous methods do not use a similar VLM (or comparable agents). It is not clear whether previous methods with comparable VLMs could achieve comparable results.
  4. Although the authors provide many evaluation metrics, only textual descriptions are provided in the appendix. Concrete math representations would help readers get a better understanding of the experimental results.

Questions

  1. The authors state that lower Conf(C) values (Eq. (1)) indicate higher confidence (higher token probabilities). It is counterintuitive to define a confidence score for which a lower value indicates higher confidence. Also, the explanation of Eq. (1) is not sufficient, i.e., why does a lower confidence score indicate higher confidence?
  2. Is the $S_{\text{conf},i}$ in line 214 the same as the confidence in Eq. (1)? If so, please use the same notation. Also, is the $w_i$ in line 216 the same as the $w_{v,i}$ in Eq. (3)? It is confusing to use different subscripts for the same variable.
  3. For the balance score $\alpha$ in line 216, does it need to change for different datasets?
  4. The core description extraction in line 250 is confusing, as no intuitive explanation is given for why the authors do so.
  5. To ensure a fair comparison, the baselines and the proposed method should use similar/comparable pre-trained backbones/VLMs/LLMs.
  6. The authors should provide concrete math representations for the metrics. For example, the reviewer could not understand why human annotations are worse than VLM-generated ones for GPT-4o scores (line 313).
  7. Why is the time reported in Table 3 not consistent with the inference time reported in Sec. 4.1 (12k/hour)? What is the difference? How does it compare to baseline methods?

The reviewer may consider increasing the score if the authors could address the questions, or at least provide some clarification.

Limitations

The authors describe the limitations in the paper checklist section, including (1) the lack of more efficient communication strategies to refine decision-making and reduce computational overhead, and (2) the fact that full-scene point clouds present challenges due to the framework's object-centric optimization. The authors do not discuss potential negative societal impact.

Final Justification

The authors have addressed most of my concerns in the rebuttal phase through experiments and clarifications. However, my major concerns with this paper are the limited novelty and contribution (also identified by reviewer aDoZ) and the poor presentation (also identified by reviewers JB3j and aDoZ). Overall, I would like to increase my rating from Reject to Borderline Reject.

Formatting Concerns

No paper formatting concerns.

Author Response

W1. Poor presentation

A1: We have thoroughly revised the format and layout of the entire paper in the revised version. Specifically, we have proofread and corrected all mathematical equations, especially the formatting errors in Section 3.2.2, to ensure that they are clear and accurate. We have restructured the content starting from Page 6, splitting the original paragraphs that mixed multiple topics into independent sub-paragraphs with clearer logic. In addition, all performance charts in the Appendix (such as Figures 7-13, etc.) have been regenerated to provide higher resolution and better readability. To make evaluation indicators clear enough, we have added a dedicated section in the revised appendix to provide clear mathematical definitions and equations for all key evaluation indicators. We believe that these corrections have significantly improved the overall quality of our paper.

W2. Limited technical contribution

A2: The core innovation of our work is to design a multi-agent collaborative architecture. To highlight our contribution, we conduct a dedicated ablation study in Section 7 (Pages 15-16). Table 4 directly compares our Tri-MARF with a strong baseline that uses the exact same VLM (Qwen-2.5 VL) but relies on only a single agent. The results show that our multi-agent collaborative architecture brings a +7.3 percentage point increase in CLIPScore. This decisively proves that the superior performance of our Tri-MARF stems from the tight and effective synergy between the agents.

W3. Unfair experiment settings

A3: No, our experimental settings are fair.

  • To compare against published SOTAs (like Cap3D, ScoreAgg), we follow the original configurations reported in their papers or use their released codes. Forcibly integrating the latest VLMs into these SOTAs is a non-trivial task and might introduce other variables.
  • To demonstrate that our Tri-MARF's performance is not dependent on a specific VLM, we conduct a comprehensive performance and cost evaluation in Figure 6, Section 11.1 (Page 21), where we replace the VLM in our Tri-MARF with several industry-leading models (including GPT-4.5, Claude-3.7, Gemini-Flash-2.0, etc.). The results show that Qwen2.5-VL represents the optimal balance between performance and cost, not that its performance far exceeds other models.

W4: Concrete math representations.

A4: We have addressed this in the revised paper by adding a new subsection adjacent to Appendix C.1 dedicated to providing rigorous mathematical definitions for our core evaluation metrics. For Uncertainty Accuracy (UA), we have introduced a classification-based equation, $UA = \frac{TP_{\text{uncertain}}}{TP_{\text{uncertain}} + FP_{\text{uncertain}}}$, to clearly define its calculation. For Calibration Error (CE), we have explicitly included the standard equation for the Expected Calibration Error (ECE) variant, namely $\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$, detailing the binning process and the calculation of the gap between confidence and accuracy. Finally, for Dynamic Adaptability (DA), we have elevated it from a descriptive proxy ("correlation between... signals and... mechanisms") to a formal statistical measure via the Pearson correlation coefficient, to precisely quantify the relationship between system uncertainty and the activation of deeper reasoning mechanisms such as debate triggering.
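For readers who prefer executable definitions, the following minimal Python sketch shows one way these three metrics could be computed; the variable names, the boolean encoding of "uncertain", and the equal-width binning are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def uncertainty_accuracy(flagged: np.ndarray, truly_uncertain: np.ndarray) -> float:
    """UA = TP_uncertain / (TP_uncertain + FP_uncertain), over boolean arrays."""
    tp = np.sum(flagged & truly_uncertain)
    fp = np.sum(flagged & ~truly_uncertain)
    return float(tp) / max(float(tp + fp), 1.0)

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE = sum_m |B_m|/n * |acc(B_m) - conf(B_m)| with equal-width bins."""
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

def dynamic_adaptability(uncertainty: np.ndarray, debate_triggered: np.ndarray) -> float:
    """DA as the Pearson correlation between uncertainty and debate activation."""
    return float(np.corrcoef(uncertainty, debate_triggered.astype(float))[0, 1])
```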

Q1. In Conf(C), Eq. (1), why do lower values represent higher confidence?

A5: This score is mathematically defined as the Average Negative Log-Likelihood. Our paper explicitly states on Page 5, Line 160, that this confidence score "quantifies the semantic reliability through the average token log-likelihood." Its calculation method is detailed in Eq. (1) (Page 5). Since the generation probability $P(t_i)$ of a token is between 0 and 1, its logarithm, $\log P(t_i)$, is negative. Therefore, taking the absolute value is mathematically equivalent to calculating its Negative Log-Likelihood (NLL), i.e., $-\log P(t_i)$. The higher the probability of an event (i.e., the more "confident" the model is), the lower its negative log-likelihood value. Thus, as our paper directly states on Page 5, Line 166: "a lower Conf(C) value indicates higher confidence (i.e., higher token probabilities)." To resolve this clarity issue, we have renamed it in the revised paper to "Avg. NLL Score".
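For concreteness, a plausible written-out form of this score, consistent with the description above (the paper's exact Eq. (1) may differ in notation), is

$$\text{Conf}(C) = \frac{1}{N}\sum_{i=1}^{N}\bigl|\log P(t_i)\bigr| = -\frac{1}{N}\sum_{i=1}^{N}\log P(t_i), \qquad P(t_i)\in(0,1),$$

so a description whose tokens all have probabilities near 1 yields a value near 0 (high confidence), while low-probability tokens inflate the score.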

Q2. Inconsistent symbols.

A6: We sincerely apologize for this confusion. $S_{\text{conf},i}$ (Line 214) does refer to the confidence score calculated for the i-th candidate description using Eq. (1). Similarly, $w_i$ (Line 216) is indeed the same variable as $w_{v,i}$ defined in Eq. (3), which is the weight calculated based on CLIP. In the revised paper, we have conducted a thorough proofreading of the entire paper to unify all mathematical symbols, e.g., by consistently using $\text{Conf}(C_{v,i})$ and $w_{v,i}$, to ensure the rigor and clarity of our paper.

Q3. Does the balance score $\alpha$ change across datasets?

A7: No. We have conducted a detailed sensitivity analysis of this hyperparameter in Section 11.5 (Figure 8, Page 24) and show that our Tri-MARF performs optimally on the large and diverse Objaverse-XL dataset when $\alpha$ is set to 0.2. In all experiments, we keep $\alpha$ fixed at 0.2 without any re-tuning.

Q4: Line 250 is confusing.

A8: This is indeed a heuristic design, but it is supported by our unique prompting strategy. Our VLM Annotation Agent (Agent 1) employs a structured, multi-turn dialogue strategy, as detailed in Section 3.1 (Page 4). Agent 1 does NOT generate a long description all at once. Instead, Agent 1 responds to "What is this object? What is its specific name?" and further follows up with questions about attributes like color and material. This general-to-specific questioning manner naturally guides the VLM to first summarize the object's core identity in the initial sentence of its response, with subsequent sentences providing detailed elaboration. Therefore, extracting the first sentence as the "core description" is to ensure that the final global description has a clear and concise topic sentence, which is then enriched with details provided by other viewpoints.
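As a purely illustrative sketch of this strategy (only the first question is quoted from the paper; the follow-up prompts and the sentence-splitting rule are assumptions for illustration, not the authors' actual prompt set):

```python
import re

# Hypothetical general-to-specific prompt sequence for one rendered view.
PROMPTS = [
    "What is this object? What is its specific name?",  # quoted from the paper
    "What color is it?",                                 # assumed follow-up
    "What material does it appear to be made of?",       # assumed follow-up
]

def core_description(response: str) -> str:
    """Take the first sentence of the VLM's reply as the core description."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return sentences[0] if sentences else response.strip()

reply = "A white ceramic coffee mug. It has a curved handle and a glossy finish."
print(core_description(reply))  # -> "A white ceramic coffee mug."
```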

Q5. Use similar/comparable pre-trained backbones/VLM/LLMs.

A9: We must clarify that some of the core baseline methods do not use a VLM in their setup. Despite the inherent architectural differences with some baselines, our experimental design is rigorously justified from two perspectives to ensure fairness, and we have added comparisons with other VLM-based methods. Firstly, to demonstrate the independent value of our Tri-MARF, we conduct a crucial ablation study (Table 4, Supplementary) comparing the full framework against a single-agent baseline using the same VLM. The results show that our Tri-MARF provides a +7.3 percentage point increase in CLIPScore, proving that the performance gain comes primarily from the synergistic mechanism, not the VLM. Secondly, to demonstrate the fairness of our VLM selection, we have evaluated multiple industry-leading VLMs (Figure 6, Supplementary).

Q6. Why are human annotations worse?

A10: This is because the two inputs are fundamentally different (Section 4.2), i.e., the human answers are "multiple-choice," while our Tri-MARF's output is "free-text." Specifically, this experiment evaluates the accuracy of type annotation, and annotators choose from a pre-set, limited list of options. If the most accurate answer is not in the list, they have to select a semantically broader or not perfectly fitting approximate answer. In contrast, our Tri-MARF generates a rich and detailed natural language description. GPT-4o, acting as the judge, has powerful semantic understanding capabilities and can recognize synonyms or more precise hyponyms. For example, when the ground truth is "cup," the human's options might only include the general term "container," but our Tri-MARF might generate "a white ceramic coffee mug." GPT-4o can accurately determine that "coffee mug" is semantically much closer to "cup" than "container" is, thus giving our Tri-MARF a higher score. Therefore, this result precisely validates the precision and richness of the descriptions generated by our Tri-MARF.

We have revised Section 4.2 to more clearly explain the evaluation mechanism of this scoring task, explicitly pointing out the difference between "multiple-choice" and "free-text" input formats and their impact on the final score.

Q7: Table 3 is not consistent.

A11: This discrepancy arises from measuring two different performance metrics: Latency versus Throughput. Specifically, Table 3 (Page 15) reports the end-to-end latency of 7-19 seconds for processing a single object on a single GPU. This time is predominantly driven by the serial network latency of remote API calls to the VLM. In contrast, the 12k objects/hour claimed in Section 4.1 (Page 7) is the aggregate throughput of our entire annotation process. As mentioned on Page 15, we achieve extremely high processing efficiency at a macro level by parallelizing thousands of independent API call requests across multiple machines and GPUs in our large-scale annotation process. This throughput metric is also the standard for efficiency comparison with baseline methods, as shown in the "obj/PGCHour" column of Table 1. Ours (12k/hour) is significantly superior to methods like Cap3D (8k/hour) and ScoreAgg (9k/hour).

Limitations.

A12: We have expanded the limitations discussion in the revised paper by adding a more in-depth treatment of potential negative societal impacts. Specifically, we explicitly state that while our Tri-MARF is designed to provide high-quality annotations for beneficial fields like autonomous driving and robotics, it could potentially be misused to generate misleading or biased descriptions of real-world scenes or objects.

Comment

Dear Reviewer FNrm

Thank you very much for your insightful and valuable comments! We have carefully prepared the above responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Sincerely Authors

Comment

Thanks to the authors for the detailed rebuttal. Here are some follow-up questions that need further clarification:

  1. W3 - The reviewer acknowledges that integrating other VLMs into existing works is not trivial. However, the provided ablation in Fig. 6 does not address the reviewer's concern that the previous methods do not use a similar VLM. The authors should at least provide results using the same VLM as prior work (for example, BLIP2 in Cap3D) to clearly show that the gained performance comes from the multi-agent design, not from the advancement of VLM models.
  2. Q3 - The authors do not answer the reviewer's question. The authors clarify that the balance score $\alpha$ does not change, but do not explain why it does not need to change across different datasets and settings, nor do they provide any experimental results.
  3. Q7 - The authors should provide more details on how the speed experiments in Tab. 1 are conducted. Also, it is unclear why the proposed method achieves faster throughput than the baselines even though the proposed pipeline incorporates multiple agents. Is it due to the faster throughput of the existing VLM or due to some model design?
Comment

Dear Reviewer FNrm,

Thank you very much for your detailed follow-up questions and your continued engagement with our work. We hope that our following response could address your concerns. If you have any further questions about our work, please feel free to contact us during the discussion period.


Regarding W3: Fair VLM Comparison with Identical Backbones

We fully agree that isolating the gains from our multi-agent architecture versus gains from a more advanced VLM is critical. Your suggestion to test Tri-MARF with the same VLM as a baseline is an excellent one. To address this, we conducted a new ablation study where we replaced our default VLM (Qwen2.5-VL) with BLIP-2, the backbone used in the Cap3D baseline. This creates a direct, apples-to-apples comparison of the annotation frameworks themselves.

The results, evaluated on the Objaverse-LVIS dataset, are presented in Table S1.

Table S1: Performance comparison on Objaverse-LVIS using an identical VLM backbone (BLIP-2). Our multi-agent framework provides a significant performance uplift over the baseline, even when using the same VLM. The original performance of Tri-MARF with its default VLM is included for reference.

| Method | VLM Backbone | CLIPScore ↑ | ViLT R@5 (I2T/T2I) ↑ |
| --- | --- | --- | --- |
| Cap3D (Baseline) | BLIP-2 | 78.6 | 35.2 / 33.4 |
| Tri-MARF (Ours) | BLIP-2 | 83.1 | 39.5 / 37.8 |
| Tri-MARF (Ours, Default) | Qwen2.5-VL | 88.7 | 45.2 / 43.8 |

As shown in Table S1, when both frameworks use BLIP-2, our Tri-MARF achieves a CLIPScore of 83.1, a significant +4.5 point improvement over Cap3D's 78.6. This demonstrates that the performance gain is not merely a product of using a newer VLM. Instead, it is substantially driven by the core architectural innovations of Tri-MARF: the structured multi-turn prompting of Agent 1, the intelligent information aggregation and MAB-based selection of Agent 2, and the geometric-semantic consistency check from Agent 3. These agents work in synergy to produce more complete, accurate, and coherent annotations, directly validating the contribution of our multi-agent design.


Regarding Q3: Generalization of the Balance Score $\alpha$

We thank the reviewer for asking us to elaborate on why the balance score $\alpha=0.2$ generalizes well without dataset-specific tuning.

The robustness of $\alpha$ stems from the fundamental nature of the two signals it balances. The score $s_i$ is a weighted sum: $s_i = (1-\alpha) \cdot \text{Conf}(C_{v,i}) + \alpha \cdot w_{v,i}$.

  • VLM Confidence ($\text{Conf}(C)$): This measures the internal semantic coherence of a generated description. It is modality-agnostic and reflects the language model's certainty in its textual output.
  • CLIP Weight ($w$): This measures the cross-modal alignment between the visual input (image) and the generated text.

The trade-off between textual coherence and visual grounding is a general problem in vision-language tasks. Our empirically determined value of $\alpha=0.2$ assigns a primary weight (80%) to the VLM's confidence, while using visual alignment as a crucial (20%) corrective signal. This principle is not dependent on the specific content (e.g., "chairs" vs. "cars") but on the nature of the generative and alignment models themselves.
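As a purely hypothetical numeric substitution (the values below are illustrative, not taken from the paper), with the fixed $\alpha=0.2$:

$$s_i = 0.8 \cdot \text{Conf}(C_{v,i}) + 0.2 \cdot w_{v,i} = 0.8 \cdot 0.35 + 0.2 \cdot 0.80 = 0.44,$$

so the VLM confidence term dominates the score, while the CLIP alignment term mainly re-ranks candidates whose confidence values are close.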

To justify our argument, we have performed an additional cross-dataset sensitivity analysis, testing our fixed $\alpha=0.2$ against slightly perturbed values on three diverse datasets. The results are shown in Table S2.

Table S2: Cross-dataset sensitivity analysis for the balance hyperparameter $\alpha$. The fixed value of $\alpha=0.2$ consistently provides optimal or near-optimal performance across all datasets, validating its robustness.

| Dataset | CLIPScore ($\alpha=0.1$) | CLIPScore ($\alpha=0.2$, fixed) | CLIPScore ($\alpha=0.3$) |
| --- | --- | --- | --- |
| Objaverse-LVIS | 88.2 | 88.7 | 88.5 |
| ABO | 81.9 | 82.3 | 82.1 |
| ShapeNet-Core | 82.9 | 83.2 | 83.0 |

The results in Table S2 confirm that $\alpha=0.2$ is a robust choice, consistently delivering the best performance across datasets with varied distributions. This validates our approach of fixing the hyperparameter, thereby enhancing the model's generalization and practicality.


Comment

Regarding Q7: Clarification of Throughput Measurement

Experimental Details: We have measured annotation throughput (objects/second) on an NVIDIA A100 GPU with a batch size of 1, using 1,000 objects sampled from Objaverse-LVIS, averaging end-to-end inference time over 10 independent runs. The detailed time breakdown for our Tri-MARF pipeline is data preparation (0.073s), VLM annotation (6.18s via Qwen2.5-VL API), aggregation (0.06s), and gating (0.15s), resulting in ~0.14 obj/s throughput. Baselines were evaluated under identical conditions, with human annotation speed derived from crowdsourced statistics (see Supplementary Section 12).
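An illustrative back-of-envelope check using only the figures above (this arithmetic is not taken from the paper):

$$t_{\text{obj}} \approx 0.073 + 6.18 + 0.06 + 0.15 \approx 6.46\ \text{s} \;\Rightarrow\; \approx 0.15\ \text{obj/s per worker},$$

$$12{,}000\ \text{obj/h} \approx 3.33\ \text{obj/s} \;\Rightarrow\; 3.33 / 0.15 \approx 22\ \text{concurrent workers},$$

which is consistent with the quoted ~0.14 obj/s single-worker figure (allowing for residual overhead) and with the earlier description of parallelizing API calls across multiple machines and GPUs.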

Why Faster?: While the efficiency of the Qwen2.5-VL API contributes to the overall pipeline, the superior throughput of our Tri-MARF primarily originates from our novel multi-agent design, i.e., particularly our lightweight yet highly adaptive Multi-Armed Bandit (MAB, specifically UCB) aggregation agent (detailed in Section 3.2.2 and Section 8 in Appendix). In contrast to baseline methods like Cap3D, which rely on computationally expensive 3D-to-2D projections, or ScoreAgg, which performs exhaustive evaluation of candidate annotations, our MAB mechanism dynamically optimizes annotation selection via rapid exploration-exploitation strategies. By adaptively focusing computational resources on high-quality annotations and employing dynamic gating (with threshold α=0.557), we reduce approximately 15–20% of redundant processing time. This adaptive design not only enhances throughput but also significantly improves robustness under challenging conditions, as evidenced by a minimal ~2.3% CLIPScore reduction under occlusion scenarios (Section 13 in Appendix).
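To make the UCB selection step concrete, here is a minimal, self-contained Python sketch of UCB1-style candidate selection in the spirit described above; the arm names and the reward signal (a placeholder for a combined confidence/CLIP quality score) are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def ucb1_select(reward_sums, counts, t, c=1.0):
    """Pick the arm maximizing mean_reward_a + c * sqrt(2 * ln t / n_a)."""
    best_arm, best_score = None, float("-inf")
    for arm, n in counts.items():
        if n == 0:                       # force one pull of every untried arm
            return arm
        mean = reward_sums[arm] / n
        bonus = c * math.sqrt(2.0 * math.log(t) / n)
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm

# Arms = candidate descriptions; reward = a proxy quality signal (random here).
arms = ["cand_0", "cand_1", "cand_2"]
reward_sums = {a: 0.0 for a in arms}
counts = {a: 0 for a in arms}
for t in range(1, 201):
    arm = ucb1_select(reward_sums, counts, t)
    reward = random.random()             # placeholder for the real scoring signal
    reward_sums[arm] += reward
    counts[arm] += 1
```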

Comment

Thanks for the additional results and clarification. The reviewer does not have further questions at this stage.

Comment

Dear Reviewer FNrm,

Thank you again for your time and for outlining clear criteria for a potential score revision. In your earlier feedback you noted that the reviewer may consider increasing the score if the authors could address the questions, or at least provide some clarification.

We have now:

  1. Re-evaluated our framework with the identical BLIP-2 backbone used in Cap3D (Table S1), isolating the effect of the multi-agent design.

  2. Added a cross-dataset sensitivity study for the balance parameter $\alpha$ (Table S2).

  3. Provided a detailed breakdown of our throughput measurements and the role of the MAB aggregator (Suppl. §12).

We hope these additions fully satisfy the clarifications you requested. May we kindly ask whether they meet the bar you outlined for a possible score adjustment? Any further suggestions are of course most welcome.

Sincerely,

The Authors

Comment

Dear Reviewer FNrm,

Thank you very much for your detailed follow-up questions and for your continued engagement with our work. Your feedback is invaluable in helping us strengthen our paper.

We have carefully considered your points regarding the fairness of the VLM comparison (W3), the generalization of the balance score hyperparameter $\alpha$ (Q3), and the clarification of the throughput measurements (Q7).

To address these concerns as thoroughly as possible, we have already initiated a new set of experiments. We believe that empirical results will provide the clearest answers to your questions. We are currently running these experiments and will post the updated results and our corresponding analysis here as soon as they are available.

Thank you again for your constructive guidance.

Sincerely, The Authors

Official Review
Rating: 4

Tri-MARF introduces a multi-agent framework for captioning 3D models. By using prompting strategies, embedding clustering, score filtering, visual-text alignment, reinforcement learning, and geometrical gating, they obtain consequential gains over previous methods. They show strong preference over human-generated captions and state-of-the-art results in retrieval and CLIP scores.

Strengths and Weaknesses

Quality: The quality of the paper is decent. The provided figures are understandable, and the written prose in most places is fairly composed. Overall, the motivation is mostly clear, although the literature gaps in previous works are not directly addressed in the evaluations. For example, “overlooking geometric information (from previous works)” is mentioned as a pain point of previous methods, but the main paper does not directly address this via ablations on the gating agent. Related work fairly covers seminal works, but some recent works are skipped.

Clarity: Although the motivation and the consequent design choices are clear, the methodology is lacking in clarity. Some inline equations are not well set up, variables are inconsistent, and some important details are only brushed over.

Significance: Using a novel reinforcement learning framework adds to the significance of this method, which treats FM and LLM agents as part of a more elaborate captioning system; this is an interesting approach that previous methods have not truly utilized.

Originality: The methodological novelty as mentioned previously and some new evaluation schemes (Type Annotation Experiment) adds to the originality of the method.

Questions

--Questions--

  1. Section 3.1 line 145 refers to prompting by explicitly orienting the model to the current view-point direction. I have two concerns here: 1) To be able to render a 3D object from the “front”, one needs to assume standardized/canonical poses. Is this assumed for the 3D models? If not, how is it determined whether a view is “front, back…”? 2) Are there mechanisms to handle symmetric or cylindrical objects where views may be redundant, or is this case considered an experimental outlier?

  2. For Core Description Extraction, it seems like a lot of faith is put into the LLM here. Is there any experimentation which suggests that LLMs are predisposed to producing core descriptions as the first sentence of an output sequence? Is this by any chance an induced behaviour, for example via in-context learning?

  3. Given that CLIP is used in Relevance Weighting (3.2.1) and hence the final score which dictates textual relevance, could you justify the use of CLIPScore to measure performance?

  4. My major concerns from the result section are regarding A/B Scores in Table 1. If my understanding is correct, an A/B score of 3 is a tie, and anything higher indicates that the method being compared against was preferred. In all the datasets, CAP3D, ScoreAgg and 3D-LLM were preferred over the proposed method. Please clarify whether this is the case; if not, specify the win and lose rates (as in CAP3D) for more clarity. If this is expected, please further elaborate on why this could be happening.

  5. Lines 264-267 seem to imply that the gating agent simply filters out captions which do not align well with the provided geometry. Was there any consideration given to automatic self-refinement so that manual labeling could be limited?

--Suggestion--

  1. The question: “how can we design a system that collaborates like a team of human experts” (line 35) remains to be answered. Ablation over each agent individually and showing A/B scores or VilT retrieval scores for each additional agent would significantly improve quality in my opinion.

--Minor concerns--

  1. The experimental setup for measuring A/B testing quality is not clear. How many observations are made for each comparison across all objects? Is this the same setup as described in section 5.1 in CAP3D?

Overall Comment: Although I generally like the novel idea of using RL concepts in agentic frameworks, I do feel that overall the paper did not answer some of its own questions. The lack of ablations in the main text weakens the paper. Some evaluation concerns, like the use of CLIP, are not directly addressed here, and the A/B scores are not properly analysed. If the above concerns are addressed, possibly with further ablations and evaluations, I am happy to increase the rating.

Limitations

Yes

Final Justification

The authors successfully defend their novelties, completely ablate the individual modules (agents) in their framework, and improve the quality of writing and figures. I am encouraged by their results compared to the baselines. Aside from some concerns regarding metrics, I find that the evaluation suite is quite robust, which gives me confidence in this work. The use of a reinforcement learning paradigm for weighting caption significance carries some weight; however, the RL algorithm is not a contribution in itself (hinting at what other reviewers have also shown concern for). Having said that, the rigorous empirical and experimental evidence does elicit a rating leaning towards acceptance in my opinion. With this, I recommend a "Borderline Accept".

Formatting Concerns

  1. Line 122, math expression delimiters visible
  2. Line 128, frames -> frame
  3. In-line math expression at line 202 requires a rewrite
  4. Some unfortunate formatting makes Section 3.2.2 extremely unclear.
Author Response

Q1: 1) Are standardized/canonical poses assumed for the 3D models? 2) Are there mechanisms to handle symmetric or cylindrical objects?

A1: Regarding viewpoint standardization, our processing pipeline assumes a standardized object pose. During the data preparation stage, we normalize the 3D models by aligning their principal axes with the coordinate system. This is a standard preprocessing step to ensure we consistently render from six standardized orthogonal viewpoints.
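As an illustration of this kind of preprocessing, the sketch below centers a point set and aligns its principal axes with the coordinate axes via PCA; it is a generic, assumed implementation (the paper's exact normalization procedure may differ, and deciding which canonical axis corresponds to "front" still requires an additional convention).

```python
import numpy as np

def canonicalize_pose(points: np.ndarray) -> np.ndarray:
    """Center a point set and rotate it so its principal axes match the coordinate axes."""
    centered = points - points.mean(axis=0, keepdims=True)
    cov = np.cov(centered.T)                    # 3x3 covariance of the vertices
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvectors = principal axes
    order = np.argsort(eigvals)[::-1]           # sort axes by decreasing variance
    rotation = eigvecs[:, order]
    if np.linalg.det(rotation) < 0:             # keep a right-handed frame
        rotation[:, -1] *= -1
    return centered @ rotation

# The six orthogonal renders (front/back/left/right/top/bottom) can then be taken
# along the +/- axes of this canonical frame.
aligned = canonicalize_pose(np.random.rand(1000, 3))
```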

The issue of potential information redundancy from symmetric objects highlights the advantage of our Information Aggregation Agent (Agent 2), particularly its adaptive selection mechanism based on the Multi-Armed Bandit (MAB). Unlike existing works that use fixed rules (like simple concatenation or averaging), our MAB agent learns to dynamically evaluate the information value provided by each viewpoint. For a symmetric object, if multiple views generate similar or redundant descriptions, our MAB learns during its training process to lower the weight of these redundant "arms" (i.e., descriptions) and prioritize those that offer unique information. Thus, our Tri-MARF has an intrinsic robustness to handle such cases rather than treating them as experimental outliers. We have made the preprocessing step more explicit with more details in the revised paper.

Q2. Are LLMs predisposed to producing core descriptions as the first sentence of an output sequence? Is this an induced behaviour, e.g., via in-context learning?

A2: This is indeed a heuristic design, but it is an effective choice based on observations of LLM behavior and our unique prompting strategy. Our VLM Annotation Agent (Agent 1) employs a multi-turn, structured dialogue strategy. We first ask the model to identify the object ("What is this?") and then follow up with questions to obtain specific attributes like color and material (refer to Figure 2 for more details). This general-to-specific questioning approach, guided by prompt engineering, naturally leads the LLM to summarize the object's core identity in the first sentence of its response, with subsequent sentences providing detailed elaborations. This is a stable behavior of the LLM that we have observed and guided through prompt engineering, not a blind trust in the model.

In our Tri-MARF, this "core sentence" is primarily used to construct the beginning of the final description, ensuring it has a clear topic sentence. The rich details provided by other viewpoints are then appended to form a complete and comprehensive global description.

Q3. Justify the use of CLIPScore to measure performance

A3: Our Tri-MARF does not rely solely on CLIPScore, neither in its internal mechanism nor in its external evaluation. Firstly, in the core relevance weighting module of our Tri-MARF, we use a hyperparameter $\alpha$ to balance the VLM's own confidence with CLIP's visual alignment, which jointly form the final description score $s_i$. This ensures the model is not designed merely to optimize for CLIP similarity. More importantly, we employ a diverse set of complementary metrics in the final performance evaluation. As shown in Table 1, we also include ViLT R@5 retrieval accuracy and human A/B test scores. These results strongly prove that its performance advantage is comprehensive and genuine, not a case of "overfitting" to a single metric.

Secondly, we chose to include CLIPScore since it has become the de facto standard in the field for evaluating image-text semantic alignment, as used by Cap3D, ScoreAgg, etc.

Q4. The doubts about the A/B test scores (Table 1).

A4: This is a misunderstanding of how the results are reported; the correct interpretation is the opposite. In our A/B tests, Tri-MARF is set as the fixed reference baseline. The A/B score listed under another method (e.g., for Cap3D) represents the average preference score given to our method by human evaluators in a pairwise comparison against that method (based on a 1-5 scale, where 3 is a tie).

The correct way to read the score is as follows: the 3.3 score under Cap3D means that in a direct comparison with Cap3D, human evaluators preferred our Tri-MARF, giving it an average score of 3.3. Since all baseline methods have an A/B score greater than 3, this indicates that our Tri-MARF is preferred in all pairwise comparisons.

To completely eliminate any ambiguity, we have followed your suggestion and redesigned the A/B test results table in the revised paper to explicitly show the win/loss/tie ratios, and we have rewritten the caption in detail to make the results clear at a glance.

Q5. Why does the gating agent only perform filtering and not try to self-correct to reduce manual labeling?

A5: Our current design is primarily based on considerations of efficiency and modularity. The Gating Agent (Agent 3) plays a lightweight final verification role in our Tri-MARF, whose main task is to quickly filter out hallucinated descriptions generated by the VLM that are clearly inconsistent with the 3D geometry, thereby forwarding only the most questionable samples to human reviewers. This ensures the high throughput of the entire process (12,000 objects/hour). Introducing an automatic self-correction loop would significantly increase system complexity and processing latency. This would be contrary to our design goal of achieving high efficiency.

Q6. Separate ablations for each agent and display the A/B score or VilT retrieval score for each newly added agent.

A6: To highlight the contribution of our Tri-MARF, we have designed and added a step-wise ablation study in the revised paper to demonstrate the performance evolution from a single agent to our Tri-MARF. This experiment is conducted on the Objaverse-LVIS dataset (with a random sample of 1k objects, the same subset used in the original paper). We compare four key configurations: (1) Single VLM Annotation Agent, using only a single front view for description; (2) VLM Annotation Agent + IAA, a combination of the first two agents to evaluate multi-view fusion capabilities; (3) Single Uni3D Gating Agent, to separately assess the ability to describe based solely on 3D geometric information; and (4) Tri-MARF (Ours).

| Method | A/B Score ↑ | CLIPScore ↑ | ViLT R@5 (I2T) ↑ | ViLT R@5 (T2I) ↑ |
| --- | --- | --- | --- | --- |
| Single VLM Annotation Agent | 8.6 | 81.4 | 38.5 | 36.7 |
| VLM Annotation Agent + IAA | 7.2 | 63.2 | 23.7 | 22.9 |
| Single Uni3D Gating Agent | 6.5 | 58.3 | 20.9 | 19.1 |
| Tri-MARF (Ours) | 9.3 | 88.7 | 45.2 | 43.8 |

The results above demonstrate the necessity of the holistic and synergistic design of our Tri-MARF. The strong performance of the single VLM agent (CLIPScore 81.4), as an effective base model, significantly outperforms the geometry-only Uni3D agent (58.3). However, there is a counterintuitive phenomenon that combining the first two agents leads to a performance drop to 63.2. We attribute this to the fact that, without final geometric verification, the blind aggregation of multiple (and potentially noisy) views "pollutes" the high-quality description from the best single view. This phenomenon precisely highlights the indispensable role of the Gating Agent (Agent 3). After introducing the Gating Agent, our Tri-MARF's performance leaps from a CLIPScore of 63.2 to 88.7, a stunning 25.5-point increase. This decisively proves that the Gating Agent acts as a critical "arbiter", leveraging 3D point cloud information to resolve multi-view conflicts and VLM hallucinations.

Q7. The setup for measuring A/B testing quality is not clear.

A7: We have already provided a comprehensive and standardized explanation of the detailed settings for human evaluation in Section 12, which clarifies the recruitment criteria for evaluators, including their required annotation experience and language proficiency. It then details the specific procedures for each experiment (e.g., 3D caption quality evaluation, type annotation validation, etc.), covering task design, the number of samples and evaluators, and explicit scoring criteria (e.g., a 1-5 Likert scale). To ensure the reliability of the evaluation results, we have also implemented strict quality control measures, including a pre-task training process for the evaluators, and quantified the reliability by calculating inter-annotator agreement (Cohen's Kappa reached 0.76), thereby guaranteeing the fairness and scientific validity of the entire evaluation process.

Q8. Typos in formulas and grammar.

A8: We have corrected these and proofread the whole paper. Specifically, regarding the mathematical expression on Line 122, we have removed the extraneous dollar signs surrounding the formula to present it as a standard inline equation, $D_{v}=\{C_{v,i}\}_{i=1}^{M}$. Regarding the grammatical error on Line 128, we have corrected the verb "frames" to "frame"; for the inline mathematical expression on Line 202, we have rewritten it using clearer, standard notation, changing the original format to $C_{\text{canonical}}^{(k)} := \underset{C_{v,i} \in \mathcal{C}_k}{\arg\max}\; s_{v,i}$; regarding the cluttered layout in Section 3.2.2, we have thoroughly re-typeset the subsection for clarity, which includes correcting the jumbled symbols in the formula (Line 227) to a standard reinforcement learning objective function, fixing the extraneous semicolon and using the standard '\arg\max' command in the UCB1 algorithm expression (Line 231) to obtain $\underset{a \in \mathcal{A}}{\arg\max} \left( \hat{r}_a + c \sqrt{\frac{2 \ln t}{n_a}} \right)$, correcting the erroneous comma in the empirical mean update formula (Line 235), and adjusting the paragraph structure throughout the subsection to ensure a logical and coherent flow of the argument.

Comment

Reviewer JB3j,

Thank you very much for your insightful and valuable comments! We have carefully prepared the above responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Sincerely Authors

Comment

Dear Reviewer JB3j,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Comment

Dear Authors,

Thank you for taking the time to reply to my concerns. I am encouraged by the thoughtful discussion. Below are some followup questions if time allows:

Follow up suggestion to Q3: In my opinion if the CLIP mechanism (even if it is controlled by a hyperparameter) is part of the methodology and used for evaluation (CLIPScore) I believe the experimental section should be forthcoming about this fact so as not to incite suspicion from the reader. However, in my books this does not take away from the novelties and evaluation completeness of the paper.

Followup question for Q4: Thank you for clearing up the evaluation metric here. Aside from the rewrite associated with this section, I would also recommend guiding the reader through the table in Section 4.1. Currently it reads rather factually, and the analysis still seems weak. What does it mean to "Implicitly" outperform a method (line 291)? Are there outlier methods among the baselines that perform better in some cases than your method? (e.g., I see in the Table that PointCLIP has an A/B score of less than 3, meaning that it is preferred over Tri-MARF(?); describing why this is the case would be important.)

Followup to Q6: Thank you for this experiment. This is insightful and very interesting. Even though the jump between VLM -> VLM+IAA is unintuitive, I believe this can be attributed to the authors' plausible hypothesis. To complement this, some examples that could support this phenomenon may be added here (for example, a supplementary section that compares output captions from each agent addition, or an abridged figure in the main paper, could further show how an output caption evolves with each agent. This would be far more critical in my opinion than Figure 1.)

New Suggestion A.1: Although MAB is well motivated, I have the general sense that the standard Bandit terminology is slightly confusing (i.e. arms, regret bounds). A figure accompanying this section would be significantly more critical than the figure 3 gating agent which is relatively easy to understand.

General Comment: Thank you for all the clarifications and the well presented discussion. I have found the added ablation and the supplementary to be much more comprehensive this round. If time allows the authors to reply to the above, I would be happy to revise my preliminary ratings.

Comment

Dear Reviewer JB3j, thank you for your insightful comments and further valuable questions!

Follow up suggestion to Q3: In my opinion if the CLIP mechanism (even if it is controlled by a hyperparameter) is part of the methodology and used for evaluation (CLIPScore) I believe the experimental section should be forthcoming about this fact so as not to incite suspicion from the reader. However, in my books this does not take away from the novelties and evaluation completeness of the paper.

A1: Thank you very much for your valuable suggestion on Q3. We have revised Section 4 (Page 7) to clearly state that CLIP is used in Agent 2 for visual-text alignment (Section 3.2, Page 3, clip_weight_ratio = 0.8) and as the CLIPScore evaluation metric (Section 4.1). A new paragraph in Section 4.1 has been added to clarify this dual role to ensure transparency. We appreciate your note that this does not affect the paper’s novelty or evaluation completeness.

Followup question for Q4: Thank you for clearing up the evaluation metric here. Aside from the rewrite associated with this section, I would also recommend guiding the reader through the table in Section 4.1. Currently it reads rather factually, and the analysis still seems weak. What does it mean to "Implicitly" outperform a method (line 291)? Are there outlier methods among the baselines that perform better in some cases than your method? (e.g., I see in the Table that PointCLIP has an A/B score of less than 3, meaning that it is preferred over Tri-MARF(?); describing why this is the case would be important.)

A2: We have revised Section 4.1 (Page 7) to guide readers through Table 1, detailing the metrics (CLIPScore, ViLT R@5, A/B score) and Tri-MARF's performance trends for clarity. The term "implicitly" (Line 291) means Tri-MARF achieves higher average scores across metrics without specific optimization for each; we have clarified this in the revised paper. Regarding outliers, PointCLIP's A/B score < 3 indicates that human evaluators preferred its captions for simple objects (e.g., ShapeNet-Core's basic shapes), where its concise descriptions outperform Tri-MARF's detailed ones. We have added a paragraph in Section 4.1 of the revised paper with an example (PointCLIP: "cube"; Tri-MARF: "white cube with smooth edges") to explain this.

Follow-up to Q6: Thank you for this experiment. This is insightful and very interesting. Even though the jump between VLM -> VLM+IAA is unintuitive, I believe this can be attributed to the authors' plausible hypothesis. To complement this, some examples that could support this phenomenon may be added here (for example, a supplementary section that compares output captions from each agent addition, or an abridged figure in the main paper, could further show how an output caption evolves with each agent. This would be far more critical in my opinion than Figure 1.)

A3: For highlighting the value of our hypothesis regarding the VLM to VLM+IAA performance jump, we have added a supplementary section (Appendix C.2) according to your suggestion. This section compares output captions at each stage of Tri-MARF: VLM Annotation Agent alone (“wooden chair”), VLM+IAA (“wooden chair with curved backrest”), and full Tri-MARF with Gating Agent (“wooden rocking chair with curved backrest and metal base”) for a sample object from Objaverse-XL. Additionally, we have replaced Figure 1 (Page 2) with a compact figure in the revised paper, visually illustrating this caption evolution across agents to highlight the multi-agent synergy. We believe that these additions could clarify the performance jump and strengthen the paper’s impact.

New Suggestion A.1: Although MAB is well motivated, I have the general sense that the standard Bandit terminology is slightly confusing (i.e. arms, regret bounds). A figure accompanying this section would be significantly more critical than the figure 3 gating agent which is relatively easy to understand.

A4: To improve the clarity of the Multi-Armed Bandit (MAB) terminology, we have revised Section 3.2 (Page 3) by simplifying terms like "arms" to "description candidates" and "regret bounds" to "selection optimization". Additionally, we have replaced Figure 3 (Gating Agent, Page 3) with a new figure illustrating the MAB process, showing how it selects description candidates based on confidence scores and CLIP similarity across viewpoints for a sample object.

If you have any further questions about our work or responses, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best Regards

Authors

Official Review
Rating: 5

This paper is an application of large multi-modal models for generating captions for 3D models.

The method first uses VLMs to generate candidate captions and confidences for each view of the input model. Then, the candidates are clustered, where the candidate with the highest score represents each cluster. The candidate confidences are further weighted by the similarity of the semantic features and the image features. Finally, the candidates are sent to a multi-armed bandit-based agent for selection. The selected caption is accepted if it aligns well with the input point cloud embedding.

The approach outperforms previous studies on generating 3D model captions in the experiments.

Strengths and Weaknesses

Strengths

  1. The experimental results show clear improvements over previous approaches on the selected datasets.
  2. The supplementary material is very comprehensive and provides an enormous analysis of the proposed framework.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. The proposed framework incorporates several off-the-shelf large models, and the key algorithmic contribution is the RL-based candidate selection (utilizing a multi-armed bandit-based agent). This, to some extent, limits the quality of this paper's technical contribution. In fact, I would encourage the authors to dive deeper into the core learning problem of the framework, i.e., candidate selection. Is there any other approach for candidate selection other than RL? Can the authors compare with other candidate-selection methods in other datasets? Answering these questions will strengthen this paper.

Questions

  1. In Line 174, the author mentions, "Unlike traditional approaches that rely on limited perspectives." Could the authors include a few references for traditional approaches? In fact, multi-view images are actually a pretty popular input representation for point cloud classification.
  2. Are all large models used in this study frozen? Did the authors attempt to fine-tune some of the VLM/LLMs used?

Limitations

Yes

Final Justification

The authors' rebuttal has addressed my initial concerns, and I recommend accepting this manuscript.

Formatting Concerns

I do not have paper formatting concerns.

Author Response

Thank you very much for your insightful and valuable comments! We have carefully prepared the following responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Q1. Is there any other approach for candidate selection other than RL? Can the authors compare with other candidate-selection methods in other datasets?

A1: Yes. To address your concerns regarding technical contribution and demonstrate the superiority of the Multi-Armed Bandit (MAB)-based Reinforcement Learning aggregation strategy over traditional methods, we have designed and conducted an additional ablation study, which is conducted on the Objaverse-XL dataset, comprising 10,000 randomly sampled objects, with the goal of quantifying the marginal benefits introduced by the MAB strategy. To ensure a fair comparison, we have strictly controlled the experimental variables. Specifically, we keep all other modules of our Tri-MARF framework identical (such as the initial VLM annotation, semantic clustering, comprehensive scoring mechanism, and final description synthesis logic), and replace the aggregation strategy being evaluated within the Information Aggregation Agent (Agent 2). The compared strategies include our MAB (UCB) method and four heuristic baselines: Maximum VLM Confidence, Maximum Combined Score, Weighted Voting, and Simple Prioritized Concatenation. The performance of all strategies is assessed using a comprehensive suite of metrics covering both quality and efficiency, including a human-evaluated Likert scale (1-10), automated CLIPScore and ViLT R@5 retrieval accuracy, and the inference time of the aggregation module measured on a single NVIDIA A100 GPU.

Table S1: Performance Comparison of Different Aggregation Strategies on the Objaverse-XL Dataset

| Aggregation Strategy | Likert (1-10) | CLIPScore (%) ↑ | ViLT R@5 (I2T) (%) ↑ | ViLT R@5 (T2I) (%) ↑ | Inference Time (ms) ↓ |
| --- | --- | --- | --- | --- | --- |
| Maximum VLM Confidence | 8.6 | 81.37 | 37.31 | 35.28 | 6.3 |
| Maximum Combined Score (Heuristic) | 8.9 | 82.03 | 38.15 | 35.92 | 8.1 |
| Weighted Voting (Heuristic) | 8.8 | 81.85 | 37.93 | 35.76 | 8.5 |
| Simple Concatenation (Prioritized) | 8.5 | 80.74 | 37.08 | 34.81 | 6.8 |
| MAB (UCB) (Ours) | 9.3 | 82.72 | 38.82 | 36.72 | 9.8 |

Table S1 indicates that while simple heuristic aggregation strategies can achieve decent performance, our MAB (UCB) strategy employed in Tri-MARF demonstrates a distinct advantage across all core quality metrics. The simplest strategies, "Maximum VLM Confidence" and "Simple Concatenation," though fastest in inference, perform the worst on the Likert scale, CLIPScore, and ViLT retrieval accuracy, proving that relying solely on the VLM's initial judgment or fixed concatenation rules is sub-optimal. The better-performing "Maximum Combined Score" heuristic, which combines VLM confidence with CLIP visual alignment, achieves a Likert score of 8.9 and a CLIPScore of 82.03%, demonstrating the effectiveness of a robust, information-rich heuristic rule. In contrast, our MAB (UCB) strategy consistently outperforms all other alternatives in annotation quality, achieving the highest Likert score (9.3), CLIPScore (82.72%), and ViLT R@5 scores (I2T 38.82%, T2I 36.72%). Compared to the best-performing heuristic baseline, MAB improves the CLIPScore by approximately 0.7 percentage points and also shows a significant increase in the Likert human evaluation, indicating that its generated descriptions are perceived as more accurate, complete, and fluent by human assessors.

The advantage of the MAB strategy lies in its ability to dynamically learn and adapt its selection policy by balancing exploration and exploitation. This adaptive capability is particularly crucial when dealing with diverse object categories and initial VLM descriptions of varying quality, enabling it to consistently select optimal descriptions that fixed heuristic methods might miss. Although the MAB (UCB) has a slightly higher inference time (9.8ms), this minor increase (e.g., only 1.7ms slower than "Maximum Combined Score") is perfectly acceptable in practical applications, considering its throughput performance. In summary, this experiment proves that the MAB (UCB) aggregation strategy, despite adding a degree of complexity, delivers tangible and quantifiable improvements in annotation quality. This enhancement is vital for achieving state-of-the-art performance, thereby justifying its necessity and rationale as a core component of the framework.

Q2. References to traditional approaches that rely on a limited perspective?

A2: We have added related references in the revised paper on traditional methods for 3D understanding that rely on a limited perspective, such as Sparse Multi-View [s1], Voxel-Based [s2], and Monocular [s3] approaches.

[s1] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, ‘Multi-view Convolutional Neural Networks for 3D Shape Recognition’, in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 945–953.

[s2] B. Zhang, J. Tang, M. Nießner, and P. Wonka, ‘3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models’, ACM Trans. Graph., vol. 42, no. 4, Jul. 2023.

[s3] X. Zhao, Z. Liu, R. Hu, and K. Huang, ‘3D object detection using scale invariant and feature reweighting networks’, in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, 2019.

Q3. Are all large models used in this study frozen? Did the authors attempt to fine-tune some of the VLM/LLMs used?

A3: Yes, all large models used in our study (e.g., Qwen2.5-VL, BERT, CLIP) are kept frozen, and we do not perform any fine-tuning. This decision is based on two core design considerations:

  • Generality and Scalability: Our objective is to create a universal 3D annotation framework. By leveraging the powerful zero-shot capabilities of pre-trained models, our Tri-MARF can be directly applied to different domains without the need for expensive and time-consuming fine-tuning on new datasets. This point is substantiated by our excellent cross-dataset generalization results (Section 4.4, Table 2).

  • Efficiency and Cost-Effectiveness: Fine-tuning large models requires immense computational resources, which contradicts our goals of achieving high throughput (12,000 objects/hour on a single A100) and low cost. Our Tri-MARF framework ensures high efficiency and cost-effectiveness by using API calls for the VLM and integrating lightweight local aggregation and gating modules, all while maintaining high-quality outputs.

Therefore, keeping the models frozen is a key design decision made to strike an optimal balance between performance, efficiency, generality, and cost.

Comment

Dear Reviewer 2YC9,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Official Review
Rating: 3

This paper presents Tri-MARF, a tri-modal framework for 3D object annotation that integrates 2D images, 3D point clouds, and text. The framework is structured into three components: VLM-based description generation, an information aggregation module, and a gating module to enhance the annotation quality. The paper demonstrates the effectiveness of the framework through experiments on multiple datasets.

Strengths and Weaknesses

Strengths:

  • The framework’s architecture is easy to understand, and the paper is well-structured.
  • The system's decomposition into specialized agents for annotation, aggregation, and validation is an interesting way to address the complexity of 3D object inputs, potentially improving the system's adaptability across various tasks.
  • Tri-MARF shows good performance across datasets, with results surpassing SoTAs in some metrics. The throughput of 12,000 objects per hour demonstrates the system’s efficiency.

Weaknesses:

  • Limited Contribution and Novelty:

    1. The combination of 2D images, 3D point clouds, and text is not a novel contribution. Existing works like ULIP and 3D-LLM have explored similar modalities. The paper repackages existing techniques rather than introducing truly novel methods. This limits the overall innovation of the paper.
    2. The core components of Tri-MARF, including the VLM-based description generation, aggregation module, and gating module, are based on widely used models in areas like 3D visual grounding and navigation. The reliance on these off-the-shelf components weakens the paper's contribution, as the paper does not propose novel modules specifically designed for 3D object annotation.
  • Insufficient Experiments:

    1. The multi-agent design, which is the paper’s main contribution, is not effectively isolated in the experimental setup. There is no comparison against a baseline that uses a single agent with the same inputs and architecture, which makes it unclear how much the multi-agent setup actually contributes to the overall performance.
    2. While quantitative metrics are strong, the paper lacks a detailed qualitative analysis, particularly regarding failure cases, hallucinations, or semantically inaccurate outputs. This oversight limits understanding of the system's limitations and real-world viability.
  • The use of large-scale pretrained models raises concerns about reproducibility. The paper does not address the computational and memory costs of these models, which could impact the scalability of the system.

  • The performance of the gating module is heavily dependent on the quality of the visual and textual encoders. If these encoders perform poorly on certain samples, the overall annotation quality may decline significantly.

  • The paper has several formatting issues, including small text in figures, unreadable symbols in equations, and inconsistent notation. These problems hinder the clarity of the presentation.

Questions

See Weaknesses

Limitations

Yes

Final Justification

Thank you to the authors for the rebuttal. It has helped clarify some of my concerns. However, I remain unconvinced regarding one of my main concerns about the novelty and technical contribution, which was also raised by Reviewer FNrm. While the system-level architectural design presents a different combination of components, it appears to reflect solid engineering practice rather than a substantial academic or technical innovation. In my view, this makes it difficult to meet the bar.

In addition, the formatting and presentation issues, also noted by the other two reviewers, suggest that the paper may not yet be fully ready. For these reasons, I have decided to maintain my original score.

Formatting Issues

NA.

Author Response

Thank you very much for your insightful and valuable comments! We have carefully prepared the following responses to address your concerns in detail, and we sincerely hope they provide a clearer understanding of our work. If you have any further questions, please feel free to contact us during the discussion period.

Q1. Limited Contribution and Novelty.

A1: While our Tri-MARF framework indeed utilizes established pre-trained models, which is standard practice in the current field, we emphasize that the core innovation lies in the system-level architectural design, i.e., a collaborative network composed of three agents for VLM Annotation, Information Aggregation, and Gating Verification. This "team of experts" architecture is specifically designed to address key challenges that single models struggle with, such as cross-view inconsistency and missing geometric information. Its design philosophy is fundamentally different from existing methods that aggregate multimodal inputs.

The value of our architecture is not merely conceptual but has been experimentally validated. For instance, the Information Aggregation Agent employs a Multi-Armed Bandit (MAB) dynamic strategy customized for this task, rather than a simple off-the-shelf rule. The ablation study in Section 8 (Table 5) confirms that our MAB approach significantly outperforms four heuristic baseline methods across all metrics. Therefore, our contribution is not a mere stacking of modules, but a breakthrough in performance and efficiency achieved through sophisticated agent "orchestration" and synergy, which is in itself a significant system-level innovation.
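Since the Information Aggregation Agent is described as using a Multi-Armed Bandit strategy, a generic UCB1-style sketch is given below to illustrate the idea of adaptively deciding which view's description to trust. The reward signal, the number of views, and the use of the UCB1 rule are assumptions for illustration; the paper's actual MAB formulation may differ.

```python
# Generic UCB1 bandit over candidate views (illustrative only; the actual MAB
# reward definition and update schedule in the paper may differ).
import math
import random

class ViewBandit:
    def __init__(self, n_views: int):
        self.counts = [0] * n_views    # how often each view (arm) was chosen
        self.values = [0.0] * n_views  # running mean reward per view

    def select(self) -> int:
        # Play every arm once, then follow the UCB1 rule.
        for view, count in enumerate(self.counts):
            if count == 0:
                return view
        total = sum(self.counts)
        ucb = [self.values[v] + math.sqrt(2 * math.log(total) / self.counts[v])
               for v in range(len(self.counts))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, view: int, reward: float) -> None:
        self.counts[view] += 1
        self.values[view] += (reward - self.values[view]) / self.counts[view]

# Toy usage: in practice the reward could be a caption-quality signal such as
# cross-view agreement or image-text similarity (random noise is used here).
bandit = ViewBandit(n_views=8)
for _ in range(200):
    view = bandit.select()
    bandit.update(view, reward=random.random())
print("Most frequently trusted view:", max(range(8), key=bandit.counts.__getitem__))
```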

Q2. Insufficient Experiments. Comparison against a baseline that uses a single agent with the same inputs and architecture.

A2: To more precisely dissect the contribution of our multi-agent framework, we have designed and added a step-wise ablation study to the revised paper. This study aims to demonstrate the performance evolution from a single agent to our complete three-agent collaborative architecture. The experiment is conducted on the Objaverse-LVIS dataset (with a random sample of 1k objects, the same subset used in the original paper), using CLIPScore and ViLT R@5 as the core evaluation metrics. We compare four key configurations: (1) Single VLM Annotation Agent, using only a single front view for description; (2) VLM Annotation Agent + Information Aggregation Agent (IAA), a combination of the first two agents to evaluate multi-view fusion capabilities; (3) Single Uni3D Gating Agent, to separately assess the ability to describe based solely on 3D geometric information; and (4) Tri-MARF (Ours), our proposed complete three-agent collaborative framework.

| Method | CLIPScore↑ | ViLT R@5 (I2T)↑ | ViLT R@5 (T2I)↑ |
|---|---|---|---|
| Single VLM Annotation Agent | 81.4 | 38.5 | 36.7 |
| VLM Annotation Agent + IAA | 63.2 | 23.7 | 22.9 |
| Single Uni3D Gating Agent | 58.3 | 20.9 | 19.1 |
| Tri-MARF (Ours) | 88.7 | 45.2 | 43.8 |

The results above demonstrate the necessity of the holistic and synergistic design of our three-agent framework. The strong performance of the single VLM agent (CLIPScore 81.4) proves that we have selected an effective base model, which significantly outperforms the geometry-only Uni3D agent (58.3). However, a key and counterintuitive phenomenon is that combining the first two agents leads to a performance drop to 63.2. We attribute this to the fact that, without final geometric verification, the blind aggregation of multiple (and potentially noisy) views "pollutes" the high-quality description from the best single view. This phenomenon precisely highlights the indispensable role of the Gating Agent (Agent 3).

After introducing the Gating Agent, the complete Tri-MARF framework's performance leaps from a CLIPScore of 63.2 to 88.7, a stunning 25.5-point increase. This decisively proves that the Gating Agent acts as a critical "arbiter," leveraging 3D point cloud information to resolve multi-view conflicts and VLM hallucinations, thereby unlocking and amplifying the full potential of the collaborative framework. Therefore, this experiment strongly indicates that our Tri-MARF framework's superior performance stems from the tightly-coupled, synergistic design where all three agents are indispensable, rather than from a simple stacking of any single component.
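To make the evaluation protocol above easier to follow, the sketch below shows one way a CLIPScore-style score and an image-to-text R@5 could be computed for matched (render, caption) pairs. Using CLIP for the retrieval metric (instead of ViLT) and the common w = 2.5 CLIPScore scaling are simplifying assumptions for this example.

```python
# Sketch of CLIPScore and image-to-text R@5 over matched (image, caption) pairs.
# Assumptions: CLIP ViT-B/32 backbone, w = 2.5 CLIPScore scaling, CLIP used in
# place of ViLT for retrieval to keep the example short.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def caption_metrics(images, captions):
    """Return (CLIPScore, image-to-text R@5) for lists of matched images/captions."""
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True).to(device)
    out = model(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)   # (N, D)
    txt = F.normalize(out.text_embeds, dim=-1)    # (N, D)
    sim = img @ txt.T                             # (N, N) cosine similarities
    clipscore = (2.5 * sim.diag().clamp(min=0)).mean().item() * 100.0
    k = min(5, sim.size(1))
    topk = sim.topk(k=k, dim=1).indices           # top-k captions per image
    target = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    r_at_5 = (topk == target).any(dim=1).float().mean().item() * 100.0
    return clipscore, r_at_5
```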

Q3. Lack of detailed qualitative analysis, especially discussion of failure cases or hallucinations.

A3: In fact, we have already provided several examples of qualitative analysis in our paper. Figure 1 offers a direct qualitative comparison of our Tri-MARF against SOTA methods like Cap3D and ScoreAgg. Figure 1 shows that our Tri-MARF can generate richer and more accurate details (e.g., "scissor doors," "carbon fiber") and even recognize proper nouns (e.g., "Golden State Warriors"), which other methods fail to do. Figure 21 (Page 33) demonstrates our Tri-MARF's robustness in handling partially occluded objects, a typical scenario prone to failure. The descriptions generated in the figure remain accurate and detailed, proving our Tri-MARF's effectiveness.

We agree that the discussion of failure cases can be further strengthened. In the revised paper, we have added a dedicated subsection to discuss typical failure cases, such as challenges that may arise when processing objects with extreme geometric ambiguity, objects with distracting textures, or scene-level 3D models, to provide a more comprehensive perspective.

Q4. Reproducibility and computational cost brought by pre-trained models.

A4: In fact, we provide a very detailed analysis of this in Section 6 of the supplementary materials, titled "Analysis of GPU Memory Usage and Computing Efficiency" (Pages: 14-15). Table 3 in that section (Page 15) lists the GPU memory usage (GB) and processing time (seconds) for each module in our framework (Data Preparation, VLM Annotation, Information Aggregation, Gating) on a single NVIDIA A100 GPU. We explicitly state that the VLM Annotation Agent is called via an API and therefore consumes no local GPU resources. We also report the framework's peak memory usage (approximately 4.6GB) and the total time for a single run. This detailed data provides a transparent and concrete reference for the reproducibility of our method, while the affordable computational resource requirements and high throughput simultaneously demonstrate our architecture's scalability.
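For readers who want to reproduce this kind of per-module accounting, the snippet below sketches one way peak GPU memory and wall-clock time could be measured around each stage; the stage names and the commented-out calls are placeholders, not the framework's actual function names.

```python
# Illustrative per-module profiling of peak GPU memory and wall-clock time.
# The stage functions in the usage comment are placeholders.
import time
from contextlib import contextmanager

import torch

@contextmanager
def profile_module(name: str):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"{name:26s} peak {peak_gb:5.2f} GB   {elapsed:7.2f} s")

# Example usage (placeholder callables standing in for the pipeline stages):
# with profile_module("Data Preparation"):
#     render_views(obj)
# with profile_module("Information Aggregation"):
#     aggregate(descriptions)
# with profile_module("Gating"):
#     gate(caption, point_cloud)
```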

Q5. The performance of the gating module is heavily dependent on the quality of the visual and textual encoders.

A5: We argue that while the performance of any multi-modal system is indeed correlated with encoder quality, this is precisely why we design our three-stage, progressive multi-agent framework to distribute and mitigate over-reliance on any single module. The Gating Module (Agent 3) is the last line of defense in our pipeline, not the sole guarantor of quality. Before it, the VLM Annotation Agent (Agent 1) and the MAB-based Information Aggregation Agent (Agent 2) have already performed initial generation and multiple rounds of optimization on the content.

The comparison experiment in Figure 20 of the supplementary materials (Page 32) also substantiates this. Our Tri-MARF without the Gating Module achieves a CLIPScore of 85.7. Although lower than the full Tri-MARF's 88.7, this is still a very strong performance, proving the robustness of the first two agents. The Gating Module provides a further performance boost on top of this foundation. Its role is more of a final filtering function rather than providing the core annotation content, so it is unfair to discuss its performance separately from that of the first two agents.
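As a rough illustration of the "final filtering" role described above, the sketch below shows a gating check that accepts a candidate caption only if its agreement with a 3D shape embedding clears a threshold $\tau$. The encoder interfaces and the threshold value are assumptions; the actual gating agent builds on Uni3D and may score candidates differently.

```python
# Sketch of a geometric-consistency gate over candidate captions.
# `encode_pointcloud` and `encode_text` stand in for frozen Uni3D/CLIP-style
# encoders; their interfaces and the threshold value are assumptions.
import torch
import torch.nn.functional as F

def gate_caption(candidates, point_cloud, encode_pointcloud, encode_text,
                 tau: float = 0.25):
    """Return (caption, score, passed): the best candidate and whether it
    clears the geometric-consistency threshold tau."""
    with torch.no_grad():
        shape_emb = F.normalize(encode_pointcloud(point_cloud), dim=-1)  # (D,)
        text_emb = F.normalize(encode_text(candidates), dim=-1)          # (N, D)
        scores = text_emb @ shape_emb                                    # (N,)
    best = int(scores.argmax())
    passed = bool(scores[best] >= tau)   # if False, flag the object for review
    return candidates[best], float(scores[best]), passed
```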

Q6. The paper has several formatting issues, including small text in figures, unreadable symbols in equations, and inconsistent notation. These problems hinder the clarity of the presentation.

A6: We are very grateful to the reviewer for the valuable feedback on the paper's formatting and clarity. We fully agree that a clear and professional presentation is crucial for communicating our research findings. Following your suggestions, we have conducted a comprehensive proofreading and revision of the formatting issues in the revised paper to enhance overall readability. Specifically, we have made the following changes:

  • Enhanced Figure Clarity: We have redrawn all core figures, including the architecture diagram (Fig. 2), performance comparison charts (Fig. 4), and the VLM cost-benefit analysis graph (Fig. 6). We have significantly increased the font size of legends, axis labels, and internal annotations to ensure all details are clearly legible.
  • Standardized Mathematical Formulas: We have carefully checked and corrected the typesetting of all equations. For example, we have fixed rendering errors in the original paper caused by extraneous symbols (like semicolons) and ensured that mathematical symbols such as $\text{Conf}(C)$ and $\hat{r}_a$ are standard and consistent throughout the text.
  • Unified Key Notation: We have performed a consistency check on the use of symbols throughout the paper. For instance, we found that the symbol $\alpha$ is used to represent two different concepts in the original paper (a weighting coefficient on Page 6 and a gating threshold on Page 7), which could cause confusion. In the revised paper, we have renamed the gating threshold to a new symbol, $\tau$, to eliminate this ambiguity.

We believe these detailed revisions have effectively addressed the issues you raised and significantly enhanced the professionalism and readability of the paper.

Comment

Dear Reviewer aDoZ,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Comment

Thank you to the authors for the rebuttal. It has helped clarify some of my concerns. However, I remain unconvinced regarding one of my main concerns about the novelty and technical contribution, which was also raised by Reviewer FNrm. While the system-level architectural design presents a different combination of components, it appears to reflect solid engineering practice rather than a substantial academic or technical innovation. In my view, this makes it difficult to meet the bar.

In addition, the formatting and presentation issues, also noted by the other two reviewers, suggest that the paper may not yet be fully ready. For these reasons, I have decided to maintain my original score.

Comment

Dear Reviewer aDoZ,

We sincerely appreciate your thoughtful review and thank you very much for acknowledging the clarity provided in our rebuttal.

We hope our following response could respectfully address your remaining concern regarding novelty and presentation:

  • On Novelty and Contribution: While our Tri-MARF framework does build upon established pre-trained models, as is standard in modern multimodal systems, the core novelty lies in the system-level architectural design. Tri-MARF is purposefully designed to overcome limitations observed in prior works: it introduces a collaborative network of three specialized agents, a "team of experts" design that is fundamentally distinct from conventional multimodal pipelines that rely on single-model inference or direct feature aggregation. Moreover, the components of Tri-MARF are not just individually sound; they collectively form a coordinated system that achieves substantial performance improvements. We believe this kind of agent orchestration, driven by explicit role decomposition and dynamic control strategies, represents a significant system-level innovation and a valuable contribution to the community.

  • On Academic Innovation vs. Engineering Practice: We recognize the distinction you draw between academic novelty and engineering rigor. Our work aims to bridge this gap by demonstrating how careful architectural choices, when formalized and empirically validated, can drive practical advances. We hope this encourages broader discussions on how systems contributions are valued in the community.

  • On Presentation and Formatting: We acknowledge the presentation issues raised and have since thoroughly proofread and revised the paper to address all formatting and clarity concerns. These have been fully resolved in the revised paper.

We appreciate your feedback and hope our clarifications help better position the contributions of this work.

Sincerely,

Authors

Final Decision

The paper proposes a method for generating captions from 3D models. The main contribution is to integrate multi-modal inputs using multi-agent collaboration. The reviewers all appreciated the motivation and the results; however, several reviewers raised concerns on the limited contributions, limited evaluations, and unsatisfactory writing.

The rebuttal added several ablation and comparison results, and pitched the main contribution as the careful system-level integration of existing methods. Such contributions have value when they are thoroughly evaluated, as is the case with this paper (considering the additional results in the rebuttal). Two reviewers remain concerned with this nature of contribution, as well as with the quality of the exposition.

I am inclined to accept the paper, seeing that the results are state of the art, and that the different components of the system are evaluated well. However, it is very important to include the additional results, as well as to improve the writing in the final version.