PaperHub
Rating: 5.0 / 10 (withdrawn; 4 reviewers)
Individual ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 4.5
Correctness: 2.5
Contribution: 2.0
Presentation: 3.3
ICLR 2025

Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs

OpenReview | PDF
Submitted: 2024-09-22 · Updated: 2025-01-23

Abstract

Keywords
Multimodal Large Language Models (MLLMs);Mathematical Reasoning;Fine-grained Visual Understanding;Visual Grounding

Reviews and Discussion

Review (Rating: 5)

The paper proposes the SVE-Math-7B model to improve the math reasoning skills of current MLLMs. The authors start by analyzing mainstream models' performance on math reasoning tasks to show the effectiveness of geometric information. Based on this observation, the authors propose the SVE-Math architecture with a pre-trained GeoGLIP, a fusing connector with dual visual encoders, and further fine-tuning of the baseline models. The authors conduct experiments on mainstream math-related benchmarks such as MathVerse and MathVista and show improvements compared with baselines.

Strengths

  1. The paper discovers and discusses the math-solving problem of MLLMs, which is a significant and widely recognized problem for current MLLMs. The solution with GeoGLIP and math-relevant fine-tuning is an efficient way to address it.
  2. The analysis in Figure 1 clearly shows the drawbacks of LLaVA and GPT-4o, and shows the effectiveness of geometric information.
  3. The method and experiment sections are well organized and easy to follow. The authors conduct experiments on mainstream math datasets and clearly present the results.

Weaknesses

  1. The main weakness is that the ablation analysis is not sufficient to demonstrate the improvements of all the components. The authors propose GeoGLIP, the dual visual encoder connector, and math-specific fine-tuning with the Geo170K and MathV360K datasets. However, the analysis of these aspects is lacking. The authors only conduct experiments on the design of connectors, which is not the key claim of the contributions, as many papers have used similar fusing approaches for visual encoders. I think the authors should clearly explain where the improvements come from, especially for GeoGLIP and the math-relevant training datasets.
  2. Although the authors show improvements over baselines, the performance of SVE-Math-7B is significantly behind the state-of-the-art models (e.g., more than 60% accuracy on MathVista). Since the approach is claimed to be universal, results on state-of-the-art models are lacking.
  3. The effectiveness of GeoGLIP is not confirmed. I wonder how the tiny visual encoder with fewer than 50M parameters can help the overall learning results. As shown in the visualization results, directly providing geometric-relevant information in a proper manner may also lead to similar performance. The authors should conduct sufficient experiments to address this issue.

Questions

As stated in the weaknesses section, clarifying these issues would better support the conclusions of the paper.

  1. What are the improvements with math-specific datasets?
  2. Why is using GeoGLIP based on Swin-T effective? As illustrated in the visualization results, the model provides geometric information, so the authors could provide more comparisons by directly providing geometric results as input, or by directly using GLIP.
  3. The results for the current models are somewhat out of date. The authors are encouraged to equip state-of-the-art MLLMs with the proposed approaches.
  4. The math problems may already be solved with better data curation or reasoning processes, as many papers have done for such problems. The authors should explain the superiority of the proposed methods and provide comparisons with other methods on math problems. Therefore, based on the weaknesses and questions stated above, I think the paper is below the acceptance threshold in its current state.
Comment

2. Performance Relative to State-of-the-Art Models

The performance of SVE-Math, while not surpassing models such as LLaVA-OneVision (72B) with 67.5% accuracy and InternVL (40B) with 59.9% accuracy, demonstrates significant resource efficiency and adaptability. Our model achieves strong improvements within a 7B parameter architecture, striking an optimal balance between performance and computational demands. For example, with Qwen2.5-Math-7B, SVE-Math achieves a Top-1 accuracy of 51.3% on MathVista, surpassing GPT-4V (49.9%) and showcasing its capacity to compete with larger models. Furthermore, our results indicate that SVE-Math complements reasoning-focused approaches by bridging the gap in visual perception—an area less emphasized in current state-of-the-art designs.

By integrating GeoGLIP with reasoning-optimized LLMs such as DeepSeek-Math-7B, we achieve consistent 5-10% improvements across benchmarks, reinforcing the generalizability of our approach. These results emphasize that SVE-Math is not merely an alternative but a complementary and modular enhancement that can amplify the capabilities of existing MLLMs.

3. Validation of GeoGLIP’s Impact

GeoGLIP's lightweight design, based on Swin-T, has been explicitly optimized to enhance visual perception in mathematical tasks. Despite its compact size (fewer than 50M parameters), GeoGLIP achieves remarkable improvements in performance. The ablation studies below show that removing GeoGLIP (SVE-Math(-)) results in a significant drop in Top-1 accuracy on MathVista. Its attention-based mechanism enables precise identification and alignment of geometric primitives, junctions, and boundaries, facilitating downstream reasoning tasks.

| Model | Base LLM | All (acc) |
| --- | --- | --- |
| G-LLaVA | LLaMA2-7B | 25.1 |
| SVE-Math | LLaMA2-7B | 37.4 |
| SVE-Math(-) | Qwen2.5-7B | 44.0 |
| SVE-Math | Qwen2.5-7B | 51.3 |
| SVE-Math(-) | DeepSeek-7B | 42.3 |
| SVE-Math | DeepSeek-7B | 48.7 |

4. Applicability to Broader Mathematical Tasks

The modular design of SVE-Math ensures its applicability to a wide range of mathematical tasks beyond geometry-specific problems. Our experimental results demonstrate that the enhanced visual perception capabilities introduced by GeoGLIP significantly benefit tasks that involve non-geometric elements. For instance, experiments on advanced architectures like Qwen2.5-Math-7B and DeepSeek-Math-7B consistently show that GeoGLIP improves overall performance without being restricted to specific types of reasoning challenges.

While our experiments primarily focused on models with 7B parameters, the lightweight and generalizable nature of GeoGLIP ensures its scalability to larger state-of-the-art architectures, including those used in LLaVA-OneVision (72B). Future work will explore further scaling, but the current results already indicate the broad applicability of our approach across diverse mathematical reasoning scenarios.

5. Revisions and Enhancements

To address your concerns and enhance the clarity of our work, we have revised the manuscript to include:

  • Expanded Ablation Studies: Detailed analysis isolating the contributions of GeoGLIP, the dual encoder connector, and the datasets.
  • Synthetic Data Descriptions: Comprehensive explanations and visual examples of the synthetic data used for training GeoGLIP.
  • Updated Visualizations: The Introduction wrapfigure and Appendix Figures 12-14 now include examples demonstrating GeoGLIP's ability to enhance visual perception and reasoning.
  • Benchmark Comparisons: Tables 1-3 compare SVE-Math-Deepseek-7B to state-of-the-art models, emphasizing its resource efficiency and complementary design.

Comment

We thank the reviewer for insightful questions that help refine our work further.

We sincerely thank you for your insightful feedback and constructive suggestions. We greatly appreciate your recognition of the significance of addressing math-solving limitations in MLLMs and the efficiency of our proposed solution, SVE-Math. Below, we provide detailed responses to the specific weaknesses and questions you raised, supported by additional experiments and clarifications.

1. Ablation Analysis and Demonstration of Component Contributions.

We acknowledge the importance of rigorous ablation studies to isolate the contributions of each component in our model. In our revised manuscript, we provide a detailed analysis that clarifies the individual roles of GeoGLIP, the dual visual encoder connector, and the math-specific datasets. The updated results reinforce that GeoGLIP is the primary contributor to the observed performance improvements, aligning with our core motivation. As highlighted in our systematic error analysis (Figure 1), deficiencies in perceiving geometric primitives significantly impair MLLMs' performance on mathematical reasoning tasks. By addressing these perception gaps, GeoGLIP directly enhances the model's ability to perceive mathematical visual content.

The effectiveness of the proposed GeoGLIP is not validated. We apologize that we did not explicitly clarify this point earlier, which led to the concern that the improvement observed with our approach primarily stems from the instruction dataset used. To clarify, removing the GeoGLIP encoder—and consequently eliminating the need for the dual visual encoder connector—effectively reduces our SVE-Math-7B to G-LLaVA [Gao et al., 2023a]. Both G-LLaVA and our approach leverage the same LLM backbone (LLaMA2-7B) and the same instruction dataset, ensuring that the performance gains are directly attributable to our model design rather than the instruction dataset. The comparison results are detailed in Tables 1-3. For example, integrating our method into G-LLaVA (our SVE-Math-7B) improves Top-1 accuracy by 7.7% on MathVerse and 12.3% on MathVista.

Dual visual encoder connector. This ablation is demonstrated by our controlled experiments in Table 5a of Section 4. Specifically, the constant router assigns equal weights to all features, the sparse router selects a single level of feature map from GeoGLIP, and the soft router assigns learnable dynamic weights. We present the top-1 accuracy results from Table 5a for these configurations. For the sparse router, only the best performance, achieved with the first-level feature map, is shown in the table below.

| Model | Top-1 Acc (GeoQA) |
| --- | --- |
| Constant | 62.8 |
| Sparse | 64.9 |
| Soft | 67.0 |
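To make the three router variants concrete, here is a minimal, hypothetical PyTorch sketch of how such a router over GeoGLIP's pyramid features could look; the class name, shapes, and weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SoftRouter(nn.Module):
    """Illustrative soft router: learns one weight per pyramid level.

    Assumes each entry of `feats` is a GeoGLIP feature map already projected
    to a common (batch, tokens, dim) shape; shapes and names are hypothetical.
    """
    def __init__(self, num_levels: int = 4):
        super().__init__()
        # One learnable logit per pyramid level.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, feats):
        weights = torch.softmax(self.level_logits, dim=0)  # dynamic, learnable weights
        return sum(w * f for w, f in zip(weights, feats))

# Constant router: fix the weights to 1 / num_levels instead of learning them.
# Sparse router: select a single level, e.g. return feats[0].
```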

The comparison of visual encoders. We designed a variant that excludes the CLIP visual encoder, relying solely on our soft prompts from the GeoGLIP visual encoder. This resulted in an accuracy drop from 67.0% to 66.1%, though it still outperformed the CLIP encoder alone (64.2%). We leveraged the original hierarchical pyramid features from the GLIP visual encoder (trained on natural image datasets, such as Object365). To ensure a fair comparison, we utilized feature maps with the same resolution: the first layer with the largest resolution and the last three layers with smaller resolutions. This resulted in a performance drop from 67.0% to 65.3%, as GLIP lacks sensitivity to geometric details and fails to detect basic geometric shapes, as visualized in Fig. 9.

| Encoder | Model | Top-1 Acc (GeoQA) |
| --- | --- | --- |
| Dual encoders | GLIP + CLIP | 65.3 |
| Dual encoders | GeoGLIP + CLIP | 67.0 |
| Single encoder | GeoGLIP | 66.1 |
| Single encoder | CLIP | 64.2 |

Math-specific datasets (Geo170K and MathV360K) enhance reasoning capabilities, but their effectiveness is significantly amplified when combined with GeoGLIP. To further validate GeoGLIP's impact, we conducted experiments comparing its performance against directly incorporating geometric-relevant information (e.g., box/junction coordinates as additional text inputs for instruction fine-tuning) using GLIP. The results fell below the baseline model G-LLaVA, consistent with our observations in Figs. 1b and 1c. This aligns with the emphasis made in the introduction (lines 107-110), where we noted: "Given the inherent uncertainty in detecting geometric primitives by GeoGLIP, our initial approach utilizes global pyramid feature maps..."

  1. Jiahui Gao et al., G-llava: Solving geometric problem with multi-modal large language model, arXiv 2023.
Comment

We sincerely thank you for taking the time to carefully review our work and for acknowledging the additional ablations we provided regarding GeoGLIP. We understand your concerns about the lack of detailed analysis on data curation and the perceived relatively low performance of our model. We would like to address these points further:

1. Data Curation Analysis: We did not elaborate sufficiently on data curation in our initial response. To provide a more comprehensive explanation, we break this issue into the following two points based on Question 4: 1) the difference between our data curation and that of other mathematical papers ("The math problems may already be solved with better data curation or reasoning processes, as many papers have done on such problems"); 2) the superiority of our methods compared to previous mathematical MLLMs ("The authors could provide explanations and superiority for the proposed methods and provide comparisons with other methods on math problems").

  • Our synthetic visual-centric datasets are fundamentally different from the traditional visual-question paired mathematical instruction datasets commonly used in previous mathematical MLLMs. Unlike those methods, which typically rely on GPT-4o to generate diverse prompts and require human intervention to ensure quality, our approach is highly efficient and avoids the labor-intensive and costly process of manual curation. Specifically, our synthetic dataset is programmatically generated using Matplotlib for the box-level shape grounding task, and uses lightweight, free, publicly available models [Huang et al., 2018; Verbin et al., 2021] to extract junctions and boundaries as pixel-level ground truth for both our synthetic dataset and public mathematical training images (e.g., Geo170K). This efficient process avoids manual labeling and is explicitly used to train GeoGLIP for shape grounding, junction detection, and boundary detection. Please refer to Section 3.4 and Appendix A.6 for a detailed description of the data curation process. Specifically, Section 3.4 provides an overview of the data curation process, while a flow diagram illustrating the data engine, along with examples of the synthetic diagrams, is presented in Appendix Fig. 6. Data statistics for the synthetic math-specific datasets, including the distribution of geometric shapes and the number of objects per image, are visualized in Figs. 5b and 5c.

Huang et al., Learning to parse wireframes in images of man-made environments, CVPR 2018.

Verbin et al., Field of junctions: Extracting boundary structure at low SNR, ICCV 2021.

  • We have conducted experiments as suggested by the reviewer, directly providing geometric-relevant information to the model (a small formatting sketch is given after this point). Since no existing mathematical instruction datasets include detailed location information for geometric objects (e.g., bounding box coordinates or junction points), we generated this data by running GeoGLIP inference on the Geo170K training images to extract the relevant location information. This information was appended after the special token <image> in the 'human' value as supplementary descriptions for each image, using instructions such as: "there is a bounding box at ⟨x, y, w, h⟩" or "there is a junction at ⟨x, y⟩ with line directions ⟨θ⟩". When tested on the Geo170K test set of the GeoQA benchmark, the top-1 accuracy dropped from 67.0% to 63.2%. This result is close to that of our constant-router variant (62.8%, which assigns equal weights to all features, as explained under the dual visual encoder connector in response 1). This performance drop is consistent with our systematic analysis in Figs. 1b and 1c: inaccurate instructions harm performance, and relevance is key—excessive visual cues interfere with problem-solving.

We appreciate the reviewer's suggestion that directly providing geometric-relevant information in a proper manner may also lead to similar performance. Based on our experiments and observations, such a method would require nearly 100% accurate grounding results for every mathematical object and highly relevant information tailored to the specific question. However, achieving this would demand significant human resources, including the involvement of mathematical experts.
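For illustration only, the following is a minimal sketch of how such textual supplementary descriptions might be assembled from detector outputs; the function name and the input format are hypothetical assumptions, while the wording of the inserted sentences mirrors the example above.

```python
def format_geo_prompt(boxes, junctions):
    """Hypothetical formatter that appends detected geometry as text after <image>.

    `boxes` are normalized (x, y, w, h) tuples; `junctions` are (x, y, theta)
    tuples with theta in degrees. Both conventions are assumptions for this sketch.
    """
    lines = ["<image>"]
    for x, y, w, h in boxes:
        lines.append(f"there is a bounding box at <{x:.2f}, {y:.2f}, {w:.2f}, {h:.2f}>")
    for x, y, theta in junctions:
        lines.append(f"there is a junction at <{x:.2f}, {y:.2f}> with line directions <{theta:.1f}>")
    return "\n".join(lines)

print(format_geo_prompt([(0.12, 0.30, 0.25, 0.25)], [(0.40, 0.55, 45.0)]))
```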

  • Our approach instead leverages global pyramid feature maps that encode information ranging from geometry-rich to semantic-rich representations, with their contributions dynamically modulated by the feature router mechanism. Our research underscores the importance of addressing fine-grained visual understanding, a critical bottleneck in visual mathematical reasoning tasks. We hope our work provides valuable insights for future research and emphasizes the need for more effective integration of fine-grained visual understanding in MLLMs.
Comment

1. Data Curation Analysis (Continue):

  • Finally, we would like to emphasize the importance of training visual-centric, geometry-aware models with fine-grained box- and pixel-level supervision rather than relying on image-level supervision from contrastive training (e.g., CLIP). The latter may be the critical reason for the deficient visual perception ability of visual mathematical MLLMs.

2. Relatively Low Performance:

While we agree that our model’s performance still has room for improvement, we would like to highlight that our results represent a significant step forward in addressing the visual perception limitations of multimodal large language models (MLLMs) in mathematical visual reasoning in a small-scale 7B model.

  • MathVerse: SVE-Math-7B achieves 21.2% accuracy, improving by 7.7% over the baseline G-LLaVA-7B, with performance comparable to Math-LLaVA-13B (19.0%).
  • MathVista: We conducted additional experiments by integrating GeoGLIP with Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct (two of the most advanced mathematical reasoning LLMs currently available). We evaluated those 7B models on the most challenging MathVista benchmark, achieving 51.3% and 48.7% Top-1 accuracy, even surpassing GPT-4V's performance (49.9%).

Our lightweight, geometry-focused design, with less than a 50MB increase in parameter size and a 0.24s increase in inference time per image, is orthogonal to approaches emphasizing reasoning, making it a natural complement to such methods. GeoGLIP bridges a critical gap in visual perception for mathematical problems, aligning seamlessly with existing reasoning-optimized models to enhance their capabilities. We will release the model weights, training code, and inference code to benefit the computer vision community.

We acknowledge the reviewer's comment regarding the second weakness: state-of-the-art models achieve over 60% accuracy on MathVista. However, all models achieving such performance either have large-scale parameters (e.g., LLaVA-OneVision with 72B or the InternVL series with 40B/70B parameters) or benefit significantly from large-scale data training, including synthesized knowledge and curated diverse instruction-tuning datasets, such as re-captioned detailed description data, document/OCR data, and multilingual data. In contrast, our model uses smaller-scale data: 40K images for visual-centric training, plus the 60K + 110K samples of the Geo170K alignment and instruction-tuning datasets for MLLM training.

We fully acknowledge that refining data curation and scaling to larger models (70B) are critical for further enhancing our model. We aim to provide a foundation for addressing these challenges, and your valuable insights have helped us identify the directions where further improvements are most needed.

We hope this additional clarification and context address your concerns. We deeply value your feedback and remain committed to improving our work. Should you have any further suggestions or specific points of interest, we are more than willing to address them in the revised version.

Thank you again for your thoughtful review and constructive comments.

Comment

I appreciate the ablations provided by the authors, especially about GeoGLIP. However, due to the lack of analysis on data curation and relatively low performance, I will keep my negative score as 5.

Comment

Esteemed Reviewer,

We apologise for our late reply, which was due to running multiple experiments. Would the reviewer be able to review our rebuttal? We truly appreciate the reviewer's feedback and suggestions on how to further improve our work.

Best regards,

Authors

Review (Rating: 5)

This paper introduces SVE-Math, a Multimodal Large Language Model (MLLM) designed for mathematical question answering. It incorporates a GeoGLIP module to enhance the visual encoder's perception of mathematical elements and utilizes a routing module to prioritize features from CLIP. The training process for SVE-Math consists of three stages: GeoGLIP training, cross-modal alignment, and instruction tuning.

Strengths

1. The approach of enhancing the visual encoder for improved mathematical performance is both innovative and logical.

2. The routing module is well-designed and demonstrates significant performance improvements in the ablation studies. However, I believe that the routing module is not specifically designed for mathematical reasoning tasks and can be applied to a wider range of scenarios.

3. The paper is well-structured and easy to understand.

Weaknesses

1. My main concern is the performance results, which are not particularly impressive. While SVE-Math achieves competitive scores on several benchmarks, the improvements over previous works are marginal, raising questions about the effectiveness of the approach.

2. Building on the first point, I believe a significant portion of the performance improvement in MLLMs stems from the data used. The scale and quality of training data are critical for MLLMs. Could you elaborate on any unique handling or augmentation techniques applied to the training data?

3. Could the authors provide more explanation of why the routing module is specifically designed for mathematical reasoning tasks? Relying solely on empirical evidence is not sufficient to substantiate this claim.

Questions

Please refer to the weaknesses. Can the proposed methods be applied to other mathematical problems beyond geometric figures and problems?

Comment

4. Applicability Beyond Geometric Problems (Continue)

Additionally, SVE-Math supports Chain-of-Thought (CoT) reasoning by combining improved visual perception with logical inference. Examples provided in the revised manuscript (Introduction wrapfigure and Appendix Figures 12-14) demonstrate how SVE-Math effectively recognizes mathematical elements and leverages CoT reasoning to address problems that combine visual and textual inputs.

5. Revisions and Clarifications

To address the reviewer’s concerns and enhance the clarity of our paper, the revised manuscript includes:

  • Expanded Data Descriptions: Detailed explanations of synthetic data generation and examples of annotated diagrams in Section 3.4 and Appendix A.6. A flow diagram illustrating the data engine, along with examples of the synthetic diagrams, is presented in Appendix Fig. 6. Additionally, data statistics for the synthetic math-specific datasets, including the distribution of geometric shapes and the number of objects per image, are visualized in Fig. 5b and Fig. 5c.
  • Routing Module Insights: A dedicated subsection discussing the module's design, functionality, and empirical contributions.
  • Qualitative Examples: Visualizations of model outputs (Introduction wrapfigure and Appendix Figures 12-14) showcasing SVE-Math's ability to integrate visual perception with reasoning.

Comment

We thank the reviewer for insightful questions that help refine our work further.

1. Performance Results and Marginal Improvement

We appreciate the reviewer's comment, but we respectfully disagree with the perception that the performance improvements are marginal. Under identical configurations, including the same base LLM (LLaMA2-7B) and model size (7 billion parameters), our model demonstrates significant performance improvements:

  • MathVerse: SVE-Math-7B achieves 21.2% accuracy, improving by 7.7% over G-LLaVA-7B, with performance comparable to Math-LLaVA-13B (19.0%).
  • MathVista: We conducted additional experiments by integrating GeoGLIP with Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct (two of the most advanced mathematical reasoning LLMs currently available). We evaluated those 7B models on the most challenging MathVista benchmark, achieving 51.3% and 48.7% Top-1 accuracy, even surpassing GPT-4V's performance (49.9%).

These results are achieved under controlled conditions, ensuring that the performance gains arise from the inclusion of GeoGLIP rather than from differences in data scale or quality. Our lightweight, geometry-focused design, with less than a 50MB increase in parameter size and a 0.24s increase in inference time per image, is orthogonal to approaches emphasizing reasoning, making it a natural complement to such methods. GeoGLIP bridges a critical gap in visual perception for mathematical problems, aligning seamlessly with existing reasoning-optimized models to enhance their capabilities. We will release the model weights, training code, and inference code to benefit the computer vision community.

2. Data Contributions and Generalization

The synthetic data used for training GeoGLIP is designed to efficiently improve geometric perception without introducing dataset biases. Unlike manually curated instruction datasets for training MLLMs, our synthetic dataset is programmatically generated using Matplotlib for the box-level shape grounding task, and uses off-the-shelf models to extract junctions and boundaries as pixel-level ground truth for both our synthetic dataset and public mathematical training images (e.g., Geo170K). This efficient process avoids manual labeling and is explicitly used to train GeoGLIP for shape grounding, junction detection, and boundary detection.

Our ablation studies confirm that the improvements in SVE-Math do not stem solely from the training data. When G-LLaVA and SVE-Math-7B are trained on the same datasets, integrating GeoGLIP into G-LLaVA (our SVE-Math-7B) leads to consistent performance gains: 1) MathVerse: 7.7% improvement over G-LLaVA; 2) MathVista: 12.3% improvement. Furthermore, the modular design of GeoGLIP enables generalization across diverse LLM backbones. Experiments with Qwen2.5-Math-7B and DeepSeek-Math-7B demonstrate 6-7% improvements across benchmarks, highlighting GeoGLIP's adaptability to advanced architectures (SVE-Math vs. SVE-Math(-)).

| Model | Base LLM | All (acc) |
| --- | --- | --- |
| G-LLaVA | LLaMA2-7B | 25.1 |
| SVE-Math | LLaMA2-7B | 37.4 |
| SVE-Math(-) | Qwen2.5-7B | 44.0 |
| SVE-Math | Qwen2.5-7B | 51.3 |
| SVE-Math(-) | DeepSeek-7B | 42.3 |
| SVE-Math | DeepSeek-7B | 48.7 |

3. Design and Applicability of the Feature Router

The routing module dynamically prioritizes geometry-rich features from GeoGLIP and semantic-rich features from CLIP, selectively enhancing visual perception. While the module is broadly applicable, it is specifically designed to address challenges in mathematical reasoning in our paper:

  • Selective Filtering: Mathematical tasks often involve irrelevant visual elements that hinder reasoning (as shown in Fig. 1 of the paper). The routing module ensures only relevant cues are passed to the reasoning components, addressing this bottleneck.
  • Empirical Validation: Ablation studies (Table 5) show a 4-6% accuracy improvement attributable to the routing mechanism, confirming its effectiveness.

We acknowledge the reviewer's suggestion for further theoretical substantiation of the routing module's design. This is a valuable direction for future work, where we aim to develop formal frameworks for task-specific feature prioritization.

4. Applicability Beyond Geometric Problems

GeoGLIP is not limited to geometric problems or figures. Its lightweight, modular design enhances visual perception in diverse mathematical tasks, as evidenced by its consistent performance gains across multiple benchmarks, particularly on MathVista, which spans a diverse array of mathematical tasks, including textbook question answering (TQA), visual question answering (VQA), figure question answering (FQA), and icon-based visual question answering (IconQA). Integration with reasoning-optimized LLMs (e.g., DeepSeek-Math-7B) demonstrates its general applicability, yielding improvements in both visual and non-visual tasks (math word problems, MWP).

Comment

We would like to sincerely thank you for your valuable feedback and thoughtful suggestions on our paper. We have carefully considered your recommendations, addressed your concerns, and updated both the revised paper and our responses accordingly.

If you have any further questions or concerns, we would be delighted to address them promptly. Your insights are crucial to us, and we deeply appreciate the time and effort you have dedicated to reviewing our work.

To summarize our main contributions:

We systematically identify and analyze the impact of visual recognition errors on the mathematical reasoning performance of MLLMs, highlighting the critical role of accurately perceiving geometric primitives. This new aspect is orthogonal to existing methods focused on improving reasoning.

We designed GeoGLIP, a lightweight, geometry-aware visual model with multitask learning capabilities, including shape grounding, junction detection, and boundary detection. GeoGLIP integrates seamlessly with diverse LLM backbones without requiring modifications to their reasoning components. Despite adding less than a 50MB increase in parameter size and only a 0.24s increase in inference time per image, and without relying on additional mathematical instruction datasets, our approach achieves an 8–12% improvement in top-1 accuracy compared to the baseline (using LLaMA2-7B as the base LLM).

When paired with advanced LLMs like DeepSeek and Qwen, our 7B model achieves performance comparable to GPT-4V, with 51.3% and 48.7% on the challenging MathVista benchmark, versus 49.9% for GPT-4V. While our 7B model does not surpass state-of-the-art MLLMs with 40B/70B parameters achieving over 60% accuracy, integrating GeoGLIP into such large-scale LLMs is currently computationally prohibitive due to our limited resources. We hope this work inspires further research into more effective fine-grained visual understanding in MLLMs. To support the community and assist other researchers in scaling our method to larger models and datasets, we will release the model weights, training scripts, and inference codes to facilitate broader adoption and experimentation.

Review (Rating: 5)

To address the limitations of multimodal large language models (MLLMs) in solving math problems involving images, this paper proposes a Selective Vision-Enhanced Mathematical MLLM. It leverages a geometric-grounded vision encoder and a feature router to help MLLMs better comprehend mathematical image features, thereby improving their performance on math problems with visual components.

Strengths

  1. The paper clearly articulates the problem it aims to address, and the overall writing is easy to follow.

  2. The paper enhances MLLM's ability to recognize mathematical images and solve math problems by introducing geometry-rich visual information, achieving improvements on several benchmarks.

Weaknesses

  1. Using more detailed visual features for solving math problems is an intuitive idea, as is combining geometric and semantic features at different levels. However, you should conduct additional ablation studies to validate the effectiveness of this approach. For instance, consider using vision encoders from other similar models on your dataset, or training your model on the training data of other models.

  2. In Table 1, some experimental results differ from those provided in the official MathVerse table. For example, you show the cot-e score for SPHINX-Plus and the w/o score for SPHINX-MOE. When comparing with other models on the same benchmark, you should ensure thorough variable control.

  3. You mention using synthetic data, but the paper does not include any description, details, or examples of the synthetic data generation process.

  4. The paper does not present any output examples from the model.

  5. As a “data collection-model training-benchmark testing” type of paper, the performance improvements on benchmarks are minimal in the absence of novelty.

Questions

  1. In terms of writing, the paper’s section distribution could be improved. You should allocate some space to introduce synthetic data, dedicate more space to ablation studies to validate the method's effectiveness, and reduce the length of the Methods section.

  2. Please provide more details and examples of the synthetic data.

  3. Please provide examples of the model’s outputs to demonstrate its ability to recognize geometric elements and Chain-of-Thought (CoT), as you compared cot-e performance with some models in Table 1.

  4. In the Introduction, you mentioned a finding: instructing MLLMs with fine-grained visual information improves top-1 accuracy compared to providing only worded questions, while providing all visual cues for solving a math question decreases accuracy. How does your approach—primarily by introducing more geometry-rich visual information—address the issue highlighted by this finding?

Comment

3. Additional Ablation Studies (Continue).

2) Training our model on the training data of other models. We are the first to construct a math-specific dataset, including geometric bounding box annotations, as well as junction and boundary annotations. Thus, we leveraged the original hierarchical pyramid features from the GLIP visual encoder (trained on natural image datasets, such as Object365 and MSCOCO). To ensure a fair comparison, we utilized feature maps with the same resolution: the first layer with the largest resolution and the last three layers with smaller resolutions. This resulted in a performance drop from 67.0% to 65.3%, as GLIP lacks sensitivity to geometric details and fails to detect basic geometric shapes, as visualized in Fig. 9.

Liu et al., Grounding dino: Marrying dino with grounded pre-training for open-set object detection, ECCV 2024.

DETR: Carion et al., End-to-end object detection with transformers, ECCV 2020.

Faster R-CNN: Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, NeurIPS 2015.

3. Comparison and Control in Table 1

We appreciate your observation regarding discrepancies in Table 1. We have carefully revisited the MathVerse dataset and revalidated all results under consistent experimental setups, ensuring strict variable control. In Table 1, our model reports direct accuracy under the 'w/o' scores, instead of using the CoT evaluation strategy. Additionally, we have updated the corrected accuracy for other models.

4. Synthetic Data Generation

Thank you for highlighting the need for elaboration on the synthetic data. We now provide a detailed explanation in Section 3.4 and Appendix A.6. Our synthetic dataset is programmatically generated using Matplotlib for the box-level shape grounding task, and uses off-the-shelf models to extract junctions and boundaries as pixel-level ground truth for both our synthetic dataset and public mathematical training images (e.g., Geo170K). This efficient process avoids manual labeling and is explicitly used to train GeoGLIP for shape grounding, junction detection, and boundary detection. A flow diagram illustrating the data engine, along with examples of the synthetic diagrams, is presented in Appendix Fig. 6. Additionally, data statistics for the synthetic math-specific datasets, including the distribution of geometric shapes and the number of objects per image, are visualized in Fig. 5b and Fig. 5c.
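As a rough illustration of what such programmatic generation can look like (not the actual data engine described in Sec. 3.4 / Appendix A.6), the sketch below draws a single random circle with Matplotlib and records its normalized bounding box as grounding supervision; the function name, shape choice, and output format are assumptions.

```python
import json
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def make_sample(path):
    """Draw one random circle on a blank canvas and record its bounding box.

    A toy stand-in for the box-level part of the data engine; shape types and
    the JSON annotation format are illustrative assumptions.
    """
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)  # roughly a 224x224 canvas
    ax = fig.add_axes([0, 0, 1, 1])                  # axes fill the whole figure
    cx = random.uniform(0.3, 0.7)
    cy = random.uniform(0.3, 0.7)
    r = random.uniform(0.05, 0.2)
    ax.add_patch(patches.Circle((cx, cy), r, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    fig.savefig(path)
    plt.close(fig)
    # Normalized (x, y, w, h) box used as shape-grounding ground truth.
    return {"image": path, "label": "circle", "bbox": [cx - r, cy - r, 2 * r, 2 * r]}

print(json.dumps(make_sample("sample_0.png")))
```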

5. Examples of Model Outputs

We have added qualitative examples of SVE-Math’s outputs to the revised paper (Figures 12-14). These examples illustrate its ability to: 1) Accurately recognize geometric primitives and positional relationships, facilitating clear and logical mathematical reasoning in the model's responses, and 2) Apply Chain-of-Thought (CoT) reasoning to effectively integrate visual and textual information.

Refer to Section A.5 for more analysis.

6. Addressing Accuracy Drop with Excess Visual Cues

Our approach directly addresses the paradox of excess visual information lowering accuracy, as noted in GPT-4V's performance (Fig. 1c). By dynamically adjusting the contributions of visual features through the feature router, our method filters irrelevant cues, providing only contextually relevant visual prompts. This selective enhancement improves reasoning without introducing noise, as demonstrated by our controlled experiments in Table 5a of Section 4. Specifically, the constant router assigns equal weights to all features, the sparse router selects a single level of feature map from GeoGLIP, and the soft router assigns learnable dynamic weights. We present the top-1 accuracy results from Table 5a for these configurations. For the sparse router, only the best performance, achieved with the first-level feature map, is shown in the table below.

| Model | Top-1 Acc (GeoQA) |
| --- | --- |
| Constant | 62.8 |
| Sparse | 64.9 |
| Soft | 67.0 |

7. Minimal Performance Gains

While the reviewer perceives performance gains as minimal, we respectfully disagree. Under identical configurations, including the same base LLM (LLaMA2-7B) and model size (7 billion parameters), our model demonstrates significant performance improvements. For example, as detailed in Tables 1 and 2, integrating our method into G-LLaVA (our SVE-Math-7B) improves Top-1 accuracy by 7.7% on MathVerse and 12.3% on MathVista. Other mathematical MLLMs often rely on larger-scale models or more advanced LLMs with stronger reasoning capabilities. Comparing our 7B model, based on the standard LLaMA2-7B, to these MLLMs may not provide a fully equitable evaluation. We conducted additional experiments by integrating GeoGLIP with Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct (two of the most advanced mathematical reasoning LLMs currently available).

Comment

7. Minimal Performance Gains (Continue)

We evaluated those 7B models on the most challenging MathVista benchmark, achieving 51.3% and 48.7% Top-1 accuracy, even surpassing GPT-4V's performance (49.9%). Again, we observe a consistent 6%-7% improvement compared to the variant excluding GeoGLIP features as additional visual prompts (SVE-Math(-)). These results reaffirm the complementary nature of the GeoGLIP visual encoder with reasoning abilities and highlight its generalizability across diverse architectures. We will release the model weights, training code, and inference code to benefit the computer vision community.

| Model | Base LLM | All (acc) |
| --- | --- | --- |
| G-LLaVA | LLaMA2-7B | 25.1 |
| SVE-Math | LLaMA2-7B | 37.4 |
| SVE-Math(-) | Qwen2.5-7B | 44.0 |
| SVE-Math | Qwen2.5-7B | 51.3 |
| SVE-Math(-) | DeepSeek-7B | 42.3 |
| SVE-Math | DeepSeek-7B | 48.7 |

8. Section Organization

We appreciate the suggestion to improve section distribution. In the revised manuscript:

  • The Methods section has been streamlined, with detailed training protocols moved to the appendix.
  • Synthetic data descriptions and model output examples have been expanded in the main text.

Comment

The authors have indeed addressed some of the issues highlighted in the review by improving the clarity and coherence of their paper. The revised version is more fluent and makes it easier to grasp the key points.

Additionally, the authors have conducted the missing experiments mentioned in the review.

As a result, I am raising my score to 5.

Comment

We sincerely thank the reviewer for their thoughtful feedback, which provides valuable insights for refining our work. We address your concerns and questions below, supplemented by additional experiments and clarifications in our revised submission.

1. Data collection-model training-benchmark testing.

We would like to clarify the distinction between our synthetic math-specific datasets and traditional mathematical instruction datasets. We do not create or use any additional self-generated instruction datasets beyond the publicly available Geo170K and MathV360K datasets for MLLM training. Instead, our synthetic samples, annotated with box/pixel-level details, are exclusively utilized to train the GeoGLIP visual encoder. Compared to constructing mathematical instruction datasets, our synthetic data generation process is significantly more efficient and resource-friendly. It does not require manual labeling, as all data can be programmatically generated, e.g., through the Matplotlib Python library. In contrast, constructing instruction datasets often relies on GPT-4o to create diverse prompts and necessitates human intervention, making the process labor-intensive and costly. Moreover, training the lightweight, visual-centric GeoGLIP involves straightforward training recipes. In comparison, instruction tuning for MLLMs requires intricate configurations, such as carefully curated batch sizes and learning rates, as noted in [Shengbang et al., 2024].

Shengbang Tong et al., Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, arXiv 2024.

2. Novelty of Using Geometry-Rich Features.

While leveraging geometry-rich visual information might appear intuitive, we argue that our method goes beyond simply combining geometric and semantic features. Our key contribution lies in the introduction of GeoGLIP, a domain-specific, geometry-aware visual encoder with hierarchical feature pyramids dynamically weighted by a feature router mechanism. GeoGLIP is equipped with multi-task learning capabilities, including shape grounding, junction detection, and boundary detection (Section 3.2). These innovations directly address the bottleneck of fine-grained visual perception in mathematical reasoning, as demonstrated by our systematic error analysis (Fig. 1). Notably, existing models like GPT-4V misinterpret geometric primitives in 70% of cases, as highlighted in our study. In response to reviewer suggestions, we compare the outputs of our model with other advanced MLLMs (GPT-4o, GPT-4V, and InternVL2). For instance, as shown in the Introduction, GPT-4o struggles to accurately perceive mathematical elements, impairing its ability to describe their relationships during the reasoning process. By integrating GeoGLIP, our SVE-Math effectively grounds geometric elements and their positional relationships, enabling accurate reasoning.
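For readers who want a concrete picture of what such multi-task training could mean in code, here is a minimal, hypothetical sketch of a combined objective over the three tasks; the loss choices, tensor keys, and weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultiTaskGeoLoss(nn.Module):
    """Illustrative combination of shape grounding, junction, and boundary losses."""

    def __init__(self, w_ground=1.0, w_junction=1.0, w_boundary=1.0):
        super().__init__()
        self.weights = (w_ground, w_junction, w_boundary)
        # Pixel-level heatmap targets are assumed for junctions and boundaries.
        self.junction_loss = nn.BCEWithLogitsLoss()
        self.boundary_loss = nn.BCEWithLogitsLoss()

    def forward(self, outputs: dict, targets: dict) -> torch.Tensor:
        # `grounding_loss` stands in for a GLIP-style box regression + alignment term.
        l_ground = outputs["grounding_loss"]
        l_junction = self.junction_loss(outputs["junction_logits"], targets["junction_map"])
        l_boundary = self.boundary_loss(outputs["boundary_logits"], targets["boundary_map"])
        w_g, w_j, w_b = self.weights
        return w_g * l_ground + w_j * l_junction + w_b * l_boundary
```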

3. Additional Ablation Studies.

We appreciate the reviewer's insightful question. We have conducted comprehensive ablation experiments. These include:

1) Using vision encoders from other similar models on our dataset. We would like to clarify why we chose the GLIP detector: GLIP is an open-set object detector capable of identifying arbitrary classes by matching visual features with corresponding language embeddings. Unlike traditional object detectors with learnable classification weights, GLIP's multi-modal architecture offers greater generality to novel objects and surpasses previous traditional object detectors. In response to the reviewer's concern, we replaced GLIP with another open-set object detector, Grounding DINO [Liu et al., 2024], and fine-tuned it on our math-specific dataset. We visualized the detection results, as we did for GeoGLIP in Fig. 9 and Fig. 10, which show that Grounding DINO fails to effectively detect small-scale geometric primitives. Upon debugging the code and training setup, we hypothesize that this limitation is due to architectural differences. Grounding DINO, as a DETR-based detector, relies solely on the last-layer features of its visual encoder for cross-attention with query embeddings for final detection. In contrast, GLIP, as a Faster R-CNN-based detector, utilizes multi-scale features for both bounding box regression and classification, offering better small-object detection capabilities. When integrating the fine-tuned Grounding DINO encoder into our pipeline, the top-1 accuracy on the GeoQA benchmark dropped from 67.0% to 66.1%, further supporting GLIP's advantages for our tasks.

Comment

We are very glad that our response has resolved your concerns. We thank the reviewer for valuable comments helping us improve our work. We will address them all in the revised manuscript.

Is there anything else we could improve or refine in order to obtain a score of 6?

Please let us know if there are any additional technical aspects you believe we could refine further.

Best regards,

Authors

Review (Rating: 5)

The paper first identifies visual recognition errors prevalent of current MLLMs by a pilot study. Then the paper introduces GeoGLIP, a vision encoder specifically trained to identify geometric elements in the image. The feature from the trained geometric vision encoder is later merged with the feature of the original CLIP vision encoder, aiming at more precise geometry perception. The authors prove the effectiveness of their method by evaluating on various benchmarks.

Strengths

  • The paper proposes a novel perspective that errors on visual mathematical problems come from poor visual perception.
  • The three training tasks of GeoGLIP closely match the analysis in Fig. 1. The paper is self-contained and well-written.

Weaknesses

  • The major concern is whether addressing visual perception errors is sufficient for the MLLM to correctly solve these tasks. Visual mathematical questions also require advanced reasoning capability, especially merging both visual and textual information. Only correctly identifying elements in the graph seems far from enough to solve a mathematical problem. Detecting the texts, shapes, or curves in the graph does not necessarily mean the model understands the elements. How much GeoGLIP actually helps understanding and reasoning seems marginal. The pilot study shown in Fig. 1 also only analyzes errors in the visual descriptions, while neglecting other potential core problems of MLLMs on visual mathematical questions.
  • The effectiveness of the proposed GeoGLIP is not validated. The authors need to report the performance of the model trained with the same instruction data but without the GeoGLIP encoder to illustrate the improvement brought by it. Otherwise, the improvement may come from the Geo170K data.
  • The overall performance advantages of SVE-Math compared to previous works are not very obvious.

Questions

  • How much more computational cost and inference time is introduced by GeoGLIP?
Comment

3. The effectiveness of the proposed GeoGLIP is not validated.

We apologize that we did not explicitly clarify this point earlier, which led to the concern that the improvement observed with our approach primarily stems from the instruction dataset used (Geo170K). To clarify, removing the GeoGLIP encoder reduces our SVE-Math-7B to G-LLaVA [Gao et al., 2023a]. Both G-LLaVA and our approach leverage the same LLM backbone (LLaMA2-7B) and the Geo170K instruction dataset, ensuring that the performance gains are directly attributable to the inclusion of the GeoGLIP encoder rather than the instruction dataset. The comparison results are detailed in Tables 1-3. Notably, our SVE-Math-7B even achieves performance comparable to Math-LLaVA-13B on MathVerse (19.0% vs. 21.2%), particularly excelling in the 'visual-only' scenario (16.4% vs. 20.3%). This scenario strips away the entire textual input, conveying the problem solely through the diagram.

Jiahui Gao et al., G-llava: Solving geometric problem with multi-modal large language model, arXiv 2023.

4. The overall performance advantages of SVE-Math compared to previous works are not very obvious.

Thank you for the feedback. We politely disagree. As highlighted in the responses above, under identical configurations, including the same base LLM (LLaMA2-7B) and model size (7 billion parameters), our model demonstrates significant performance improvements. Other mathematical MLLMs often rely on larger-scale models or more advanced LLMs with stronger reasoning capabilities. Comparing our 7B model, based on the standard LLaMA2-7B, to these MLLMs may not provide a fully equitable evaluation. In response to the reviewer's concern, we conducted additional experiments by integrating GeoGLIP with Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct (two of the most advanced mathematical reasoning LLMs currently available). We evaluated those 7B models on the most challenging MathVista benchmark, achieving 51.3% and 48.7% Top-1 accuracy, even surpassing GPT-4V's performance (49.9%). Again, we observe a consistent 6%-7% improvement compared to the variant excluding GeoGLIP features as additional visual prompts (SVE-Math(-)). These results reaffirm the complementary nature of the GeoGLIP visual encoder with reasoning abilities and highlight its generalizability across diverse architectures. We will release the model weights, training code, and inference code to benefit the computer vision community.

| Model | Base LLM | All (acc) |
| --- | --- | --- |
| G-LLaVA | LLaMA2-7B | 25.1 |
| SVE-Math | LLaMA2-7B | 37.4 |
| SVE-Math(-) | Qwen2.5-7B | 44.0 |
| SVE-Math | Qwen2.5-7B | 51.3 |
| SVE-Math(-) | DeepSeek-7B | 42.3 |
| SVE-Math | DeepSeek-7B | 48.7 |

5. How much more computational cost and inference time is introduced by GeoGLIP?

SVE-Math-7B introduces minimal computational overhead, as detailed in the comparison table below. The GeoGLIP encoder and the Connector contribute additional parameter sizes of 32.65MB and 8.73MB, respectively, with the Projectors accounting for a further 16.13MB. The inference time per sample increases slightly, from 19.80s to 20.04s (+0.24s). Training is conducted on 8 A100 GPUs with a batch size of 128 using the MathV360K dataset, which includes 40K images and 360K question-answer pairs. The total training time shows only a marginal increase, from 10.35h to 10.54h (+0.19h), demonstrating scalability for larger models and datasets.

| Model | GeoGLIP | Connector | Projectors | Inference time (per sample) | Training time (MathV360K) |
| --- | --- | --- | --- | --- | --- |
| G-LLaVA | - | - | 16.52MB | 19.80s | 10.35h |
| SVE-Math | 32.65MB | 8.73MB | 31.20MB | 20.04s | 10.54h |
Comment

We thank the reviewer for helpful comments.

1. Response to Reviewer Concern: Addressing visual perception errors is insufficient for the MLLM to correctly solve these tasks.

Thank you for raising this important concern about the insufficiency of addressing visual perception errors in MLLMs for solving mathematical tasks, especially those requiring the integration of visual and textual information and advanced reasoning. We appreciate the opportunity to clarify our contributions and discuss the interplay between these capabilities.

Key Clarifications on Visual and Reasoning Abilities. We conceptualize the capabilities required for visual mathematical reasoning into three core abilities: visual perception, visual understanding, and text-world reasoning. Visual perception refers to the ability to recognize basic geometric primitives (shapes, bounding box locations, and boundaries), which serve as the building blocks of mathematical diagrams; Visual understanding involves aligning visual features with their corresponding textual embeddings—addressing the reviewers' concern about how the model comprehends geometric elements; text-world reasoning refers to the model's capacity to follow logical reasoning steps for providing the final answer.

While prior research has predominantly focused on the last of these, reasoning ability, by constructing large-scale mathematical visual instruction datasets and fine-tuning MLLMs on mathematical domains, our work takes an orthogonal approach by emphasizing visual perception as a critical yet underexplored foundation for effective mathematical problem solving.

Addressing Visual Perception and Visual Understanding Gaps. Our study is the first to systematically analyze the impact of fine-grained visual cues on MLLM performance for mathematical tasks. Figure 1 highlights that visual recognition errors are pervasive in MLLMs and significantly degrade their mathematical reasoning capabilities. These errors stem from deficiencies in both visual perception and visual understanding.

To address these challenges, our contributions include:

  1. A geometric visual encoder (Geometric-Grounded Language-Image Pre-training, dubbed GeoGLIP): This encoder enhances perception by accurately identifying basic geometric shapes, junctions, and boundaries, thereby addressing the foundational layer of visual recognition.

  2. Initial methods for visual understanding: In Section 3.3, we describe a straightforward connector design leveraging the simple and effective MLP projectors (linear layer + GELU + linear layer), similar to LLaVA. This approach is a starting point for addressing visual-textual alignment.
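As a point of reference, here is a minimal sketch of such an MLP projector; the text above only specifies linear → GELU → linear, similar to LLaVA, so the helper name and default dimensions are placeholders added for illustration.

```python
import torch.nn as nn

def build_projector(vision_dim: int = 1024, llm_dim: int = 4096) -> nn.Module:
    """LLaVA-style two-layer MLP projector: linear -> GELU -> linear.

    Maps visual tokens from the encoder dimension to the LLM embedding
    dimension; the default sizes here are illustrative assumptions only.
    """
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )
```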

Future Directions for Enhanced Visual Understanding. We acknowledge that more sophisticated strategies could further improve visual understanding, especially for directly aligning individual geometric objects with their corresponding textual descriptions. Achieving this would require a visual tokenizer capable of representing each object as individual visual tokens, rather than relying on the simple grid-based partitioning used in current visual encoders, which fails to guarantee the integrity of whole objects. To the best of our knowledge, such a visual tokenizer does not currently exist, making its development another promising direction for future research.

We thank the reviewer for pointing out these critical aspects and hope this response clarifies our contributions and the scope of our research.

2. How much GeoGLIP actually helps in understanding and reasoning seems marginal.

Thank you for pointing this out. As noted in the first response, the main goal of GeoGLIP is to enhance the visual perception capabilities of MLLMs in a way that complements the subsequent understanding and reasoning abilities. To support our motivation, we conducted an additional systematic analysis to quantify the impact of visual perception ability on mathematical reasoning tasks. By manually correcting each visual perception error identified in Figure 1, we observed an overall increase of approximately 12% in accuracy on the corresponding mathematical questions. A detailed bar plot of these statistics is included in Figure 5a of the revised paper, providing direct evidence of the importance of enhanced visual perception. We also provide the model response outputs in the Introduction and Section A.5 of the revision. Further evidence for the benefits of improved perception is shown in Tables 1-3. Compared to the baseline model G-LLaVA, which shares the same reasoning process and LLM backbone (LLaMA2-7B), the only change in our approach is the integration of the GeoGLIP features. The improvement is significant. For example, as detailed in Tables 1 and 2, integrating our method into G-LLaVA (our SVE-Math-7B) improves Top-1 accuracy by 7.7% on MathVerse and 12.3% on MathVista, underscoring the substantial contribution of enhanced visual perception to overall performance.

Comment

We would like to sincerely thank you for your valuable feedback and thoughtful suggestions on our paper. We have carefully considered your recommendations, addressed your concerns, and updated both the revised paper and our responses accordingly.

If you have any further questions or concerns, we would be delighted to address them promptly. Your insights are crucial to us, and we deeply appreciate the time and effort you have dedicated to reviewing our work.

To summarize our main contributions:

We systematically identify and analyze the impact of visual recognition errors on the mathematical reasoning performance of MLLMs, highlighting the critical role of accurately perceiving geometric primitives. This new aspect is orthogonal to existing methods focused on improving reasoning.

We designed GeoGLIP, a lightweight, geometry-aware visual model with multitask learning capabilities, including shape grounding, junction detection, and boundary detection. GeoGLIP integrates seamlessly with diverse LLM backbones without requiring modifications to their reasoning components. Despite adding less than a 50MB increase in parameter size and only a 0.24s increase in inference time per image, and without relying on additional mathematical instruction datasets, our approach achieves an 8–12% improvement in top-1 accuracy compared to the baseline (using LLaMA2-7B as the base LLM).

When paired with advanced LLMs like DeepSeek and Qwen, our 7B model achieves performance comparable to GPT-4V, with 51.3% and 48.7% on the challenging MathVista benchmark, versus 49.9% for GPT-4V. While our 7B model does not surpass state-of-the-art MLLMs with 40B/70B parameters achieving over 60% accuracy, integrating GeoGLIP into such large-scale LLMs is currently computationally prohibitive due to our limited resources. We hope this work inspires further research into more effective fine-grained visual understanding in MLLMs. To support the community and assist other researchers in scaling our method to larger models and datasets, we will release the model weights, training scripts, and inference codes to facilitate broader adoption and experimentation.

Comment

We sincerely thank the reviewers for their insightful feedback and constructive suggestions. We are delighted that our approach, SVE-Math, and the GeoGLIP module have been recognized as innovative and logical steps toward addressing the limitations of current multimodal large language models (MLLMs) in visual mathematical reasoning.



We have addressed all comments in individual responses to each reviewer.


Below, we address the key points raised across the reviews and clarify several aspects of our work to better demonstrate its contributions and implications.

1. Clarification of Goal and Contribution

  • Our primary goal is not to solve the entire spectrum of mathematical reasoning tasks but to enhance the visual grounding capabilities of MLLMs in a way that complements their reasoning abilities. Our approach, which integrates a geometry-rich visual encoder (GeoGLIP), is orthogonal to existing methods focused on improving reasoning. By doing so, we aim to address the persistent bottleneck of fine-grained visual perception in mathematical contexts, as detailed in Section 1 and supported by the systematic analysis in Figure 1 and Figure 5a of the paper.

  • GeoGLIP serves as a lightweight, domain-specific enhancement, specifically addressing geometric visual recognition errors. Importantly, it is designed to work seamlessly with diverse LLM backbones without requiring modifications to their reasoning components. This adaptability underscores its novelty and broad applicability.
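To make the integration concrete, the connector can be thought of as a small fusion module that merges CLIP patch tokens with GeoGLIP features before they are fed to the LLM. The sketch below is a minimal placeholder (concatenation plus an MLP projection, with invented dimensions and token counts), not the exact connector design used in the paper.

```python
import torch
import torch.nn as nn

class DualVisionConnector(nn.Module):
    """Minimal sketch: fuse CLIP tokens with geometry-aware features via concat + MLP."""
    def __init__(self, clip_dim=1024, geo_dim=256, llm_dim=4096):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, clip_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, clip_tokens, geo_tokens):
        # clip_tokens: (B, N, clip_dim); geo_tokens: (B, N, geo_dim),
        # assumed already resampled to the same token grid as CLIP.
        fused = torch.cat([clip_tokens, self.geo_proj(geo_tokens)], dim=-1)
        return self.mlp(fused)  # (B, N, llm_dim) visual prompts for the LLM

# Example with 576 visual tokens, as in LLaVA-style models.
connector = DualVisionConnector()
print(connector(torch.randn(2, 576, 1024), torch.randn(2, 576, 256)).shape)
```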

2. Effectiveness of GeoGLIP and Dataset Independence

  • A key concern raised by reviewers is whether the observed improvement stems primarily from the instruction dataset used (Geo170K). To address this, we emphasize that the comparison with G-LLaVA [Gao et al., 2023a] in our paper is conducted under controlled conditions: both G-LLaVA and our model use the same LLM backbone (LLaMA2-7B) and the Geo170K dataset, ensuring that any performance difference arises from the inclusion of the GeoGLIP encoder rather than the instruction data. The comparison results in Tables 1–3 confirm the effectiveness of GeoGLIP, which significantly enhances visual mathematical reasoning. As shown in Tables 1 and 2, G-LLaVA with GeoGLIP (our SVE-Math-7B) improves Top-1 accuracy by 7.7% on MathVerse and 12.3% on MathVista. The improvement is not trivial, and SVE-Math-7B achieves performance comparable to Math-LLaVA-13B on MathVerse (19.0% vs. 21.2%), particularly excelling in the 'visual-only' scenario (16.4% vs. 20.3%), which strips away the entire textual input and conveys the problem solely through the diagram.

  • Additionally, we conducted further experiments integrating GeoGLIP with LLM backbones that exhibit stronger mathematical reasoning than LLaMA2-7B (e.g., Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct, two of the most advanced mathematical reasoning LLMs currently available). On the most challenging MathVista benchmark, our model achieves 51.3% and 48.7% Top-1 accuracy, comparable to GPT-4V's 49.9%. Again, we observe a consistent 6%–7% improvement over the variant that excludes GeoGLIP features as additional visual prompts. These results reaffirm the complementary nature of the GeoGLIP visual encoder to reasoning abilities and highlight its generalizability across diverse architectures. We will release the model weights, training scripts, and inference code to benefit the community.

  • Finally, we would like to clarify the distinction between our synthetic math-specific datasets and traditional mathematical instruction datasets. We do not create or use any additional self-generated instruction datasets beyond the publicly available Geo170K and MathV360K for MLLM training. Instead, our synthetic samples, annotated with box- and pixel-level details, are used exclusively to train the GeoGLIP visual encoder (a minimal example of this programmatic generation is sketched after the reference below). Compared to constructing mathematical instruction datasets, our synthetic data generation is significantly more efficient and resource-friendly: it requires no manual labeling, as all data can be generated programmatically, e.g., with the Matplotlib Python library. In contrast, constructing instruction datasets often relies on GPT-4o to create diverse prompts and necessitates human intervention, making the process labor-intensive and costly. Moreover, training the lightweight, visual-centric GeoGLIP involves straightforward training recipes, whereas instruction tuning for MLLMs requires intricate configurations, such as carefully curated batch sizes and learning rates, as noted in [Tong et al., 2024].

Shengbang Tong et al., Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, arXiv 2024.
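The sketch below illustrates the kind of programmatic generation we mean: a figure is drawn with Matplotlib and its box/junction annotations come for free from the drawing parameters. The shape vocabulary, rendering settings, and annotation format of our actual pipeline are not specified here; everything in this snippet is a simplified stand-in.

```python
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def synth_sample(path="sample.png", size=4):
    """Draw a random rectangle and circle; return box/junction annotations.

    Purely illustrative: not the actual GeoGLIP training-data generator."""
    fig, ax = plt.subplots(figsize=(size, size))
    x, y, w, h = random.uniform(0.1, 0.4), random.uniform(0.1, 0.4), 0.3, 0.25
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False))
    cx, cy, r = random.uniform(0.55, 0.8), random.uniform(0.55, 0.8), 0.1
    ax.add_patch(patches.Circle((cx, cy), r, fill=False))
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis("off")
    fig.savefig(path, dpi=100)
    plt.close(fig)
    return {
        "boxes": [
            {"label": "rectangle", "xywh": [x, y, w, h]},
            {"label": "circle", "xywh": [cx - r, cy - r, 2 * r, 2 * r]},
        ],
        # rectangle corners as junction points
        "junctions": [[x, y], [x + w, y], [x, y + h], [x + w, y + h]],
    }

print(synth_sample())
```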

Comment

3. Performance and Efficiency

As detailed in Tables 1–4, SVE-Math achieves substantial improvements on benchmarks such as MathVerse, MathVista, and GeoQA. Notably, when based on LLaMA-series LLMs, it outperforms all models with the same configuration and is even comparable to larger-scale models such as Math-LLaVA-13B, while maintaining computational efficiency. Equipping SVE-Math with more advanced LLMs boosts performance further. We recognize the importance of computational efficiency: the lightweight GeoGLIP and connector introduce only minimal overhead, with less than a 50MB increase in parameter size and a 0.24s increase in inference time per image, ensuring scalability to larger models and datasets. Detailed efficiency metrics will be added to the revised manuscript.
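The overhead numbers above can be reproduced with a simple measurement routine of the following form. The snippet is a sketch under assumed inputs: `extra_modules` stands in for GeoGLIP plus the connector and `forward_fn` for one full forward pass; neither corresponds to a specific function in our codebase.

```python
import time
import torch

def overhead_report(extra_modules, forward_fn, image, n_warmup=3, n_runs=10):
    """Report added parameter size (MB, fp32) and average per-image latency."""
    n_params = sum(p.numel() for m in extra_modules for p in m.parameters())
    size_mb = n_params * 4 / (1024 ** 2)  # fp32 bytes -> MB
    for _ in range(n_warmup):            # warm-up runs excluded from timing
        forward_fn(image)
    start = time.time()
    for _ in range(n_runs):
        forward_fn(image)
    latency = (time.time() - start) / n_runs
    return {"extra_params_MB": size_mb, "latency_s_per_image": latency}

# Example with a dummy module standing in for GeoGLIP + connector.
dummy = torch.nn.Linear(256, 256)
print(overhead_report([dummy], dummy, torch.randn(1, 256)))
```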

4. Ablation Studies and Model Outputs

The paper already includes ablations validating GeoGLIP (G-LLaVA vs. SVE-Math-7B in Tables 1–3) and the effect of individual visual features from GeoGLIP (middle panel of Table 5a). To address the reviewers' concerns, we have conducted additional experiments, including testing SVE-Math with different LLM backbones (e.g., Qwen2.5-Math-7B-Instruct and DeepSeek-Math-7B-Instruct), evaluating the impact of math-specific fine-tuning, and replacing the vision encoder with alternative models. These results consistently demonstrate the value of GeoGLIP for visual mathematical reasoning. We will also include visualizations of model outputs, synthetic data generation details, and additional qualitative results in the revised paper to strengthen the presentation.

5. Revisions

In response to reviewers' suggestions, we will reorganize the paper to allocate more space to synthetic data descriptions, model outputs, and additional ablation results. The Methodology section will be streamlined by moving training details to the appendix, ensuring clarity and balance. Note that the Qwen2.5 results were obtained after the revision submission deadline; they will be included in the final version.



We truly hope these clarifications and additional experiments address the reviewers' concerns and showcase the merit of our work.


Kind regards,
Authors

Comment

Esteemed AC and reviewers,

Thank you for your thoughtful review and feedback. We sincerely apologize for the delay in posting our rebuttal. The delay was due to the time required to conduct additional experiments and obtain the results needed to thoroughly address the concerns raised (7 days + 28 hours + 2 days and 18 hours + 26 hours + 12.7 hours + 2 days, itemized below). Once the results were available, we promptly submitted the rebuttal and paper revision to ensure the response was both accurate and comprehensive.

  • Experiments on cutting-edge LLMs (in response to all reviewers): Evaluating the effectiveness of our geometry-aware visual encoder (GeoGLIP) requires comprehensive ablation studies on state-of-the-art LLMs such as DeepSeek and Qwen. Specifically, we compare GeoGLIP with configurations using only the CLIP visual encoder. These experiments involve four configurations, all trained on the MathV360K and Geo170K datasets. We have access to only one machine with 8 GPUs, which limits our ability to run experiments in parallel; the entire training process takes approximately one week to complete.

  • Experiments on other visual-centric models or training with different datasets (in response to Reviewer yC4U): Using 8 A100 GPUs, training another detector (Grounding DINO) on our math-specific datasets requires approximately 28 hours, while training our visual model on an alternative dataset takes around 2 days and 18 hours. Integrating these visual features into the LLaVA-7B model on the Geo170K dataset takes an additional 26 hours in total.

  • Experiments on GLIP and directly providing geometric information (in response to Reviewer M9gf): Integrating the GLIP hierarchical pyramid features into MLLMs on the Geo170K dataset requires 12.7 hours. As suggested by the reviewer, we also explored directly providing geometric information to MLLMs as text. We could not find existing mathematical instruction datasets that include location information for geometric objects (e.g., bounding box coordinates or junction points). To address this, we ran GeoGLIP over the Geo170K training images to extract such location information and appended it after the special <image> token in the human turn, using instructions such as: "there is a bounding box at ⟨x, y, w, h⟩" (a sketch of this prompt construction follows this list). We capped the number of detected objects per image at 10 to keep the prompts manageable. This data processing step, together with MLLM training, required approximately 2 days to complete.
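The sketch below shows roughly how such prompt construction could look. The exact wording of the hints and the Geo170K conversation schema are approximated; the function names and field layout are illustrative only.

```python
def boxes_to_prompt(boxes, max_objects=10):
    """Turn detector outputs into textual hints appended after <image>.

    Wording and box format are approximations of what we describe above."""
    lines = [
        "there is a bounding box at <{:.2f}, {:.2f}, {:.2f}, {:.2f}>".format(*b["xywh"])
        for b in boxes[:max_objects]
    ]
    return "\n".join(lines)

def build_sample(question, boxes):
    """LLaVA-style conversation with geometric hints placed in the human turn."""
    return {
        "conversations": [
            {"from": "human", "value": "<image>\n" + boxes_to_prompt(boxes) + "\n" + question},
            {"from": "gpt", "value": ""},  # answer copied from the original dataset
        ]
    }

boxes = [{"xywh": [0.12, 0.30, 0.25, 0.20]}, {"xywh": [0.55, 0.10, 0.18, 0.18]}]
print(build_sample("What is the area of the shaded region?", boxes))
```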

We deeply value the opportunity to engage with the review process and clarify any concerns. We hope the additional evidence and explanations provided in our rebuttal address the key points and demonstrate the significance of our work. We hope reviewers are still willing to give us a chance.

Thank you for your understanding, and we appreciate your time and consideration.


Kind regards,
Authors

Comment

Esteemed AC and reviewers,

Thank you for your valuable feedback and thoughtful suggestions. We deeply appreciate the time and effort you have dedicated to reviewing our work.

In our previous response, we addressed concerns raised in the reviews and provided detailed explanations to clarify various aspects of our work. However, to ensure clarity and emphasize the core contributions of our work, we would like to briefly summarize the main points here:

  • We systematically identify and analyze the impact of visual recognition errors on the mathematical reasoning performance of MLLMs, highlighting the critical role of accurately perceiving geometric primitives. This new aspect is orthogonal to existing methods focused on improving reasoning.

  • We designed GeoGLIP, a lightweight, geometry-aware visual model trained with multitask objectives, including shape grounding, junction detection, and boundary detection. GeoGLIP integrates seamlessly with diverse LLM backbones without requiring modifications to their reasoning components. Despite adding less than 50MB to the model size and only 0.24s to the per-image inference time, and without relying on additional mathematical instruction datasets, our approach improves top-1 accuracy by 8–12% over the baseline (using LLaMA2-7B as the base LLM).

  • When paired with advanced LLMs such as DeepSeek and Qwen, our 7B model achieves performance comparable to GPT-4V, with 51.3% and 48.7% on the challenging MathVista benchmark, versus 49.9% for GPT-4V. While our 7B model does not surpass state-of-the-art MLLMs with 40B/70B parameters, which achieve over 60% accuracy, integrating GeoGLIP into such large-scale LLMs is currently computationally prohibitive given our limited resources.

  • We hope this work inspires further research into more effective fine-grained visual understanding in MLLMs. To support the community and assist other researchers in scaling our method to larger models and datasets, we will release the model weights, training scripts, and inference codes to facilitate broader adoption and experimentation.


Kind regards,
Authors

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.