Thanks for the constructive suggestions. The detailed responses to the reviewer's comments are as follows.

Q1: Some statements could be misunderstood and the paper would benefit from a more precise formulation.

Thank you for pointing out the inaccurate statements. We will carefully check the manuscript to modify or remove the statements in the revised version.

Q2: Comparison and discussion to task-specific solvers

We conducted additional experiments on the task-specific solvers and multimodal LLMs GPT-4V and GPT-4o. Since UCGS-T can handle both selective and generative problems, we compare it to generative task-specific solvers PrAE [1], NVSA [2], GCA [3], ALANS [4] and RAISE [5] that support answer selection and generation. The task-specific solvers are trained separately on each dataset without the supervision of rule and attribute annotations to ensure consistency with the experimental setup of UCGS-T.

Models	RAVEN	PGM
PrAE [1]	13.6	-
NVSA [2]	11.5	-
GCA [3]	37.3	31.7
ALANS [4]	50.1	-
RAISE [5]	54.5	14.0
GPT-4V	13.8	-
GPT-4o	19.2	-
GPT-4o + Language Descriptions	38.8	-
UCGS-T (Ours)	64.6	38.1

Experimental results show that UCGS-T outperforms generative task-specific solvers under the same setup. PrAE and ALANS cannot solve PGM problems since they only define visual concepts and abstract rules of RAVEN. Without additional annotations, task-specific solvers suffer significant performance drops, e.g., PrAE’s accuracy drops from 65.0% (reported by [1]) to 13.6% in our experiments. We exhibit the results of GPT-4V and GPT-4o on RAVEN from [6], where both models show lower accuracy than UCGS-T, ALANS and RAISE. The multimodal LLMs are not designed for abstract visual reasoning, therefore solving such tasks may require further prompt engineering and chain-of-thought design.

[1] Abstract spatial-temporal reasoning via probabilistic abduction and execution.

[2] A neuro-vector-symbolic architecture for solving Raven’s progressive matrices.

[3] Generating correct answers for progressive matrices intelligence tests.

[4] Learning algebraic representation for systematic generalization in abstract reasoning.

[5] Towards Generative Abstract Reasoning: Completing Raven’s Progressive Matrix via Rule Abstraction and Selection.

[6] What is the visual cognition gap between humans and multimodal llms?

Q3: Doubts regarding the replacement of the last image in Proposition 3.6

SVRT provides an image set for each rule containing different images generated by computer programs that follow the rule. Each image is assigned to only one panel of Bongard Problems (BPs). If the remaining images of a set are insufficient to form a complete panel, they will be dropped. Therefore, there are no duplicate images between BP panels even if one image is removed from the left and right problem panels.

Q4: The main paper is missing descriptions and references for the G1-set and SVRT datasets, while the supplementary material provides these.

Thank you for the suggestion. We will add the descriptions and references in the first paragraph of Experiments.

Q5: Is the model training of VQVAE and transformer done end-to-end or as usual one after another?

VQVAE is pretrained before training the remaining modules. We leave the image reconstruction loss in Eq. 9 to make it possible to finetune VQVAE end-to-end in the training stage. But we find that setting \lambda to 0 (i.e., freezing the parameters of VQVAE) is the best choice. Please refer to Appendix B.1 for the detailed descriptions.

Q6: Regarding the lack of clarity w.r.t. the used terms

In this paper, tasks refer to different abstract visual reasoning tasks such as RPM and BP. Problems refer to individual problem instances. Definitions describe the form of different tasks. Visual concepts refer to image attributes like object size and color. Rules are changing patterns on attributes (e.g., progressive changes). Thanks for the helpful suggestion to make the terms clear. We will introduce the terms at the beginning of the method section.

Q7: The motivation of including an object-centric baseline

Some task-specific solvers [1, 2] have realized abstract visual reasoning with object-centric representations. Therefore, we adopt it as one of the typical approaches to explore the performance of different visual representations in UCGS.

[1] Learning to reason over visual objects.

[2] Systematic visual reasoning through object-centric relational abstraction.

Q8: Other comments or suggestions about writing

Thanks for the constructive comments. In the revised version, we will clarify that the baselines are ablations on model architecture when introducing the baselines. And we will also remove the repeated parts (Lines 64-70 and the first paragraph in Section 3.1) in the manuscript.