Thank you for your thorough review and constructive feedback on our manuscript. We appreciate the time and effort you have invested in evaluating our work. Your detailed comments and suggestions have been invaluable in helping us improve the paper. Below, we provide point-by-point responses to address your concerns and questions.

Q1 Details of Graph Reasoner

Details of CoT

The prompt used by the Graph Reasoner is shown in the supplementary materials. We will add more details on the step-by-step CoT in the revised version.

Add Graph into Diffusion Blocks

We convert the predicted articulation graph into a binary adjacency matrix $A \in \{0,1\}^{N \times N}$, where $N$ is the number of predicted parts. Each entry $A_{ij} = 1$ indicates the existence of a valid articulated connection from part $i$ to part $j$.

This matrix is used as a structural prior to guide the attention computation in the diffusion model. Specifically, given the latent feature matrix $X \in \mathbb{R}^{N \times d}$, the attention is computed as:

$$\mathrm{Attn}(X) = \mathrm{softmax}\left(\frac{(X W_Q)(X W_K)^\top}{\sqrt{d}} + \log(A + \epsilon)\right) X W_V$$

where $W_Q, W_K, W_V$ are learnable projection matrices, and $\epsilon$ is a small constant to avoid taking the logarithm of zero. The term $\log(A + \epsilon)$ serves as a structural attention bias, encouraging the model to focus on physically valid connections during generation.
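To make the role of the bias concrete, the following is a minimal NumPy sketch of single-head attention with a log-adjacency bias. It is an illustration under stated assumptions, not our training code: the function names, the value of `eps`, the added self-loops, and the toy dimensions are all hypothetical choices made for the example.

```python
import numpy as np

def adjacency_from_edges(num_parts, edges, self_loops=True):
    """Build a binary adjacency matrix from directed (i, j) part connections.

    self_loops=True is an assumption of this sketch: it keeps each part able
    to attend to itself, so no row of the bias is uniformly suppressed.
    """
    A = np.zeros((num_parts, num_parts), dtype=np.float64)
    for i, j in edges:
        A[i, j] = 1.0
    if self_loops:
        idx = np.arange(num_parts)
        A[idx, idx] = 1.0
    return A

def graph_biased_attention(X, A, W_q, W_k, W_v, eps=1e-6):
    """Scaled dot-product attention with log(A + eps) added to the logits.

    eps=1e-6 is a hypothetical value chosen only for this example.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Bias is about 0 where A == 1 and strongly negative where A == 0.
    scores = scores + np.log(A + eps)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 4 parts, base part 0 connected to parts 1 and 2, part 2 to part 3.
rng = np.random.default_rng(0)
N, d = 4, 8
A = adjacency_from_edges(N, [(0, 1), (0, 2), (2, 3)])
X = rng.standard_normal((N, d))
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out, weights = graph_biased_attention(X, A, W_q, W_k, W_v)
```

Because the bias is added before the softmax, pairs with no connection are not hard-masked; they simply receive attention weights on the order of `eps` relative to connected pairs.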

Details of LLM

We access GPT-4o through its API in the Graph Reasoner. GPT-4o has about 200B parameters according to the official report.

Q2 Generalization Ability of DIPO

Experimental Settings

Our experiments cover seven diverse categories from the PartNet-Mobility (PM) dataset: Storage Furniture, Table, Refrigerator, Dishwasher, Oven, Washer, and Microwave. We select these categories to follow the experimental setting of SINGAPO [?] for a fair comparison. Neither our diffusion model nor our Graph Reasoner makes strong category-specific assumptions.

In addition, the ACD dataset provides a finer-grained categorization of object classes. The detailed mapping between PM and ACD categories used in our evaluation is shown below.

PM Category	ACD Category
StorageFurniture	HangingCabinet
	Cabinet
	Armoire
	ChestOfDrawers
	KitchenCabinet
	Bookcase
	TvStand
	SinkCabinet
Table	Table
	Desk
	Nightstand
Refrigerator	Refrigerator
Oven	Oven
Dishwasher	Dishwasher
Washer	Washer
Microwave	Microwave
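For readers who want to reproduce the category alignment, the mapping above can be written as a simple lookup table. This is only a sketch: the names `PM_TO_ACD`, `ACD_TO_PM`, and `to_pm_category` are hypothetical and are not identifiers from our code.

```python
# PM category -> list of ACD categories, transcribed from the table above.
PM_TO_ACD = {
    "StorageFurniture": [
        "HangingCabinet", "Cabinet", "Armoire", "ChestOfDrawers",
        "KitchenCabinet", "Bookcase", "TvStand", "SinkCabinet",
    ],
    "Table": ["Table", "Desk", "Nightstand"],
    "Refrigerator": ["Refrigerator"],
    "Oven": ["Oven"],
    "Dishwasher": ["Dishwasher"],
    "Washer": ["Washer"],
    "Microwave": ["Microwave"],
}

# Inverted view: each ACD category belongs to exactly one PM category.
ACD_TO_PM = {acd: pm for pm, acds in PM_TO_ACD.items() for acd in acds}

def to_pm_category(acd_category: str) -> str:
    """Return the PM category for an ACD category, or raise if it is unmapped."""
    try:
        return ACD_TO_PM[acd_category]
    except KeyError:
        raise ValueError(f"No PM category mapped for ACD category {acd_category!r}")
```

Raising on an unmapped category, rather than returning a default, makes it obvious when an ACD asset falls outside the seven evaluated categories.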

Complex Objects

Our method may still fail on extremely complex objects such as robots. Moreover, assets such as keyboards, which contain a large number of visually similar and spatially redundant parts, pose challenges for feature extraction with DINOv2, as the part-level features may not be sufficiently discriminative. A promising direction to address this is to incorporate segmentation-aware priors. In particular, we plan to explore integrating SAM-derived segmentation masks to guide part proposal and filtering, enabling more robust articulation reasoning for such densely repetitive structures.

Occlusion and Unchanging Appearance in Dual-State Images

In cases where the articulated-state image provides limited additional information due to occlusion (e.g., a dishwasher tray) or minimal appearance change (e.g., a globe with uniform texture), the dual-state input may not fully reveal articulation differences. However, our input setting inherently contains at least as much geometric and structural information as single-image methods like SINGAPO, ensuring a comparable performance lower bound.

To better handle occlusion, we consider incorporating sparse multi-view images (e.g., 2–3 views) as a future extension, which can alleviate visibility issues without incurring significant data collection cost.

In addition, nearly identical dual-state images can still provide useful priors. For example, spherical objects like globes often imply rotation around a central axis, which can guide plausible articulation inference.

Q3 Impact of Partially Opened Articulated Images on Joint Range

Our method remains robust to partially opened articulated-state inputs: the predicted joint type and axis remain stable. However, incomplete articulation may affect the estimated joint range, since the model learns the range from the observed displacement between the two states.

For practical use, we recommend using fully opened articulated images to better reveal the maximum joint extent and improve articulation estimation.

To assess the effect of articulation coverage, we additionally render a variant of the test set in which the articulated-state poses are randomly sampled rather than fully opened, and evaluate our model on this setting. The results are reported in the table below.

Input Setting	RS-d_gIoU	AS-d_gIoU	RS-d_cDist	AS-d_cDist
Partially Opened	0.4987	0.5084	0.0421	0.1031
Fully Opened	0.4561	0.4683	0.0359	0.0732
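The gap between the two settings is easier to read as a relative change per column. The short sketch below computes it from the numbers in the table; columns are keyed by position rather than by metric name, and the variable names are illustrative only.

```python
# Values transcribed from the table above, in column order.
fully_opened = [0.4561, 0.4683, 0.0359, 0.0732]
partially_opened = [0.4987, 0.5084, 0.0421, 0.1031]

# Relative change of the partially opened setting with respect to fully opened.
relative_change = [
    (partial - full) / full
    for partial, full in zip(partially_opened, fully_opened)
]

for column, change in enumerate(relative_change, start=1):
    print(f"column {column}: {change:+.1%}")
```

The partially opened values are higher in all four columns, with the largest relative change in the last column.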

Q4 Training and Inference Time

Training our full pipeline on eight NVIDIA RTX 4090 GPUs takes approximately 23 hours.

During inference, we use a pair of images as input and predict an articulated object with 5 parts. The time consumption of each stage in the inference pipeline is listed in the table below.

Stage	Time (s)
DINO Feature Extraction	2.63
Graph Reasoner	12.62
Diffusion Model	3.54
Retrieval	8.97
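Summing the stages gives a total of about 27.76 s per object for this 5-part example. The sketch below reproduces that sum and the share of each stage from the table; the variable names are illustrative.

```python
# Per-stage inference time in seconds, transcribed from the table above.
stage_seconds = {
    "DINO Feature Extraction": 2.63,
    "Graph Reasoner": 12.62,
    "Diffusion Model": 3.54,
    "Retrieval": 8.97,
}

total_seconds = sum(stage_seconds.values())
shares = {stage: t / total_seconds for stage, t in stage_seconds.items()}

for stage, share in shares.items():
    print(f"{stage}: {stage_seconds[stage]:.2f} s ({share:.1%})")
print(f"Total: {total_seconds:.2f} s")
```

The Graph Reasoner, which calls the GPT-4o API, accounts for the largest share of the total, followed by retrieval.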