RoomEditor: High-Fidelity Furniture Synthesis with Parameter-Sharing U-Net
Abstract
Reviews and Discussion
The authors introduce a new dataset and propose a model specifically tailored for the furniture synthesis task. By training their model using a shared feature extractor on this dataset, they are able to generate realistic composite images. Experimental results show that the proposed method outperforms existing approaches.
Strengths and Weaknesses
One of the strengths of this work lies in the collection of a paired dataset comprising background images and corresponding reference images of furniture, along with annotated masks. The authors propose using a shared feature extractor for both the reference and background images, which contributes to realistic image synthesis. Although the model is trained specifically on furniture data, it has the potential to generalize to other object categories.
However, the overall contribution of this paper appears limited. The proposed model architecture builds on [7], and the primary improvement—introducing a shared feature extractor—while effective, is relatively incremental. Although the authors discuss the advantages of RoomEditor in Section 4.2 and compare it with existing methods, the theoretical analysis in Section 4.2.1 is somewhat unclear and does not convincingly highlight the benefits of the approach. Moreover, the differences between RoomEditor and [7] are not particularly substantial. In addition, there are concerns regarding the experimental evaluation (see below).
Questions
Major:
- Eq. 5 is confusing. The diffusion model does not take the ground truth image as input. Based on my understanding, it should be . This makes the derivation of Eq. 6 and 7 difficult to follow.
- For the experiments, in the training of MimicBrush, is the reference U-Net kept fixed or is it updated? A comparison with a model where the reference U-Net is not fixed would be more informative, as a trainable reference U-Net could learn features that align better with the imitative U-Net. Qualitative results for this variant would also be valuable.
- The motivation for incorporating CLIP in the ablation study is unclear. RoomEditor already includes an image encoder—why is an additional encoder (CLIP) necessary, and how is it integrated with the existing architecture?
- The metric Fidelity Rank in Table 2 is not clearly defined in the paper.
- For the user study, the background of the annotators is not clearly described, which could influence the subjective results. Additionally, the study includes only 10 annotators, which seems insufficient for drawing reliable conclusions.
- The claim made in Line 80 is not entirely accurate. Virtual try-on methods must also model the interaction between garments and the human body to ensure coherent integration. Ignoring this interaction often leads to artifacts or inconsistencies in the generated images.
Minor:
- The term "inconsistency" in Line 119 is vague—please clarify what kind of inconsistency is being referred to.
Limitations
yes
Final Justification
Considering the limited contribution of this work, I keep my rating.
Formatting Issues
no
Mathematical derivation
Thank you for raising this question. We agree that, in general, diffusion models do not take ground truth images as input. However, for inpainting tasks, it is common practice to provide the masked image (in our case, the masked background $\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}}$) as input, so that the model can preserve the content in the unmasked regions. Since the model generates content over the entire image, this implies that a portion of the ground truth (i.e., the unmasked area) is effectively included in the input. Therefore, a characteristic of inpainting diffusion models is that their predictions in the unmasked regions can be considered as ground truth.
Equation (5): Based on this understanding, we consider two different model inputs. (1) By masking the object in the ground truth $\boldsymbol{I}_{\mathrm{gt}}$ to get the masked background $\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}}$ and applying the aforementioned property, we find the model's prediction for the background region is essentially identical to that in $\boldsymbol{I}_{\mathrm{gt}}$; (2) by masking the background in $\boldsymbol{I}_{\mathrm{gt}}$ to retain only the object (the object-only input defined in our paper) and applying the same property, we find the prediction for the object region is also essentially identical to that in $\boldsymbol{I}_{\mathrm{gt}}$. Combining the masked outputs of these two ideal predictions would yield the ground truth $\boldsymbol{I}_{\mathrm{gt}}$.
Equation (6): It generalizes the principle of Equation (5) from the pixel level to the feature level. This generalization is based on a key assumption regarding the U-Net's architecture, for which the rationale is two-fold: (a) convolutional and MLP layers operate locally, and (b) cross-attention mechanisms are primarily designed to inject object features into the masked region, thus having minimal impact on the unmasked background features. Based on this assumption, we posit that the features at locations corresponding to the masked background input and those corresponding to the object-only input can be combined to approximate the features of the complete ground-truth image $\boldsymbol{I}_{\mathrm{gt}}$.
Equation (7): It provides a high-level conceptual model of the reference encoding process. It assumes the existence of an ideal transformation, $\mathcal{R}$, that maps the features of the input reference image to the features that would have been generated if the ground-truth object were present. In practice, different models learn different approximations of this mapping (see the discussion in Section 4.2.2 Comparison with Previous Works, lines 194-213). Our formulation with parameter sharing is simply a novel and efficient way to learn this transformation.
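Schematically, reusing the notation of the verification loss given later in this thread, the three relations read roughly as follows. This is only a sketch: the object-only input $\boldsymbol{I}_{\mathrm{obj}}^{\mathrm{M}}$ and the feature map $F_\theta$ are placeholder symbols and may differ from the paper's exact notation.

```latex
% Schematic restatement only: I_obj^M (object-only masked input) and F_theta
% (intermediate U-Net features) are assumed placeholders, not necessarily the
% paper's exact symbols.
\begin{align*}
\epsilon_\theta(\boldsymbol{I}_{\mathrm{gt}})
  &\approx \boldsymbol{M}_{\mathrm{bg}} \odot \epsilon_\theta(\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}})
   + \overline{\boldsymbol{M}_{\mathrm{bg}}} \odot \epsilon_\theta(\boldsymbol{I}_{\mathrm{obj}}^{\mathrm{M}})
   && \text{(cf. Eq. (5), pixel level)} \\
F_\theta(\boldsymbol{I}_{\mathrm{gt}})
  &\approx \boldsymbol{M}_{\mathrm{bg}} \odot F_\theta(\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}})
   + \overline{\boldsymbol{M}_{\mathrm{bg}}} \odot F_\theta(\boldsymbol{I}_{\mathrm{obj}}^{\mathrm{M}})
   && \text{(cf. Eq. (6), feature level)} \\
F_\theta(\boldsymbol{I}_{\mathrm{obj}}^{\mathrm{M}})
  &\approx \mathcal{R}\big(F_\theta(\boldsymbol{I}_{\mathrm{ref}})\big)
   && \text{(cf. Eq. (7), ideal reference mapping)}
\end{align*}
```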
Trainable Reference U-Net
- Is the reference U-Net kept fixed or is it updated: The reference U-Net is kept frozen (to remain consistent with MimicBrush).
- Trainable reference U-Net: This is an insightful point, and we have actually explored it in our ablation study. As shown in Table 4, “Dual U-Net (Frozen g) + CLIP” corresponds to MimicBrush, while “Dual U-Net (Trainable g) + CLIP” represents the variant suggested by the reviewer. The results show that training the encoder leads to some performance improvement. However, it still falls short of the performance achieved by using a shared encoder for both inputs. This underscores the effectiveness of our design choice to adopt a single encoder in terms of both simplicity and performance.
- Qualitative results of trainable reference U-Net: We have not included the relevant qualitative results in the current version of the paper, but we will add them along with corresponding analysis in the revision. In our observations, although the overall harmony between the background and object is relatively high, using two separate encoders often leads to noticeable inconsistencies between the generated object and the reference in terms of color, shape, and pattern.
The Motivation for Incorporating CLIP in the Ablation Study
Thank you for this question. As noted in lines 157–159, mainstream dual U-Net architectures (such as MimicBrush) typically incorporate an additional image encoder. In contrast, our position is that this additional encoder can be removed, as its functionality largely overlaps with the reference U-Net, making it redundant. We validate this claim through ablation studies demonstrating its dispensability (see Table 4).
The metric Fidelity Rank in Table 2
Thank you for pointing this out. Fidelity refers to the identity consistency between the reference object and the generated object in terms of color, shape, pattern, and other visual attributes. We will clarify this in Section 5.2.2 (User Study) of the revised version.
Number and Background of User Study
- Number of participants: Following the protocol in the MimicBrush paper [3], we initially conducted a user study with 10 participants for consistency. To enhance the reliability of our results, we later expanded the study to 25 participants in total.
- Background of participants: The participants consisted of 14 undergraduate students, 7 parents of some of the students, and 4 university faculty or staff members, all of whom were informed of the specific evaluation criteria.
- Updated results: The findings remain consistent—our method outperforms existing approaches across all evaluated metrics. These updated results will be included in the revised version.
| Method | Fidelity Best (%) | Fidelity Rank↓ | Harmony Best (%) | Harmony Rank↓ | Quality Best (%) | Quality Rank↓ |
|---|---|---|---|---|---|---|
| PBE [1] | 5.1 | 3.36 | 18.0 | 2.85 | 4.5 | 3.34 |
| AnyDoor [2] | 8.9 | 2.96 | 7.6 | 3.16 | 3.5 | 3.20 |
| MimicBrush [3] | 29.4 | 2.11 | 30.5 | 2.16 | 38.3 | 1.80 |
| RoomEditor (Ours) | 56.6 | 1.57 | 43.9 | 1.82 | 53.6 | 1.66 |
Response to the Comment on Virtual Try-on Methods
We appreciate the reviewer’s comment. Virtual try-on indeed involves modeling complex garment-body interactions. However, our task differs in that furniture synthesis requires handling diverse furniture shapes, whereas clothing in virtual try-on applications typically exhibits more consistent shapes. We will revise the relevant statement in the manuscript to better clarify these distinctions.
Response to the Comment on Inconsistency (Line 119)
Thank you for raising this point. Here, "inconsistency" refers to discrepancies between the object or color in the reference image and those in the background image—in other words, they do not depict the same object. For example, in Appendix Figure 5, the left image in the second row (background image) and the right image in the same row (reference image) clearly correspond to different objects. Such cases are quite common in practice, as a single product may have multiple versions or color variants.
References
[1] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
[2] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
[3] Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation. Advances in Neural Information Processing Systems, 37:84010–84032, 2025.
Thanks for the rebuttal to clarify some points of my questions. However, I still believe the contribution is limited, primarily focused on a new dataset.
Dear Reviewer,
I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.
Thank you for your time and effort in reviewing our paper.
We sincerely thank the reviewer for the fast response, and we apologize for not providing a clearer explanation of our contribution earlier. We also appreciate the recognition of our benchmark contribution; we agree that a comprehensive benchmark is crucial for advancing the field.
Architectural Differences: Methods like MimicBrush [7] typically employ dual U-Nets and rely on additional components such as the CLIP visual encoder. In contrast, our architecture is more concise: (1) we use a single U-Net shared for both reference and background encoding; and (2) we eliminate the need for CLIP or other extra modules. We also validate through ablation studies (Table 4) that the contribution of CLIP in existing U-Net architectures is marginal, which further confirms the effectiveness of our extremely concise design. As a result, this simplification significantly reduces model complexity and computational overhead (see the table below), while greatly improving performance (see Table 1 and Table 3).
| Models | Params (M) | Mem(GB) | Speed (sample / min) |
|---|---|---|---|
| Mimicbrush [7] | 2432 | 6.9 | 10.7 |
| Ours | 943 | 3.2 | 15.8 |
Insight and Contribution: Beyond the dataset itself, one of our key contributions is the theoretical modeling of semantic image synthesis, through which we identify a fundamental issue in current SOTA dual U-Net designs, namely feature inconsistency, which greatly reduces image fidelity. We address this by instantiating the components of our formulation via cross-attention and parameter sharing (Section 4.2), leading to a shared-parameter U-Net architecture that effectively mitigates the inconsistency. Experimental results show that our architecture design generalizes well to generic objects, as also recognized by the reviewer. We believe this finding provides valuable insights for the image editing field, and that our concise, efficient, and theoretically supported architecture with good generalization ability aligns with an important trend in current AI algorithms.
Many thanks again for your fast reply.
This paper proposes the RoomEditor method, which employs a parameter-shared dual U-Net architecture to enforce feature consistency learning. This addresses the feature misalignment issue caused by independent processing in traditional dual-branch architectures, enabling high-fidelity virtual furniture synthesis. Additionally, they construct the RoomBench benchmark dataset and conduct experiments to validate the effectiveness and generalization capability of the RoomEditor method.
Strengths and Weaknesses
Strengths:
● The core motivation of the paper is clearly written and easy to understand
● High-fidelity virtual furniture synthesis is a useful task for many applications
● The authors validate the generalization ability of RoomEditor in scenes beyond indoor environments
Weaknesses:
● RoomEditor still relies on high-quality predefined mask, and the effect of using low-quality masks has not been investigated in experiments, which somewhat restricts its practical applicability.
● The related methods presented in the paper are insufficient, as several advanced image editing approaches (e.g., IN-CONTEXT LORA) and relevant furniture synthesis models (e.g., HomeDiffusion) are neither introduced nor compared.
● The test set may implicitly overlap in distribution with the training set, and cross-dataset evaluation (e.g., on 3D-FRONT) has not been conducted to validate its generalization ability.
● When comparing to other methods, detailed data on the computational resources and time consumption for each approach are missing.
Questions
Does RoomEditor support multi-view images as input?
Given that the generated objects have high fidelity, does it also imply higher requirements for the reference images? For instance, if the reference images contain watermarks or mosaics, how does it affect the generation results?
Limitations
yes
Final Justification
Although the GPT-4o comparison in the rebuttal raises some concerns about the metrics, the additional experiments and materials provided strengthen the paper. I therefore raise my final score to 'Borderline accept'.
Formatting Issues
no
Robustness to Mask Quality
We thank the reviewer for raising this important point. We fully agree that a model's performance with low-quality masks is a critical measure of its practical utility, and we have in fact discussed this specific issue in our work. Our methodology reflects this discussion in two key stages:
- Robustness in Training: To better simulate real-world user scenarios, our training process (detailed in Appendix B, lines 482-484) intentionally uses imperfect inputs. This includes using random perturbations and simple bounding boxes to make the model robust against imprecise segmentation (see Figure 6).
- Realistic Evaluation: Similarly, to simulate real-world scenarios, we instructed annotators to create intentionally coarse masks for the test set (as stated on lines 484-486). Consequently, all results reported in our paper are based on these coarse-grained annotations, not high-quality segmentations, with examples illustrated in Figure 7(b).
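As an illustration of this augmentation, below is a minimal sketch of the kind of mask degradation described above (random dilation plus occasional bounding-box replacement); the probability and kernel-size values are illustrative, not the exact settings of Appendix B.

```python
import cv2
import numpy as np

def degrade_mask(mask: np.ndarray, p_bbox: float = 0.3, max_dilate: int = 25) -> np.ndarray:
    """Simulate coarse user masks: randomly dilate the segmentation mask and,
    with probability p_bbox, replace it with its bounding box.
    p_bbox and max_dilate are illustrative values, not the paper's settings."""
    mask = (mask > 0).astype(np.uint8)
    if np.random.rand() < p_bbox:
        ys, xs = np.nonzero(mask)
        coarse = np.zeros_like(mask)
        if len(xs) > 0:
            coarse[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
        return coarse
    k = np.random.randint(1, max_dilate)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * k + 1, 2 * k + 1))
    return cv2.dilate(mask, kernel, iterations=1)
```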
To further evaluate the robustness of our model, we added an experiment on RoomBench using bounding boxes as masks, without any shape prior. The results are shown below.
| Models | FID (↓) | SSIM (↑) | PSNR (↑) | LPIPS (↓) | CLIP (↑) | DINO (↑) |
|---|---|---|---|---|---|---|
| MimicBrush | 24.99 | 0.761 | 18.42 | 0.135 | 87.06 | 75.13 |
| Ours | 21.16 | 0.766 | 19.83 | 0.120 | 88.90 | 81.46 |
Despite the lack of shape guidance, our model still outperforms MimicBrush, which has been trained on large-scale data and is known for its robustness. This provides strong evidence that our method is highly robust to mask variations.
Insufficient Related Methods
We sincerely thank the reviewer for pointing out these advanced approaches. In our revised paper, we will add a detailed discussion of both In-Context LoRA [1] and HomeDiffusion [2] to the related work section, summarizing their core contributions as follows:
- In-Context LoRA leverages in-context learning to generate thematically consistent images from a few visual examples at inference time.
- HomeDiffusion employs a two-stage strategy, first learning generation from multi-view references and then focusing on object-background integration.
However, a direct quantitative comparison with these methods is unfortunately infeasible. In-Context LoRA's task formulation differs significantly from ours, as it does not take two images (background and reference) as input or support precise object placement. While HomeDiffusion is more related to our task, its code and benchmark are not publicly available, precluding a fair comparison.
To further strengthen our evaluation and address the reviewer's valid concern about comparing with SOTA models, we selected the commercial, closed-source model GPT-4o, which supports multi-image inputs and demonstrates impressive image editing capabilities. Due to cost and time constraints, we randomly sampled 200 examples from the RoomBench dataset and compared GPT-4o with our method and MimicBrush [3] on this subset.
| Methods | CLIP-score (↑) | DINO-score (↑) |
|---|---|---|
| Mimicbrush | 79.49 | |
| GPT-4o | 87.80 | |
| Ours | 90.42 | 85.12 |
Our qualitative analysis reveals that while GPT-4o often produces coherent images with realistic lighting, it struggles with preserving object fidelity, frequently altering the shape and design of the reference object. In contrast, our method achieves superior fidelity, which is reflected in its higher CLIP-score and DINO-score. We will include this comparison in the revised version.
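For clarity on how such similarity scores are typically computed, below is a minimal sketch: cosine similarity (scaled by 100) between image embeddings of the generated result and the reference, using an off-the-shelf CLIP vision encoder. The checkpoint name and the cropping/preprocessing protocol are assumptions; the DINO-score is computed analogously with a DINO ViT backbone.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Cosine similarity (x100) between image embeddings of the generated result and
# the reference object. Checkpoint and preprocessing are illustrative.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_score(generated: Image.Image, reference: Image.Image) -> float:
    inputs = processor(images=[generated, reference], return_tensors="pt")
    feats = F.normalize(model.get_image_features(**inputs), dim=-1)  # (2, d)
    return 100.0 * float((feats[0] * feats[1]).sum())                # cosine similarity x 100
```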
On Generalization and Cross-Dataset Evaluation
Thank you for raising this important point. We address these concerns with existing clarifications and new experiments as follows:
- Strict Data Separation: To prevent data leakage between training and testing sets, we ensured a strict separation of data sources during the collection of our RoomBench dataset. We will explicitly state this protocol in the revised paper to ensure clarity on the validity of our results.
- Existing Generalization Evidence: We would like to gently point out that our original paper already demonstrated strong generalization to non-furniture objects through experiments on the DreamBooth dataset [4] (Table 3 and Section 5.3, line 277). This experiment shows that our model, although not trained on generic data, outperforms models trained on large-scale generic datasets. This demonstrates the strong generalization capability of our approach.
- New Cross-Dataset Evaluation on 3D-FUTURE: In addition, in response to the reviewer's suggestion of testing on alternative furniture composition datasets, we evaluated our model on the 3D-FUTURE [5] dataset. We selected 1,020 samples covering 34 furniture categories (30 random images per category). Since the dataset does not provide explicit image pairs, we applied horizontal flipping to reference objects and blurred the masks to avoid trivial copy-paste solutions. The results are as follows:
| Metric | FID (↓) | SSIM (↑) | PSNR (↑) | LPIPS (↓) | CLIP (↑) | DINO (↑) |
|---|---|---|---|---|---|---|
| Mimicbrush | 14.20 | 0.6595 | 20.64 | 0.272 | 80.10 | 64.61 |
| Ours | 13.91 | 0.6581 | 20.93 | 0.260 | 80.28 | 66.97 |
Although our model was not trained on 3D-FUTURE, it achieves consistently better performance than Mimicbrush across multiple metrics, highlighting its strong generalization ability. We will incorporate all these clarifications and new results into our revised paper.
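For concreteness, a minimal sketch of this preprocessing, assuming PIL; the blur radius is illustrative rather than the exact value we used:

```python
from PIL import Image, ImageFilter, ImageOps

def make_eval_pair(reference: Image.Image, mask: Image.Image, blur_radius: float = 15.0):
    """Horizontally flip the reference object and blur the mask so that a trivial
    copy-paste of the reference cannot reproduce the target; blur_radius is illustrative."""
    flipped_ref = ImageOps.mirror(reference)
    soft_mask = mask.convert("L").filter(ImageFilter.GaussianBlur(radius=blur_radius))
    return flipped_ref, soft_mask
```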
Computational Resources and Time Consumption
Thank you for raising this question. We report the total number of parameters and trainable parameters for different architectures in Table 4: Ablation Study. To enable a more comprehensive evaluation—including speed and memory consumption—we provide a comparative summary in the following table. All experiments were conducted using a single A6000 GPU in FP16 mode on our test set, with single-image inference and 50 denoising steps.
| Models | Params (M) | GPU Mem (GB) | Speed (samples / min) |
|---|---|---|---|
| PBE | 14.5 | ||
| AnyDoor | 2451 | 7.3 | |
| Mimicbrush | 2432 | 6.9 | 10.7 |
| Ours | 943 | 3.2 | 15.8 |
As shown, our method requires significantly fewer parameters and less GPU memory compared to Mimicbrush, while also achieving faster generation speed.
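For reference, below is a minimal sketch of how such numbers are typically collected with PyTorch on a single GPU. The `model` and `run_inference` arguments are placeholders for each method's weights and sampling call, not APIs from any specific codebase; the 50-step, FP16, single-image setting mirrors the description above.

```python
import time
import torch

@torch.no_grad()
def profile(model: torch.nn.Module, run_inference, n_runs: int = 20):
    """model: a module holding all weights of a method (placeholder);
    run_inference: a zero-argument callable that generates one sample with
    50 denoising steps in FP16 (placeholder for each method's pipeline call)."""
    n_params_m = sum(p.numel() for p in model.parameters()) / 1e6   # Params (M)
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(n_runs):
        run_inference()
    torch.cuda.synchronize()
    elapsed_min = (time.time() - start) / 60.0
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3     # GPU Mem (GB)
    return n_params_m, peak_mem_gb, n_runs / elapsed_min            # Speed (samples/min)
```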
Multi-view images
Thank you for this valuable question. Our model currently operates on single-view images. While we certainly acknowledge the potential of multi-view inputs, this direction was not prioritized in the current scope due to our specific research focus and challenges in data availability. We will ensure this is highlighted as a promising avenue for future research in our revised conclusion.
Quality of reference images
This is a practical and important concern. Our experiments show that when the reference object includes watermarks or mosaic artifacts, the model tends to perceive them as integral parts of the object and thus fails to remove them during generation. Improving robustness to such artifacts is a valuable direction we plan to explore in future work.
References
[1] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024.
[2] Guoqiu Li, Jin Song, and Yiyun Fei. Homediffusion: Zero-shot object customization with multi-view representation learning for indoor scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4698–4706, 2025.
[3] Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation. Advances in Neural Information Processing Systems, 37:84010–84032, 2025.
[4] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
[5] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
The authors' detailed explanation and supplementary experiments addressed my concerns. However, the comparison with GPT-4o raised questions about the validity of the evaluation metrics, considering that GPT-4o is a large-parameter model and has demonstrated impressive subject fidelity and editing plausibility in image editing task. I will take the rebuttal into account in my final rating.
We are pleased that our previous response has addressed your concerns. Here, we provide additional clarification regarding the evaluation of GPT-4o.
We acknowledge that image generation is an inherently challenging task, and its evaluation is likewise nontrivial. To ensure a comprehensive and reliable assessment, we employed a broad set of quantitative metrics and supplemented them with a controlled user study. Among these metrics, we place particular emphasis on the CLIP score and DINO score, as these are better suited to measuring object fidelity—one of the key challenges in the image editing field.
From the results of CLIP and DINO, GPT-4o achieves scores comparable to or even exceeding those of MimicBrush, indicating that its overall visual quality and semantic alignment remain strong. However, two limitations explain its weaker performance on other metrics:
- Background preservation — GPT-4o often fails to maintain an unchanged background, leading to background shifts or alterations. This negatively impacts low-level, region-consistency metrics such as LPIPS and FID.
- Fidelity — The generated objects frequently deviate from those in the reference image (e.g., changes in furniture shape or design), which also degrades these same low-level metrics and lowers structural similarity measures such as SSIM and PSNR.
The quantitative results are summarized below. Metrics emphasizing low-level spatial consistency across regions (e.g., FID, LPIPS) show GPT-4o performing significantly worse. In contrast, its CLIP and DINO scores—comparable to or higher than those of MimicBrush—indicate that its overall visual coherence remains strong.
| Method | FID (↓) | SSIM (↑) | PSNR (↑) | LPIPS (↓) | CLIP (↑) | DINO (↑) |
|---|---|---|---|---|---|---|
| MimicBrush | 56.44 | 0.780 | 19.401 | 0.1128 | 88.12 | 79.49 |
| GPT-4o | 72.12 | 0.502 | 13.838 | 0.290 | 87.80 | 83.30 |
| Ours | 44.78 | 0.792 | 21.049 | 0.094 | 90.42 | 85.12 |
Consequently, despite its visually appealing and semantically coherent outputs, GPT-4o does not achieve high performance on objective evaluation metrics that rely on fine-grained spatial and structural consistency.
Dear Reviewer,
I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.
Thank you for your time and effort in reviewing our paper.
This paper proposes a new ready-to-use benchmark, RoomBench, for virtual furniture synthesis to overcome the lack of existing public benchmarks for this task. The paper further proposes a variation on the dual U-Net architecture, RoomEditor, by using parameter sharing between the two branches. The paper discusses the benefits and motivation for applying parameter sharing by mathematically modelling the editing task and relating it to the model. The paper also provides both qualitative and quantitative examples of superior performance of RoomEditor on RoomBench and DreamBooth datasets in terms of FID, SSIM, PSNR, and LPIPS. A user study is also conducted. Finally, an ablation study is conducted assessing integration of common techniques such as CLIP and other U-Net configurations.
Strengths and Weaknesses
Strengths
- Quality: The paper is complete. The paper establishes a diverse set of metrics. The paper demonstrates generalizability by evaluating RoomEditor on the DreamBooth dataset demonstrating a rigorous approach. A user study is also conducted that demonstrates clear preference for RoomEditor amongst real users.
- Clarity: The paper clearly explains the model architecture, while also providing intuition to support their choices by connecting it to approximations of their problem formulation compared to other dual U-Net architectures. Sufficient details are also given to enable reproducibility.
- Significance: The paper proposes an important new benchmark, RoomBench, in a field that has many real-world applications and lacks public datasets. The dataset is diverse in the types of furniture covered. The ablation study conducted shows potential for integration with other existing models, and also future approaches. The model is also general as it shows strong performance on DreamBooth despite only training on RoomBench.
- Originality: The dataset presented is new. The parameter sharing technique appears to be new and justified.
Weaknesses
- Quality: Despite mentioning text-to-image editing models, the paper does not compare these existing models. Moreover, Diffuse-to-choose is mentioned as also focusing on furniture but not compared deeply. The mathematical and technical justification appears to rely on intuitive or heuristic arguments, without sufficient empirical evidence or theoretical guarantees.
- Clarity: There was not sufficient introduction to the models being compared to, i.e., PBE, MimicBrush, and AnyDoor and the justification of choosing these over other models is unclear.
- Significance: Since only a few models are compared to, the impact of the research is slightly diminished, as there are other strong models that could be considered.
- Originality: U-Nets are common in existing works surrounding image synthesis, and the parameter sharing approach is only a minor change to this existing work.
Questions
- Could the authors provide further mathematical justification for equations (5), (6), (7)?
- Could the authors give more justification for the choice of baselines?
Limitations
Yes - limitations are mentioned in the appendix
Final Justification
Taking into consideration, the other reviews and rebuttals, I think the paper still deserves to be accepted.
For me, the authors have answered my questions in detail which has clarified all my raised issues. The contributions are strong, in particular, a new dataset and a new model.
Furthermore, the authors have given reasonable rebuttals to most of the questions of the other reviewers. I assign a high weight to this because it demonstrates that in even more general conditions, their model is able to perform well.
Formatting Issues
None
Baseline Choice
Thank you for the question. The following are relevant explanations.
- Regarding Text-to-Image Models and Diffuse-to-Choose: We sincerely appreciate the suggestion. However, text-to-image models are not directly applicable to our task, which involves generating a new object conditioned on both a masked background and a reference image, input modalities that such models are not inherently designed to handle. As for Diffuse-to-Choose [1], we acknowledge its relevance, but unfortunately, its code, model, and evaluation data are not publicly available at this time, which makes a fair and reproducible comparison currently infeasible.
- Current Baseline Choice: We selected three methods that were the most relevant, representative, and state-of-the-art at the time of our study, each covering a distinct architectural paradigm in example-based image composition.
- PBE [2] (line 97): A Stable Diffusion inpainting model that injects reference via CLIP visual encoder; suffers from information loss due to CLIP.
- AnyDoor [3] (lines 99-100): Uses ControlNet for background consistency and a stronger DINO encoder for reference integration.
- MimicBrush [4] (lines 101-102): Employs a dual U-Net architecture with explicit reference feature encoding.
- Further Evaluation with GPT-4o: Given the lack of directly comparable methods in existing work, we further evaluated GPT-4o on our dataset to benchmark against a more advanced model. We provided the reference image and the masked background as input, using the prompt: "Place the object from the first image into the masked area of the second image." Due to API costs and long generation time, we randomly sampled 200 examples and compared GPT-4o with our method and MimicBrush.
| Methods | CLIP-score (↑) | DINO-score (↑) |
|---|---|---|
| Mimicbrush | 79.49 | |
| GPT-4o | 87.80 | |
| Ours | 90.42 | 85.12 |
Our findings indicate that GPT-4o generates highly coherent images with realistic lighting. However, it sometimes struggled with object fidelity, where the generated object's shape or design deviated from the reference image. This is reflected in its lower CLIP-score and DINO-score compared to our method. We will include this new comparison in our revised paper. We believe benchmarking against a state-of-the-art model like GPT-4o provides a more comprehensive evaluation.
Mathematical Justification (Equations 5–7)
We are happy to clarify the reasoning behind our mathematical formulation. We acknowledge that our derivations are based on certain modeling assumptions. Below, we explain the rationale behind each equation in more detail.
- Equation (5): This equation is grounded in a key empirical property of inpainting models: they are trained to preserve unmasked regions with high fidelity. This arises from both the task design, where known regions are provided as input, and the loss function, which is applied over the entire image. We consider two different model inputs: (1) by masking the object in the ground truth $\boldsymbol{I}_{\mathrm{gt}}$ to get the masked background $\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}}$ and applying the aforementioned property, we find the model's prediction for the background region is essentially identical to that in $\boldsymbol{I}_{\mathrm{gt}}$; (2) by masking the background in $\boldsymbol{I}_{\mathrm{gt}}$ to retain only the object (the object-only input defined in our paper) and applying the same property, we find the prediction for the object region is also essentially identical to that in $\boldsymbol{I}_{\mathrm{gt}}$. Combining the masked outputs of these two ideal predictions would yield the ground truth $\boldsymbol{I}_{\mathrm{gt}}$.
- Equation (6): Equation (6) generalizes the principle of Equation (5) from the pixel level to the feature level. This generalization is based on a key assumption regarding the U-Net's architecture, for which the rationale is two-fold: (a) convolutional and MLP layers operate locally, and (b) cross-attention mechanisms are primarily designed to inject object features into the masked region, thus having minimal impact on the unmasked background features. Based on this assumption, we posit that the features at locations corresponding to the masked background input and those corresponding to the object-only input can be combined to approximate the features of the complete ground-truth image $\boldsymbol{I}_{\mathrm{gt}}$.
- Equation (7): Equation (7) provides a high-level conceptual model of the reference encoding process. It assumes the existence of an ideal transformation, $\mathcal{R}$, that maps the features of the input reference image to the features that would have been generated if the ground-truth object were present. In practice, different models learn different approximations of this mapping (see the discussion in Section 4.2.2 Comparison with Previous Works, lines 194-213). Our formulation with parameter sharing is simply a novel and efficient way to learn this transformation.
We will update the explanation accordingly in the revised version.
Originality in the model architecture
We understand the reviewer’s concern about architectural novelty. However, we would like to highlight that our contributions extend beyond architecture and include data and in-depth analysis. We introduce a new modeling perspective by identifying a feature inconsistency issue commonly present during training in existing methods (validated in Line 214 and Figure 3), and address it through a parameter-sharing strategy. While this architectural modification is concise, it yields significant improvements in both performance and efficiency. In addition, to the best of our knowledge, we are the first to provide a mathematical analysis and corresponding empirical validation for this approach, which we believe offers the community concrete insights into this problem space.
References
[1] Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, and Ismail B Tutar. Diffuse to choose: Enriching image conditioned inpainting in latent diffusion models for virtual try-all. arXiv preprint arXiv:2401.13795, 2024.
[2] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
[3] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
[4] Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation. Advances in Neural Information Processing Systems, 37:84010–84032, 2025.
Thank you for the detailed rebuttal.
I do not disagree with many assumptions that are provided in the mathematical justification, however, these are still not rigorous or motivated by concrete empirical results. A small experiment or citation to a different paper that indeed demonstrates that assumption would make this valid. Without that, I think the authors should make these assumptions very clear in the paper, especially due to the theoretical skew of NeurIPS.
The GPT-4o evaluation is much appreciated, however 200 samples is still quite low. With the sample size mentioned however, it would indeed be a useful addition to the paper.
Considering the points raised by other reviewers and the rebuttals, I feel inclined to keep a score of 5, but I am decreasing my confidence and the clarity rating as some of my points are still not fully addressed.
Further mathematical justification and verification experiments
Thank you for the comment. We acknowledge that some of the underlying assumptions may lack sufficient rigor. Empirical validation of these assumptions is a valuable suggestion. In the following, we conduct experiments to verify the core assumptions proposed in Equations (5)–(7).
Equation (5): the core assumption is that the inpainting diffusion model has the ability to preserve unmasked regions with high fidelity. Therefore, we aim to demonstrate that the prediction merged from partial inputs can effectively approximate the prediction on the ground truth (GT) image $\boldsymbol{I}_{\mathrm{gt}}$. To verify this, we conducted experiments using a pretrained diffusion inpainting model on our full test set. During the experiments, we added random noise to the inputs and sampled random timesteps $t$, then fed the model with partial content from the full image (background or object). We extracted predictions from the corresponding regions and used the $\ell_2$ loss as the evaluation metric, i.e.,
$$\left\|\epsilon_\theta(\boldsymbol{I}_{\mathrm{gt}}) - \left(\boldsymbol{M}_{\mathrm{bg}} \odot \epsilon_{\theta}(\boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}}) + \overline{\boldsymbol{M}_{\mathrm{bg}}} \odot \epsilon_{\theta}\big(\mathcal{R}(\boldsymbol{I}_{\mathrm{ref}}^{\mathrm{M}} \mid \boldsymbol{I}_{\mathrm{bg}}^{\mathrm{M}})\big) \right)\right\|_2 .$$

As shown in the table below:
| Training Loss (Avg.) | Region Merging Loss (Avg.) |
|---|---|
Here, the training loss refers to the average loss during model training (as can be verified from Figure 3 of the paper). Notably, the merged result—obtained by feeding the model with separate parts and combining the corresponding outputs—yields a significantly lower loss than the training loss, supporting the validity of the assumption made in Equation (5).
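For clarity, a minimal sketch of the region-merging loss above, assuming the three noise predictions have already been computed (tensor names are placeholders):

```python
import torch

def region_merging_l2(eps_gt: torch.Tensor, eps_bg: torch.Tensor,
                      eps_obj: torch.Tensor, m_bg: torch.Tensor) -> torch.Tensor:
    """L2 distance between the prediction on the full ground-truth input (eps_gt)
    and the merge of two partial-input predictions: eps_bg on the background
    region m_bg, and eps_obj (object-only / reference-mapped input) on the
    complement. All tensors share shape (B, C, H, W); m_bg is a binary mask
    broadcastable over channels."""
    merged = m_bg * eps_bg + (1.0 - m_bg) * eps_obj
    return torch.linalg.vector_norm(eps_gt - merged, ord=2, dim=(1, 2, 3)).mean()
```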
Equation (6): the assumption in Equation (6) is that a property similar to that in Equation (5) also holds at the feature level. To better assess this from a similarity perspective, we adopt cosine similarity as the evaluation metric. Specifically, for each attention layer we compute the cosine similarity between the merged features and the features of the full ground-truth input on the test set.
The results are presented below, where 'd', 'm', and 'u' denote down, mid, and up layers, respectively.
| Layer | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cos_sim | 0.962 | 0.955 | 0.942 | 0.931 | 0.922 | 0.917 | 0.912 | 0.904 | 0.906 | 0.913 | 0.921 | 0.925 | 0.927 | 0.937 | 0.941 | 0.945 |
It can be observed that the cosine similarities are consistently high (> 0.9) across many layers, which supports the validity of our assumption in Equation (6).
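A minimal sketch of this per-layer measurement, assuming the merged and ground-truth features have already been extracted from the corresponding attention layers (shapes and names are placeholders):

```python
import torch
import torch.nn.functional as F

def layer_cosine_similarity(feat_merged: torch.Tensor, feat_gt: torch.Tensor) -> float:
    """Mean cosine similarity between merged features (background and object
    regions combined as in Eq. (6)) and features of the full ground-truth input,
    for one attention layer. Shapes are assumed to be (B, tokens, C)."""
    sim = F.cosine_similarity(feat_merged.flatten(0, 1), feat_gt.flatten(0, 1), dim=-1)
    return float(sim.mean())
```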
Equation (7): It presents an existence assumption—namely, that it is possible to learn a transformation between features such that the fused features approximate the ground-truth (GT) features. We provide empirical support for this assumption in Figure 3 of the paper: (1) The feature loss of different methods consistently decreases throughout the training process, indicating that the model is indeed learning to align the fused features with the GT features; (2) Low feature losses across multiple layers further demonstrate a high degree of feature consistency.
Regarding the sample size of GPT-4o
Thank you for the comment. The reason we evaluated GPT-4o on a subset of only 200 images is primarily due to practical constraints. GPT-4o's API inference is extremely slow, averaging about one minute per image, and incurs a significant cost of about \$600, a substantial expense that is challenging for smaller research labs with limited budgets. However, if this evaluation approach is demonstrated to be broadly applicable and beneficial, we will consider conducting experiments on the full test set in future work.
I thank the authors for their detailed rebuttal.
This has clarified my questions, and I believe these experiments should be added to the paper, at least in the appendix since even simple but rigorous experiments provide a more solid foundation for future research to build off.
I have increased the confidence and clarity scores of my review.
Thank you very much for your detailed response; we are very pleased to hear that it addresses your concerns. We sincerely appreciate your recognition of our work and your continued strong support. Your thoughtful comments have greatly improved our work and will definitely be incorporated into our revised manuscript. Many thanks again for your time and valuable insights.
The paper addresses the task of virtual furniture synthesis by contributing two main elements:
RoomBench Dataset: A new large-scale benchmark containing 7,298 training pairs and 895 test pairs across 27 furniture categories, designed to evaluate both geometric consistency and visual realism.
RoomEditor Method: A parameter-sharing dual U-Net architecture that enforces a unified feature space between the reference and background branches, leading to improved geometric alignment and harmonious composite results. Extensive experiments show that RoomEditor outperforms state-of-the-art baselines (PBE, AnyDoor, MimicBrush) on FID, SSIM, PSNR, LPIPS, CLIP-score, and DINO-score, and exhibits superior zero-shot generalization on the DreamBooth dataset.
Strengths and Weaknesses
Strengths
Valuable Public Benchmark: RoomBench fills a notable gap in furniture synthesis by providing a high-quality, publicly available dataset that supports reproducible evaluation.
Elegant Architecture: The parameter-sharing dual U-Net is a simple yet effective design that directly addresses feature inconsistency in prior dual-branch networks.
Weaknesses
- The number of user study participants is too small. A user study with only 10 people is not credible enough.
- The failure cases are not presented in sufficient detail, and the analysis of their causes is not comprehensive, which makes me doubt the results.
- The method does not involve any explicit modeling of conditions such as lighting changes. I have doubts about the generalizability of the results of image composition under such conditions. Can you provide some examples with different lighting intensities to prove the generalizability of the method in other situations?
Questions
Please refer to the weaknesses.
Limitations
Please refer to the weaknesses.
Final Justification
Thank you for addressing my questions on Q3. As for Q1, I believe the results from various experiments effectively demonstrate the advantages of the proposed method.
Therefore, I raised my rating.
Formatting Issues
No
Sample Size of User Study
We thank the reviewer for pointing out this issue. Following the protocol in the MimicBrush paper [1], we initially conducted a user study with 10 participants. To strengthen the reliability of our results, we conducted an additional study with 15 more participants (total 25). The participants comprised 14 undergraduate students, 7 of their parents (to provide a non-technical perspective), and 4 university faculty or staff members, all of whom were explicitly informed of the evaluation criteria.
| Method | Fidelity Best (%) | Fidelity Rank↓ | Harmony Best (%) | Harmony Rank↓ | Quality Best (%) | Quality Rank↓ |
|---|---|---|---|---|---|---|
| PBE | 5.1 | 3.36 | 18.0 | 2.85 | 4.5 | 3.34 |
| AnyDoor | 8.9 | 2.96 | 7.6 | 3.16 | 3.5 | 3.20 |
| MimicBrush | 29.4 | 2.11 | 30.5 | 2.16 | 38.3 | 1.80 |
| RoomEditor (Ours) | 56.6 | 1.57 | 43.9 | 1.82 | 53.6 | 1.66 |
The findings remain consistent—our method outperforms existing approaches across all evaluated metrics. These updated results will be included in the revised version.
Failure Case and Cause Analysis
Thank you for the feedback. An initial failure case analysis was provided in the appendix. Here, we offer a more detailed explanation, as figures cannot be included.
- Our model may occasionally struggle with capturing complex furniture shapes, as the large intra-category variation in geometry makes it difficult for the dataset to fully represent all possible shape patterns, thereby affecting the learned shape priors (e.g., Figure 11 shows distorted shapes).
- Since our method edits only the masked region, shadows outside the mask cannot be generated. This is a common limitation shared by PBE [2], AnyDoor [3], and MimicBrush [1].
In summary, the main causes are the diversity of real-world furniture types and inherent constraints of the masked-editing paradigm. We will include more visualizations and in-depth analysis in the revised version.
Robustness to Lighting Changes
Thank you for raising this important point. Since our original test set does not account for lighting variation, we conducted additional experiments to evaluate the robustness of our model under different illumination conditions by adjusting the image brightness. Specifically, we applied lighting intensity adjustments to the background images with scaling factors of 0.6 and 0.8 (darker), as well as 1.2 and 1.4 (brighter), and conducted evaluations on the full test set. We compared our method with MimicBrush and computed the average scores (all metrics were rescaled to a 0–100 range, with FID and LPIPS inverted), as shown below.
| Light intensity | 0.6 | 0.8 | 1.0 | 1.2 | 1.4 |
|---|---|---|---|---|---|
| MimicBrush | 70.58 | 71.16 | 71.92 (base) | 70.77 | 69.97 |
| Ours | 73.80 | 73.96 | 74.77 (base) | 73.48 | 72.76 |
Since our model design and data construction did not explicitly consider extremely dark or bright lighting conditions, there was a slight performance drop. However, it is worth noting that the degree of performance drop is comparable to that of MimicBrush, so our method still maintains a significant lead under the same settings. We will include lighting-related experiments, analysis, and visual results in the revised version.
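For concreteness, the brightness adjustment used in this test can be implemented as a simple per-image enhancement; a minimal sketch assuming PIL, with the scaling factors from the table above:

```python
from PIL import Image, ImageEnhance

def scale_lighting(background: Image.Image, factor: float) -> Image.Image:
    """Adjust background brightness: factor < 1.0 darkens (0.6, 0.8),
    factor > 1.0 brightens (1.2, 1.4), and factor = 1.0 is the base setting."""
    return ImageEnhance.Brightness(background).enhance(factor)
```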
References
[1] Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation. Advances in Neural Information Processing Systems, 37:84010–84032, 2025.
[2] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
[3] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
Dear Reviewer,
I hope this message finds you well. As the discussion period is nearing its end with less than three days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.
Thank you for your time and effort in reviewing our paper.
Thank you for your time and effort in reviewing our paper. We have submitted our response to the reviewers' comments in the rebuttal stage. Please let us know if there are any further questions or points for discussion, and we will be sure to respond promptly.
This paper presents a timely and valuable contribution to the field of virtual furniture synthesis by introducing RoomBench, a new ready-to-use public benchmark, and RoomEditor, a dual U-Net model with a parameter-sharing mechanism.
The establishment of the RoomBench dataset addresses a significant gap in the field, providing a standardized and diverse platform for evaluating furniture layout synthesis tasks. This benchmark alone constitutes a meaningful contribution to the community, enabling fair comparisons and future research.
The proposed RoomEditor model, while building upon established U-Net architectures, introduces a well-motivated parameter-sharing approach between its dual branches. The authors provide not only intuitive explanations but also mathematical modelling to justify their design choices, adding theoretical depth to their methodological innovation.
The experimental validation is comprehensive and rigorous. The model demonstrates strong performance across multiple metrics (FID, SSIM, PSNR, LPIPS) on both RoomBench and the DreamBooth dataset, showing excellent generalization capability. The inclusion of a user study is particularly valuable, as it aligns with human-centric evaluation principles that are crucial for practical applications in interior design.
While the paper could be strengthened by more extensive comparisons with recent text-to-image editing models and specialized furniture synthesis methods (e.g., Diffuse-to-choose), and while the theoretical justification could be further empirical, these limitations do not diminish the overall significance of the contributions.
The combination of a novel benchmark, a well-designed model with thoughtful architectural modifications, comprehensive evaluation, and demonstrated generalizability makes this paper a solid contribution worthy of acceptance. The RoomBench dataset has particular potential to become a standard evaluation platform in this emerging research area.