PaperHub
Overall score: 6.8 / 10
Poster · 4 reviewers (ratings: min 3, max 5, std 0.8)
Ratings: 5, 5, 4, 3
Average confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.5 · Significance: 3.3
NeurIPS 2025

PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement

OpenReview · PDF
Submitted: 2025-05-04 · Updated: 2025-10-29
TL;DR

We introduce PixPerfect, a general pixel refinement framework that removes seams, color shifts, and texture artifacts from latent diffusion outputs, achieving seamless, high-quality image inpainting and local editing.

Abstract

Keywords
image editing, image refinement, diffusion model

Reviews and Discussion

Official Review

Rating: 5

This paper proposes a refinement method for inpainted or local-edited images produced by various generative models. The approach involves training a GAN network on large-scale data with various types of simulated artifacts. Meanwhile, two types of loss functions, one in the pixel space and the other in a discriminative pixel space, are designed to optimize the network. The proposed method is compatible with various image generation models, demonstrating strong generalizability. It also offers certain advantages in terms of efficiency.

Strengths and Weaknesses

Strengths

  • The writing in this paper is smooth and clear, with rich detail throughout.
  • The proposed approach demonstrates strong generalizability and can be applied to various image local-editing methods. Moreover, it exhibits excellent inference performance, which is highly practical for real-world applications.
  • The experiments in this paper are relatively comprehensive, and the effectiveness of the proposed approach is significant from the perspective of both visualization and quantitative analysis.

Weaknesses

  • I wonder whether pixel-level inconsistencies along editing boundaries are actually a serious issue. If there are any relevant experiments or data, could you please provide them? Additionally, would simply achieving good performance with the LDM used for editing be sufficient to address this problem?

Questions

  • I have a question regarding the amount of data required for training: to what extent does the model's performance depend on the amount of data? After all, 300 million images is not a small dataset.

Limitations

Yes.

Justification for Final Rating

My concerns have been largely addressed. During the rebuttal stage, the authors provided thorough explanations. Reviewer wXb9 also raised questions regarding the dataset construction, and the authors gave detailed clarifications about the data composition. Overall, this paper achieves strong results with a simple approach, which is the key factor behind my high evaluation of it.

Formatting Concerns

I have a minor suggestion regarding the formatting: for Table 1, it would be more intuitive to place the same models on adjacent rows, for example, SDv1.5 and SDv1.5-PixPerfect, since the paper is not comparing differences between different LDMs, but rather the effect of applying PixPerfect to the same LDM.

Author Response

We appreciate your positive and constructive feedback. We are especially grateful for the recognition of our work's strong generalizability, practicality for real-world applications, and comprehensive experiments. We appreciate the opportunity to address the remaining questions.

  1. "I wonder whether pixel-level inconsistencies along editing boundaries are actually a serious issue. If there are any relevant experiments or data, could you please provide them? Additionally, would simply achieving good performance with the LDM used for editing be sufficient to address this problem?"

We thank the reviewer for raising this important point. In our experiments, we found that pixel-level inconsistency is a systematic issue that persists across a wide range of inpainting and editing models, including the most recent state-of-the-art methods such as FLUX-Fill [1] and OmniPaint [2]. In fact, the artifacts shown in Figure 4 and Figure 5 persist across most samples in the testing sets, showing clearly visible hue seams and texture discontinuities. To facilitate reproducibility, we have included demo code at the end of this rebuttal to allow reviewers to observe this pixel-level inconsistency.

While improving the performance of the latent diffusion model (LDM) used for editing could potentially reduce the severity of these artifacts, we would like to emphasize that the compressive nature of the latent space in LDMs inherently introduces distortion in low-level patterns. As a result, texture mismatches and noise artifacts remain difficult to avoid. Recent studies [3, 4] further demonstrate that the latent space is inherently spatially entangled, making it ill-suited for enforcing pixel-accurate boundary consistency. These observations underscore the need for a pixel-space framework that can explicitly address and correct inconsistencies at the pixel level.

[1] Black Forest Labs. FLUX.1 Fill: open-weight rectified-flow transformer for inpainting & outpainting, 2025.
[2] Yu, Yongsheng, et al. "Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting.", ICCV'25.
[3] Hou, Xingzhong, et al. "Towards Seamless Borders: A Method for Mitigating Inconsistencies in Image Inpainting and Outpainting." arXiv preprint arXiv:2506.12530 (2025).
[4] Wang, Yikai, et al. "Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency." CVPR'25.

  2. I have a minor suggestion regarding the formatting: for Table 1, it would be more intuitive to place the same models on adjacent rows, for example, SDv1.5 and SDv1.5-PixPerfect, since the paper is not comparing differences between different LDMs, but rather the effect of applying PixPerfect to the same LDM.

Thank you for the suggestion. We will make sure to adjust the formatting to better highlight the effect of our method.


Demo code for reproducing FLUX-Fill seams
The following snippet provides a minimal setup for observing hue seams and texture discontinuities introduced by FLUX-Fill. Please ensure the model weights and dependencies are properly set up.

import torch
import numpy as np
from PIL import Image, ImageDraw
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

# === Define input image path ===
input_image_path = "/your/image/path" # TODO: change to the input image path

# === Load input image ===
image = load_image(input_image_path).convert("RGB")
width, height = image.size

# === Generate irregular mask ===
def generate_irregular_mask(width, height, max_shapes=5):
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)

    for _ in range(np.random.randint(1, max_shapes + 1)):
        shape_type = np.random.choice(["ellipse", "polygon"])
        if shape_type == "ellipse":
            x0, y0 = np.random.randint(0, width - 50), np.random.randint(0, height - 50)
            x1, y1 = x0 + np.random.randint(40, 120), y0 + np.random.randint(40, 120)
            draw.ellipse([x0, y0, x1, y1], fill=255)
        else:
            num_points = np.random.randint(3, 8)
            points = [(np.random.randint(0, width), np.random.randint(0, height)) for _ in range(num_points)]
            draw.polygon(points, fill=255)

    return mask.convert("RGB")

mask = generate_irregular_mask(width, height)

# === Load FLUX inpainting pipeline ===
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev",
    torch_dtype=torch.bfloat16
).to("cuda")

# === Run inpainting ===
output = pipe(
    image=image,
    mask_image=mask,
    prompt="",
    height=height,
    width=width,
    guidance_scale=30, # The default value provided on the official huggingface page
    num_inference_steps=50, # The default value provided on the official huggingface page
    max_sequence_length=512 # The default value provided on the official huggingface page
).images[0]

# === Composite: restore unmasked regions from original image ===
image_np = np.array(image)
output_np = np.array(output)
mask_np = np.array(mask.convert("L"))
inpainted_np = output_np.copy()
inpainted_np[mask_np < 128] = image_np[mask_np < 128]
inpainted = Image.fromarray(inpainted_np)

# === Save outputs ===
image.save("original.png")
mask.save("mask.png")
inpainted.save("inpainted.png")
Comment

Thanks to the authors for the detailed explanation. The demo code provided for reproduction also offers more convincing evidence of the issue of "pixel-level inconsistencies along editing boundaries". I basically agree with the authors' point of view: the compressive nature of the latent space in LDMs inherently results in distortion of low-level patterns, which is indeed difficult to avoid as long as compression exists in the LDM. However, I still have one question: how does the dataset scale affect the performance of the approach? After all, 300 million images is not a small dataset.

Comment

Thank you for the insightful question. Our refiner is a lightweight GAN-based model with 41M parameters, designed specifically to correct low-level artifacts such as color shifts and texture or noise-pattern mismatches along editing seams. Since this low-level refinement task primarily preserves the structure of the input image rather than synthesizing new semantic content, it does not require a training set containing hundreds of millions of images. In addition, our data augmentation pipeline is designed to produce highly randomized and diversified variations of artifacts, which further reduces the need for extremely large datasets. In our experiments, we trained the model on our in-house dataset, which happens to contain approximately 300M images, but our approach does not depend on this scale. A medium-scale dataset such as Places365-Challenge [1], combined with our augmentation pipeline, is also viable for this task.

[1] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE.

Comment

Dear Reviewer W1zs,

We noticed that you have not yet replied to the authors' rebuttal, and we are currently not able to see your response. We hope we have adequately addressed all your concerns in our previous discussions. We would be happy to continue the discussion here if there are any remaining points you would like us to clarify.

Best regards, The Authors

Official Review

Rating: 5

This paper introduces a refining method for generative in- and outpainting approaches. Recent generative models shine at conditional image generation, but they often struggle with pixel-perfect in- and outpainting, frequently leading to noticeable artifacts inside the masked region. The authors propose to solve these artifacts with a pixel-space refiner module, trained with various image degradations. Additionally, a pixel-space color discriminator is introduced to help localize subtle color mismatches between the edited and original regions.

The proposed method can be applied to any backbone inpainting method and consistently improves inpainting quality.

Strengths and Weaknesses

Strengths

  • S1 The proposed method is simple and agnostic to the backbone.
  • S2 Impressive quality with visible and clear improvement over previous methods in a comprehensive comparison.
  • S3 Well-written paper, easy to follow and understand.
  • S4 The tackled task is interesting for a broad community.

Weaknesses

  • W1 - Ablating design choices: Although the authors did a good job of evaluating the key components of the method, it would be interesting to see how the quality changes with specific design choices, e.g. what if the discriminative pixel space is not 3-dimensional, but more (or less)?

Questions

Figures

  • Figure 1: What does the dashed arrow to the decoder in latent space mean?

Limitations

yes

Justification for Final Rating

The paper provides a simple way to improve upon a heavily used downstream task of diffusion models, inpainting, making it interesting for a wide community. My only remaining concern is to see an ablation on the discriminative space dimensionality, which should be possible to show in the final revision.

Formatting Concerns

No concerns.

Author Response

We sincerely thank the reviewer for the valuable comments, and we are grateful for the recognition of the merits of our method, including its simple, backbone-agnostic design and its impressive quality with clear improvements. We also appreciate your acknowledgment that the task we address is of interest to a broad community.

Based on the positive assessment, we would like to respectfully elaborate on the remaining questions.

  1. It would be interesting to see how the quality changes with the specific design choices. E.g. what if the discriminative pixel space is not 3-dimensional, but more (or less)?

Thank you for this insightful suggestion. Inspired by classic color space transformations [1,2], we adopt a direct mapping from the standard RGB space to a discriminative pixel space of equal dimensionality. This keeps the transformation both simple and easily interpretable.

Here we present the ablation results for the design choices of (1) different polynomial regression degrees and (2) other methods such as a Haar-based loss.

| Method | FID ↓ | LPIPS ↓ | L1 ↓ |
| --- | --- | --- | --- |
| 6-degree regression loss (PixPerfect) | 10.8675 | 0.1414 | 0.0363 |
| 2-degree regression loss | 11.2244 | 0.1431 | 0.0362 |
| 10-degree regression loss | 11.0018 | 0.1407 | 0.0361 |
| Haar loss | 11.3816 | 0.1431 | 0.0375 |

It would be interesting to test other designs; we leave this as future work.

  2. What does the dashed arrow to the decoder in latent space mean?

The dashed arrow denotes an optional pathway where in certain design variants, the decoder can take the original image as an additional condition to improve background consistency [3]. We will clarify this in the final version.

[1] CIE. Uniform Color Spaces (CIELUV). Supplement 2 to CIE 15: Colorimetry, 1978.
[2] Smith, A. R. Color Gamut Transform Pairs. SIGGRAPH Proceedings, 1978 (introduces HSV).
[3] Z. Zhu, et al. Designing a Better Asymmetric VQGAN for StableDiffusion. arXiv:2306.04632, 2023.

Comment

I thank the authors for the provided ablation on the regression loss. Overall, I am still positive about the paper, since it provides a simple way to improve upon a heavily used downstream task of diffusion models: inpainting. I see that a 3-channel discriminative space is easier to interpret, but higher dimensions can still be projected to 3 for inspection purposes, like using PCA. Therefore, I hope to see an ablation on the pixel space dimensionality in the revision of the paper.

Comment

We would like to express our thanks for the positive assessment and for your suggestion regarding high-dimensional discriminative spaces. We will add an ablation on the high-dimensional pixel space and report the corresponding performance. To ensure we match your intent: are you suggesting applying the discriminative-space transformation to a d-dimensional embedding (e.g., a perceptual / VGG feature space) rather than to the RGB space itself? Thanks again!

Comment

Yes, I mean to use a higher dimensional discriminative space, not RGB.

Comment

Thank you for raising this insightful point. We implemented a variant of our loss that applies the discriminative transformation to VGG16 feature maps before computing the LPIPS loss. When trained with this high-dimensional discriminative-space variant, our model achieves competitive performance (see table below), suggesting that our method generalizes to a higher-dimensional discriminative space. However, its performance is slightly worse than the original 3-dimensional implementation, which we hypothesize is due to the loss of spatial precision along editing boundaries caused by spatial downsampling of the feature maps. We will include this ablation in the revised version of the paper and are interested in further exploring high-dimensional discriminative spaces in future work. We sincerely appreciate your valuable feedback!

| Method | FID ↓ | LPIPS ↓ | L1 ↓ |
| --- | --- | --- | --- |
| 3-dimensional loss (PixPerfect) | 10.8675 | 0.1414 | 0.0363 |
| High-dimensional loss | 11.0525 | 0.1421 | 0.0360 |

Official Review

Rating: 4

This paper introduces PixPerfect, a pixel-level refinement framework designed to address artifacts from local edits by inpainting models. The pipeline includes standard latent space inpainting, a pixel-space refiner, and a discriminative pixel space to enhance and correct subtle color and texture inconsistencies. Additionally, the authors propose an artifact simulation pipeline during training, covering non-uniform color shifts, texture mismatches, and content discontinuities. Extensive experiments and ablation studies are presented to demonstrate the effectiveness of the method.

Strengths and Weaknesses

Strengths

  • The paper is well-written and easy to understand.
  • The motivation is clear, focusing on solving artifacts caused by LDM models during the inpainting process.
  • The synthetic artifact simulation pipeline is well-designed and appears effective.
  • The evaluations are comprehensive, covering benchmarks like MISATO, Places2, and RORDS, and comparing against SOTA methods.

Weaknesses

  • I understand that the purpose of $y_{amp}$ is to highlight error maps and focus more on problematic areas. However, it is unclear why the authors chose polynomial regression with amplified attention instead of using frequency-domain techniques like FFT or Haar, which could also re-weight error areas. This choice is not explained or validated through ablation studies.
  • I also have doubts about the results in Figure 4. From my experience with FLUX-Fill, the boundary artifacts shown in column 2 don't seem as severe and obvious. I tested FLUX-Fill models during my review and did not observe such artifacts. Could you clarify the hyper-parameter settings, such as guidance scale or controlling strength, used during inference?
  • If FLUX-Fill already produces decent results, the practical impact of the proposed method may be limited.

Questions

I see the responses to my two main concerns: (1) why do we need to use regression, rather than FFT or HAAR? (2) the baseline quality of FLUX-Fill.

Besides, I have some further questions or small points:

  • Is the trained refiner a general module that works across different LDM inpainting models, or are there separate refiners for each model?
  • Since the training uses the synthetic artifact simulation pipeline, does this mean the LDM inpainting model is not involved during the refiner's training?
  • The font size in the tables is inconsistent, for example, in Table 4.
  • What base model is used in Figure 5?

Limitations

yes

Justification for Final Rating

The rebuttal by the authors has addressed my concerns. I have decided to raise my rating to a positive score.

Formatting Concerns

NA

Author Response

We sincerely thank the reviewer for the valuable feedback and appreciate your recognition of our contributions, including the clear motivation, the well-designed simulation pipeline, and the comprehensive benchmarks. We address all raised concerns as follows.

  1. I understand that the purpose of $y_{amp}$ is to highlight error maps and focus more on problematic areas. However, it is unclear why the authors chose polynomial regression with amplified attention instead of using frequency-domain techniques like FFT or HAAR, which could also re-weight error areas. This choice is not explained or validated through ablation studies.

Thank you for raising this insightful point. In early experiments we did test frequency-domain re-weighting: we decomposed the predicted image with a Haar decomposition, amplified the low-frequency sub-bands, and used the recombined image to compute the training loss. However, this approach consistently produced overly smooth, blurry results. The unequal scaling of the sub-bands disrupted the balance between low- and high-frequency components, leading to disrupted textures and noise patterns and thus hindering the network from learning to predict the correct textures that match the ground truth.

We further examined frequency-domain decomposition by training a model that employs a Haar-based re-weighted loss. In this variant, the prediction and ground truth are each decomposed into low- and high-frequency bands; separate $\ell_1$ losses are computed for the two bands, and the low-frequency term is assigned a larger weight to target subtle colour shifts. As the following table shows, this approach performs worse than our method. Moreover, band-wise re-weighting is only feasible with losses such as $\ell_1$ or $\ell_2$; it is incompatible with perceptual or GAN objectives, which are shown to be crucial for image synthesis.

| Method | FID ↓ | LPIPS ↓ | L1 ↓ |
| --- | --- | --- | --- |
| Our loss (PixPerfect) | 10.8675 | 0.1414 | 0.0363 |
| Haar-based re-weighted loss | 11.3816 | 0.1431 | 0.0375 |

By contrast, the polynomial regression mapping we ultimately adopted is an element-wise transformation in pixel space. Because it modifies each pixel without altering its spatial relationship with neighbouring pixels, it preserves subtle texture and noise patterns while still amplifying subtle color differences.
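
To make the comparison concrete, below is a minimal sketch of a band-reweighted $\ell_1$ loss of the kind described above, assuming a single-level Haar-style split implemented with 2x2 average pooling; the band weights `w_low` and `w_high` are illustrative placeholders rather than the exact values used in our experiments.

import torch
import torch.nn.functional as F

def band_reweighted_l1(pred, target, w_low=4.0, w_high=1.0):
    """Split each image into a low-frequency band (2x2 average pooling)
    and a high-frequency residual, then weight the low band more heavily."""
    def split(x):
        low = F.avg_pool2d(x, kernel_size=2)                          # coarse (LL-like) band
        high = x - F.interpolate(low, scale_factor=2, mode="nearest")  # fine residual
        return low, high

    p_low, p_high = split(pred)
    t_low, t_high = split(target)
    return w_low * (p_low - t_low).abs().mean() + w_high * (p_high - t_high).abs().mean()

# Toy usage on random tensors in [0, 1]
pred = torch.rand(1, 3, 256, 256, requires_grad=True)
target = torch.rand(1, 3, 256, 256)
band_reweighted_l1(pred, target).backward()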

  2. I also have doubts about the results in Figure 4. From my experience with FLUX-Fill, the boundary artifacts shown in column 2 don't seem as severe and obvious. I tested FLUX-Fill models during my review and did not observe such artifacts. If FLUX-Fill already produces decent results, the practical impact of the proposed method may be limited.

We sincerely thank the reviewer for taking the time to run hands-on tests. In addition, we would like to clarify our evaluation setting. In our experiments, we use the official FLUX-Fill model with the default hyper-parameters at 512px resolution and paste back the original background (Lines 32, 254), following the standardized inpainting setting. We found that FLUX-Fill often produces colors or texture patterns that are inconsistent with the input background. In fact, the artifacts shown in Figure 4 persist across most samples in the testing sets. To facilitate reproducibility, we attached a minimal demo script at the end of this rebuttal that reproduces the boundary artifacts for the official FLUX-Fill model.

We want to further emphasize that such issues are common and fundamental challenges for LDM-based local editing models. The compressive nature of the latent space in LDMs inherently introduces distortion in low-level patterns. As a result, texture mismatches and noise artifacts remain difficult to avoid. Recent studies [1,2] further demonstrate that the latent space is inherently spatially entangled, making it ill-suited for enforcing pixel-accurate boundary consistency. These observations underscore the need for a pixel-space framework that can explicitly address and correct inconsistencies at the pixel level.

[1] Hou, Xingzhong, et al. "Towards Seamless Borders: A Method for Mitigating Inconsistencies in Image Inpainting and Outpainting." arXiv preprint arXiv:2506.12530 (2025).
[2] Wang, Yikai, et al. "Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency." CVPR'25.

  3. Is the trained refiner a general module that works across different LDM inpainting models, or are there separate refiners for each model?

Thank you for the question. The refiner is a single, model-agnostic module. It is trained once and then applied without any additional fine-tuning to all LDM-based inpainting and local-editing backbones evaluated in the paper.

  4. Since the training uses the synthetic artifact simulation pipeline, does this mean the LDM inpainting model is not involved during the refiner's training?

Yes. Our artifact simulation pipeline is agnostic to the specific diffusion model. This design enables our trained refiner to serve as a general-purpose refinement module for various LDM-based editing pipelines without retraining (L166-L169).

  5. What base model is used in Figure 5?

As indicated in the left caption of Figure 5, the base model is OmniPaint for object insertion and Anydoor for object removal.

  6. The font size in the tables is inconsistent, for example, in Table 4.

Thanks for pointing this out. We will fix this issue in the final version.


Demo code for reproducing FLUX-Fill seams
The following snippet provides a minimal setup for observing hue seams and texture discontinuities introduced by FLUX-Fill. Please ensure the model weights and dependencies are properly set up.

import torch
import numpy as np
from PIL import Image, ImageDraw
from diffusers import FluxFillPipeline
from diffusers.utils import load_image

# === Define input image path ===
input_image_path = "/your/image/path" # TODO: change to the input image path

# === Load input image ===
image = load_image(input_image_path).convert("RGB")
width, height = image.size

# === Generate irregular mask ===
def generate_irregular_mask(width, height, max_shapes=5):
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)

    for _ in range(np.random.randint(1, max_shapes + 1)):
        shape_type = np.random.choice(["ellipse", "polygon"])
        if shape_type == "ellipse":
            x0, y0 = np.random.randint(0, width - 50), np.random.randint(0, height - 50)
            x1, y1 = x0 + np.random.randint(40, 120), y0 + np.random.randint(40, 120)
            draw.ellipse([x0, y0, x1, y1], fill=255)
        else:
            num_points = np.random.randint(3, 8)
            points = [(np.random.randint(0, width), np.random.randint(0, height)) for _ in range(num_points)]
            draw.polygon(points, fill=255)

    return mask.convert("RGB")

mask = generate_irregular_mask(width, height)

# === Load FLUX inpainting pipeline ===
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev",
    torch_dtype=torch.bfloat16
).to("cuda")

# === Run FLUX-Fill ===
output = pipe(
    image=image,
    mask_image=mask,
    prompt="",
    height=height,
    width=width,
    guidance_scale=30, # The default value provided on the official huggingface page
    num_inference_steps=50, # The default value provided on the official huggingface page
    max_sequence_length=512 # The default value provided on the official huggingface page
).images[0]

# === Composite: restore unmasked regions from original image ===
image_np = np.array(image)
output_np = np.array(output)
mask_np = np.array(mask.convert("L"))
inpainted_np = output_np.copy()
inpainted_np[mask_np < 128] = image_np[mask_np < 128]
inpainted = Image.fromarray(inpainted_np)

# === Save outputs ===
image.save("original.png")
mask.save("mask.png")
inpainted.save("inpainted.png")
Comment

Dear Reviewer x4mB,

We noticed that you have not yet replied to the authors' rebuttal, and we are currently not able to see your response. We hope we have adequately addressed all your concerns in our previous discussions. We would be happy to continue the discussion here if there are any remaining points you would like us to clarify.

Best regards, The Authors

Comment

Thank you for providing detailed explanations and testing code. I have tried a few simple cases and noticed some boundary issues. While techniques like blend diffusion can help mitigate these problems, I agree that employing a specialized model to address such issues could be another feasible and meaningful research direction. At this point, I am inclined to increase my rating, provided no additional concerns are raised by other reviewers.

Comment

Thank you for carefully testing our code and for recognizing our approach as a feasible and meaningful research direction. We agree that diffusion-side techniques (e.g., blend diffusion, LayerDiffuse) can help mitigate boundary issues. However, the compression inherent to the latent space in LDMs often introduces distortion in low-level patterns, leading to texture and noise-pattern mismatches that are difficult to avoid with diffusion techniques alone. A possible future direction is to combine a pixel-space model with diffusion-side techniques to further enhance boundary quality.

We have also carefully addressed the concerns and questions raised by all other reviewers. We greatly appreciate your positive assessment and openness to increasing the rating.

Official Review

Rating: 3

This paper introduces PixPerfect, a pixel-level refinement framework designed to address pixel-level inconsistencies (such as chromatic shifts, texture mismatches, and visible seams) that occur in local editing tasks using latent diffusion models (LDMs). The paper constructs a comprehensive artifact simulation pipeline to generate a large amount of diverse data and maps images from the RGB pixel space to a differentiable discriminative pixel space, amplifying the differences between the mask region and the original background region. This paper improves the performance of LDMs in local editing tasks at both the data and algorithm levels.

Strengths and Weaknesses

Strengths:

1. The scenario studied in this paper is highly novel. Although it is relatively specific, it is crucial to the final effect of image editing. The proposed PixPerfect model successfully solves the problem of artifacts generated in image editing based on latent diffusion models (LDMs) and effectively improves the effect of image editing.
2. In terms of algorithms, this paper proposes that using a discriminative pixel space during training can effectively solve the artifact problem in image editing. This is a simple and direct method, which only needs to add the GAN + reconstruction loss in the discriminative pixel space to the GAN + reconstruction loss obtained in the standard pixel space, and the dimensions of the standard pixel space and the discriminative pixel space are the same.
3. The synthetic data pipeline proposed in this paper focuses on four aspects: non-uniform color shifting, texture-pattern mismatch, content discontinuities, and mixing of soft and hard boundaries, which basically covers the common artifact problems in image editing.

Questions

  1. In Section 3.1, when designing the tone mapping function, the authors directly consider the polynomial regression approach. Is there any theoretical validation for this approach? Could you provide some references?

  2. In lines 151-152, the authors mention that this tone mapping function is a closed-form, sample-specific, differentiable mapping. What is the data volume based on which this tone mapping function is determined? Do the parameters of this mapping function need to be specially adjusted for each dataset? How to ensure the generalization ability of this mapping function?

  3. Can the parameters of this tone mapping function be adjusted? Will the degree of the polynomial affect the final result? There seems to be no relevant ablation experiment later.

  4. In Section 4.1, the authors state that they used a data volume of 300M. Are all these data generated through the artifact data production pipeline? Could you explain the data volume ratio of each artifact type?

Limitations

See questions

Formatting Concerns

No

Author Response

We sincerely thank the reviewer for the detailed feedback. We are encouraged that the reviewer found the scenario we study highly novel and crucial for image editing, and our method better than or competitive with existing methods. We appreciate the opportunity to address the concerns raised and clarify the details.

Regarding the Weaknesses and Questions

  1. In Section 3.1, when designing the tone mapping function, the authors directly consider the polynomial regression approach. Is there any theoretical validation for this approach? Could you provide some references?

The tone mapping function is designed to construct a discriminative pixel space that amplifies subtle inconsistencies. In the literature, classical color space transformations such as CIELUV and HSV [1,2] are established examples of non-linear remappings that make chromatic differences more perceptually uniform. Following this principle, we introduce a sample-adaptive, low-degree polynomial mapping that naturally emphasizes discrepancies at editing seams while minimally affecting consistent regions. This polynomial formulation admits a closed-form solution and is fully differentiable, making it suitable for integration into the training pipeline.

[1] CIE. Uniform Color Spaces (CIELUV). Supplement 2 to CIE 15: Colorimetry, 1978.
[2] Smith, A. R. Color Gamut Transform Pairs. SIGGRAPH Proceedings, 1978 (introduces HSV).

  2. In lines 151–152, the authors mention that this tone mapping function is a closed-form, sample-specific, differentiable mapping. What is the data volume based on which this tone mapping function is determined? Do the parameters of this mapping function need to be specially adjusted for each dataset? How to ensure the generalization ability of this mapping function?

To clarify, the tone mapping function is not trained over a dataset; it is computed dynamically per image during training, based on a closed-form polynomial regression fitted to a set of pixels sampled from the foreground (masked) and background (unmasked) regions of that specific image. This sample-specific design ensures that the mapping always highlights the actual artifact regions in each training instance, making the learning signal highly relevant and robust.

Since the mapping is calculated with a closed-form solution for each individual sample, it does not require manual tuning or parameter adjustment across datasets. Its role is to amplify chromatic and textural differences in a differentiable way, providing a stronger training gradient for the refinement network. Generalization is achieved through the refiner, which learns to correct a wide range of artifact patterns, while the tone mapping serves as an auxiliary projection to emphasize discrepancies more effectively than conventional RGB losses.
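
For illustration, the following is a minimal sketch of how such a sample-specific mapping can be fitted and used, assuming a closed-form per-channel least-squares polynomial fit on pixel pairs sampled from one image and an $\ell_1$ loss computed in the mapped space; the regression targets, sampling scheme, and degree below are placeholders rather than our exact construction.

import torch

def fit_polynomial_map(src, dst, degree=6):
    """Closed-form per-channel least-squares fit of polynomial coefficients
    mapping sampled source pixel values (N, 3) to target values (N, 3)."""
    coeffs = []
    for c in range(src.shape[1]):
        powers = torch.stack([src[:, c] ** k for k in range(degree + 1)], dim=1)  # Vandermonde matrix
        sol = torch.linalg.lstsq(powers, dst[:, c:c + 1]).solution                # (degree + 1, 1)
        coeffs.append(sol.squeeze(1))
    return torch.stack(coeffs)  # (3, degree + 1)

def apply_polynomial_map(img, coeffs):
    """Apply the fitted per-channel polynomial element-wise (differentiable in img)."""
    channels = []
    for c in range(img.shape[1]):
        terms = [coeffs[c, k] * img[:, c] ** k for k in range(coeffs.shape[1])]
        channels.append(torch.stack(terms).sum(dim=0))
    return torch.stack(channels, dim=1)

# Toy example for one training pair: fit the mapping on pixels sampled from the
# image, then compute an L1 loss between prediction and ground truth in the
# mapped ("discriminative") space. The regression target below (a simple
# contrast-stretched copy of the samples) is a stand-in, not the paper's construction.
target = torch.rand(1, 3, 64, 64)
pred = (target + 0.05 * torch.randn_like(target)).clamp(0, 1).requires_grad_(True)

idx = torch.randperm(64 * 64)[:512]
src = target.flatten(2)[0, :, idx].T            # (512, 3) sampled pixel values
dst = ((src - 0.5) * 1.5 + 0.5).clamp(0, 1)     # placeholder "amplified" targets
coeffs = fit_polynomial_map(src, dst)

loss = (apply_polynomial_map(pred, coeffs) - apply_polynomial_map(target, coeffs)).abs().mean()
loss.backward()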

  3. Can the parameters of this tone mapping function be adjusted? Will the degree of the polynomial affect the final result? There seems to be no relevant ablation experiment later.

Yes, the parameters of the polynomial tone mapping can be adjusted. Our empirical experiments show that a low polynomial degree results in only mild tonal adjustment, rendering the loss less effective. Conversely, a higher degree tends to over-amplify details and produces images that deviate from the natural image distribution, thus hurting performance. We therefore select a moderate degree as a practical trade-off. We present the ablation results below.

| Method | FID ↓ | LPIPS ↓ | L1 ↓ |
| --- | --- | --- | --- |
| 6 degrees (PixPerfect) | 10.8675 | 0.1414 | 0.0363 |
| 2 degrees | 11.2244 | 0.1431 | 0.0362 |
| 10 degrees | 11.0018 | 0.1407 | 0.0361 |

  4. In Section 4.1, the authors state that they used a data volume of 300M. Are all these data generated through the artifact data production pipeline? Could you explain the data volume ratio of each artifact type?

Thanks for raising this point. Yes, our artifact simulation pipeline is a function that runs efficiently during training, so all the data are generated through the pipeline. The pipeline is implemented by stochastically applying one or more artifact types in sequential order. The types and probabilities of the operations are listed in the following table, and an illustrative sketch of the application logic follows it. We will provide detailed pseudo-code in the revision.

| Artifact Type | Description | Probability |
| --- | --- | --- |
| Content discontinuity | Small misalignments or missing pixels near mask edges | 0.5 |
| Background color augmentation | Non-uniform hue / brightness | 0.8 |
| Foreground color augmentation | Non-uniform / uniform / gradient color augmentation | 0.8 |
| Soft/hard boundary mixing | Mixing soft and hard boundaries to mimic seams | 1.0 |
| Sensor noise / JPEG artifacts / blurring | Adding noise to the foreground and/or background | 0.5 |
| VAE compression artifacts | Adding VAE compression artifacts to the foreground | 0.5 |

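For reference, the sketch below illustrates the sequential, stochastic application logic; the per-artifact operations are hypothetical stubs here, and the probabilities follow the table above.

import random

def _stub(image, mask):
    # Placeholder: each real operation would corrupt the composite in its own way.
    return image, mask

# Hypothetical per-artifact operations (all stubs in this sketch).
add_content_discontinuity = _stub  # misalignments / missing pixels near mask edges
shift_background_color = _stub     # non-uniform hue / brightness
shift_foreground_color = _stub     # non-uniform / uniform / gradient color augmentation
mix_soft_hard_boundary = _stub     # mix soft and hard boundaries to mimic seams
add_noise_jpeg_blur = _stub        # sensor noise / JPEG artifacts / blurring
add_vae_artifacts = _stub          # VAE compression artifacts on the foreground

ARTIFACT_OPS = [
    (0.5, add_content_discontinuity),
    (0.8, shift_background_color),
    (0.8, shift_foreground_color),
    (1.0, mix_soft_hard_boundary),
    (0.5, add_noise_jpeg_blur),
    (0.5, add_vae_artifacts),
]

def simulate_artifacts(image, mask):
    """Stochastically apply the artifact operations in sequential order."""
    for prob, op in ARTIFACT_OPS:
        if random.random() < prob:
            image, mask = op(image, mask)
    return image, mask
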
Comment

Dear Reviewer wXb9,

We noticed that you have not yet replied to the authors' rebuttal, and we are currently not able to see your response. We hope we have adequately addressed all your concerns in our previous discussions. We would be happy to continue the discussion here if there are any remaining points you would like us to clarify.

Best regards, The Authors

Comment

Dear Reviewer wXb9,

Thank you once again for taking the time to review our paper. As the discussion deadline approaches, we would like to kindly draw your attention to our rebuttal and hear if there are any remaining questions or concerns. We will do our best to respond.

If you have any feedback on the paper or our response, we would greatly appreciate it if you could leave a comment. If you feel your concerns have been addressed, we would greatly value your updated reviews on the paper.

Thank you very much for your attention.

Best regards,
The Authors

Comment

Dear Reviewers,

This is a gentle reminder that the author–reviewer discussion period ends tomorrow. If you have not yet engaged in discussion with the authors, please do so promptly. Constructive exchanges at this stage are critical to ensuring that all relevant clarifications and rebuttals are considered.

Please also finalize your ratings and comments after the discussion period concludes, reflecting any changes in your assessment. Your timely participation will help ensure a fair and thorough review process.

Thank you for your efforts and dedication.

Final Decision

This paper introduces PixPerfect, a pixel-level refinement framework for addressing artifacts in local image editing tasks with latent diffusion models. After reviewing the rebuttal, all participating reviewers agree that the authors have adequately addressed concerns regarding the theoretical motivation for the polynomial tone-mapping function, the sample-specific and differentiable nature of the mapping, and the training data composition. Remaining minor points, such as ablations on the dimensionality of the discriminative pixel space and clarification of baseline hyperparameters, are suggested for inclusion in the final version. Although reviewer wXb9 did not participate in the discussion, the initial review also highlighted the novelty of the paper.

Overall, the paper presents a novel and practically relevant solution, with strong empirical results and a well-designed training pipeline. The AC therefore recommends acceptance, encouraging the authors to incorporate additional ablation studies and clarifications in the final version.