PaperHub
Overall score: 6.8 / 10
Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

EditInfinity: Image Editing with Binary-Quantized Generative Models

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Image Editing; Text-to-Image Generation; Autoregressive Models

Reviews and Discussion

Review
Rating: 5

This paper introduces EditInfinity - an editing model built on top of the Infinity T2I model. It offers advantages over diffusion models because a VQ-based generative model has ground-truth intermediate outputs. It learns a learnable prompt and a LoRA for text prompt rectification and image style preservation, and proposes a smoothing method based on a piecewise linear kernel and multi-scale autoregressive blending. It freezes the original Infinity model to keep training costs minimal.

Strengths and Weaknesses

Strengths

  • This model is one of the first few AR models with editing capabilities of this quality - this is the top strength to me.
  • It freezes the Infinity model, so the method is very efficient.
  • Some improvements for alignment and style preservation through learnable prompts and LoRA.
  • The blended multi-scale AR editing is interesting.
  • Results are good based on benchmark scores and the user study.

Weaknesses

  • It misses comparisons with some top closed-source editing models such as GPT-4o and Gemini.
  • The ablation quality could be improved for multi-scale autoregressive editing. It only has one visualization example, and it is difficult to know whether it is cherry-picked.
  • <minor> It would be better to have a definition of all symbols in Figures 2 and 3, as well as the training strategy (stage 1, stage 2, ...).

Questions

  1. Please add more comparisons with top models like GPT-4o or Gemini.
  2. A better ablation study for multi-scale autoregressive editing instead of a single visualization example.
  3. A better write-up for method Section 4, including (a) better definition of symbols, (b) the training strategy (stage 1, stage 2, ...).

Limitations

yes

Final Justification

After reading all rebuttals and the other reviews, I am leaning between weak accept and accept. VQ-based generative models are quite behind diffusion ones, and this work achieves solid results across multiple editing tasks compared to diffusion-based models. I think this work can be a good milestone towards VQ-based editing work.

Formatting Issues

No formatting issues.

Author Response

Thank you for your insightful comments. We respond to your concerns point by point as follows.

Main Comment

Q1: Please add more comparisons with top models like gpt-4o or Gemini.

A1: We sincerely thank the reviewer for pointing out this valuable concern. As shown in Table 1 below, we provide a comparison with the closed-source Gemini, which serves as a strong commercial baseline. Due to time constraints, we are not able to include results for GPT-4o. Nonetheless, we believe the comparison with Gemini provides a representative and meaningful reference point for assessing our method. As shown in Table 1, the thoughtful design of our method enables it to outperform even the closed-source Gemini.

Table 1: Quantitative comparison on PIE-Bench, including the closed-source Gemini.

| Method | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ | IR(×10)↑ |
|---|---|---|---|---|---|---|---|
| P2P | 17.87 | 208.80 | 219.88 | 71.14 | 25.01 | 22.44 | 0.017 |
| MasaCtrl | 22.17 | 106.62 | 86.97 | 79.67 | 23.96 | 21.16 | -1.66 |
| P2P-Zero | 20.44 | 172.22 | 144.12 | 74.67 | 22.80 | 20.54 | -6.59 |
| NTI | 27.03 | 60.67 | 35.86 | 84.11 | 24.75 | 21.86 | 2.77 |
| PnP-Inv | 22.46 | 106.06 | 80.45 | 79.68 | 25.41 | 22.62 | 4.17 |
| NP | 26.21 | 69.01 | 39.73 | 83.40 | 24.61 | 21.87 | 2.42 |
| StableFlow | 21.64 | 92.28 | 115.21 | 84.94 | 24.65 | 21.70 | 1.88 |
| RF-Edit | 23.22 | 131.18 | 75.00 | 81.44 | 25.22 | 22.40 | 5.18 |
| Gemini | 23.22 | 105.17 | 188.63 | 81.10 | 25.28 | 22.28 | 5.30 |
| EditInfinity (Ours) | 27.95 | 33.08 | 24.27 | 92.12 | 26.41 | 23.47 | 5.88 |

Q2: Better ablation study for multi-scale autoregressive editing instead of putting one visualization example.

A2: We thank the reviewer for this insightful question. Due to space limitations in the main paper, this has been discussed in Table 4 of the supplementary material.

Table 4: Quantitative results of multi-scale autoregressive editing.

| Blend | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ | IR(×10)↑ |
|---|---|---|---|---|---|---|---|
| NAR | 25.50 | 42.59 | 38.39 | 91.00 | 25.98 | 23.64 | 3.54 |
| AR | 28.50 | 31.58 | 22.94 | 92.36 | 26.22 | 23.99 | 5.39 |

By blending source tokens at each scale in an autoregressive (AR) manner, our method provides continuous guidance for editing-region generation at subsequent scales. In contrast, the non-autoregressive (NAR) approach blends source tokens only at the end of each scale, without influencing the token generation process at the next scale. This leads to incoherent transitions and visually inconsistent edits, as illustrated in Figure 9 of the main paper and Figure 2 of the supplementary material. Table 4 here further supports this observation: AR consistently outperforms NAR in both background preservation and text alignment. These results highlight the necessity of autoregressive guidance for achieving harmonious and realistic edits.
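For concreteness, here is a minimal pseudocode sketch (not the authors' implementation) contrasting the two strategies. `predict_scale`, `update_context`, and `decode` are hypothetical interfaces of the frozen generator, `R_sou[k]` are the source tokens recovered by inversion at scale k, and `G[k]` is the smoothing kernel resized to scale k.

```python
def edit_ar(model, R_sou, G, K):
    """AR blending: blend at every scale so that blended tokens guide the next scale."""
    context = model.init_context()
    blended = []
    for k in range(K):
        R_tar_k = model.predict_scale(context, k)            # target tokens at scale k
        E_k = R_tar_k * (1 - G[k]) + R_sou[k] * G[k]          # blend with source tokens
        context = model.update_context(context, E_k)          # blended tokens condition scale k + 1
        blended.append(E_k)
    return model.decode(blended)


def edit_nar(model, R_sou, G, K):
    """NAR blending: generate all scales first, then blend; no guidance across scales."""
    context = model.init_context()
    R_tar = []
    for k in range(K):
        R_tar_k = model.predict_scale(context, k)
        context = model.update_context(context, R_tar_k)      # unblended tokens condition scale k + 1
        R_tar.append(R_tar_k)
    blended = [R_tar[k] * (1 - G[k]) + R_sou[k] * G[k] for k in range(K)]
    return model.decode(blended)
```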

Q3: Better write-up for method section 4, including a/ better definition of symbols, b/ training strategy (stage 1, stage 2, ....)

A3: We sincerely thank the reviewer for pointing out the need for clearer symbol definitions in Figures 2 and 3, as well as a more structured presentation of the training strategy in Section 4.

(a) We have added dedicated legend areas to Figures 2 and 3 to explain the meaning of all relevant symbols, such as $t_{sou}$, $t_{ins}$, $t_{l}$, $R_{1...K}^{sou}$, $R_{1...K}^{inv}$, $R_{1...K}^{tar}$, $E_{1...K}^{tar}$, $\mathcal{L}_{inv}$, $\mathcal{G}$, etc. As the rebuttal does not support figures, we are unable to show the updated figures here.

(b) We have clearly annotated Stage 1 and Stage 2 in Figure 2 and provided a more precise and structured definition of both stages in Section 4, as outlined below.

  • Image inversion stage 1: textual prompting rectification.
  • Image inversion stage 2: image style preservation.

We greatly appreciate your suggestions, which will help improve the clarity and rigor of our method section.

Comment

Thank you for the thoughtful rebuttal and the authors’ detailed responses. The comparison with Gemini strengthens the validity of the results. This is an excellent contribution to an underexplored area in VQ-based editing, and I believe it will attract interest from researchers in the field. I will maintain my positive score.

Review
Rating: 4

EditInfinity introduces a novel text-driven image editing method that leverages binary-quantized generative models as an alternative to traditional diffusion-based approaches. By utilizing discrete token representations, the method enables precise image inversion with exact supervision and avoids the approximation errors typical of diffusion inversion. The authors enhance this process through a learnable prompt mechanism and LoRA-based style preservation. For the editing phase, they propose a holistic smoothing strategy that blends source and target tokens using a piecewise linear kernel, allowing for seamless and localized modifications. Evaluated on the PIE-Bench benchmark, EditInfinity achieves state-of-the-art results in both background preservation and semantic alignment, significantly outperforming diffusion-based baselines across a variety of editing tasks.

Strengths and Weaknesses

Strengths

  1. Clear Motivation and Novelty: The paper addresses limitations of diffusion-based editing, especially inversion error, and innovatively adapts a VQ-based model, which is a less explored direction.
  2. Thorough Evaluation: The authors conduct extensive quantitative and qualitative experiments, including a user study, to demonstrate improvements across various editing tasks. EditInfinity outperforms existing methods in key metrics, especially in preserving unedited regions while aligning edits with prompts.

Weaknesses

  1. Structured Editing Setup: The method heavily depends on well-aligned source-target prompt pairs and user-provided region masks. This reliance on structured inputs may limit its generalizability to open-ended or ambiguous editing tasks where such precise supervision is unavailable.
  2. Computation Cost: The framework involves several non-trivial steps, including prompt optimization, LoRA fine-tuning, and multi-scale autoregressive token generation. This complexity could hinder real-time applications and make the method less practical.
  3. Manual Design: The framework includes several manually crafted components, such as user-defined editing masks and a fixed smoothing kernel. These elements introduce human priors into the pipeline, and it remains unclear how robust the approach is to noisy inputs or complex, less structured scenes.

Questions

  1. How well does the method handle vague or high-level prompts that lack explicit object references or structure?
  2. The inversion and editing stages seem computationally demanding. What is the latency for a single edit?
  3. How is the edited region determined in practical scenarios? Can the method infer edit regions automatically based on differences between source and target prompts, or does it rely entirely on manual mask annotation?

Limitations

Please refer to the weaknesses section.

Final Justification

Thanks to the authors for the clarification and effort; I have decided to increase my rating. However, the method could be further improved with a more flexible setting and better efficiency.

Formatting Issues

None

Author Response

We sincerely appreciate your thoughtful and constructive feedback. Below, we respond to each of your comments in detail and have incorporated the corresponding revisions into the final version of the paper.

Main Comment

Q1: How well does the method handle vague or high-level prompts that lack explicit object references or structure?

A1: We sincerely thank you for highlighting this concern. We provide a detailed response from the following two perspectives: 'Prompt Understanding' and 'Open-ended Editing Tasks'.

(a) Prompt Understanding. Our method is designed to operate with known editing regions and does not focus on prompt understanding. In scenarios involving vague or high-level prompts without clear object references, external multimodal large language models (MLLMs) may be integrated to enhance prompt understanding.

(b) Open-ended Editing Tasks. Given that PIE-Bench contains nearly 50% natural images, it already serves as a comprehensive benchmark for assessing our method. However, to specifically evaluate complex scenes, we select 20 MagicBrush images (due to time constraints) filtered by GPT-4o, comprising five samples each with 2, 3, 4, and 5 primary objects. Table 1 demonstrates the superiority of our method in handling complex scenes.

Table 1: Quantitative results on the complex scene images from MagicBrush.

| Method | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ |
|---|---|---|---|---|---|---|
| NTI | 8.81 | 452.03 | 1380.48 | 39.63 | 19.90 | 16.83 |
| RF-Edit | 26.00 | 121.84 | 33.98 | 84.73 | 24.13 | 18.29 |
| Ours | 31.23 | 24.30 | 9.90 | 91.70 | 24.19 | 20.07 |

Note: Due to time constraints, we select representative methods with strong performance from both diffusion U-Net-based models and diffusion transformer-based models for comparison.

Q2: The inversion and editing stages seem computationally demanding. What is the latency for a single edit?

A2: We conduct a runtime comparison of our method and other methods on a single NVIDIA L20 GPU, measuring both inversion and editing time, as shown in Table 2 below. A key advantage lies in its efficiency during multiple edits on one image—a common real-world scenario. Once the inversion for a given image is completed, subsequent edits can be performed within 3.64 seconds—over 7× faster than other methods on average (initial inversion time only 4× longer than other methods on average). This design effectively front-loads the computational cost.

Table 2: Runtime comparison of inversion and editing.

| Method | Inversion | Once Editing | Twice Editing |
|---|---|---|---|
| P2P | 14.40s | 10.28s | 10.28s |
| MasaCtrl | 5.19s | 17.45s | 17.45s |
| P2P-Zero | 13.31s | 62.29s | 62.29s |
| NTI | 95.54s | 10.32s | 10.32s |
| PnP-Inv | 8.32s | 9.54s | 9.54s |
| NP | 9.00s | 10.37s | 10.37s |
| StableFlow | 13.85s | 27.20s | 27.20s |
| RF-Edit | 55.48s | 54.07s | 54.07s |
| Ours | 107.06s | 3.64s | 3.64s |

Q3: How is the edited region determined in practical scenarios? Can the method infer edit regions automatically based on differences between source and target prompts, or does it rely entirely on manual mask annotation?

A3: We appreciate this insightful question. Our method assumes the user provides masks. Indeed, this is a well‑established task setting in image editing [1] [2], especially when text alone is insufficient for the precise localization of the user-desired editing region. This challenge of accurately conveying user intent has long been recognized in controllable image generation. To enhance controllability, ControlNet [3] leverages visual priors such as edge maps, while DreamBooth [4] utilizes user-provided images to capture detailed features not easily conveyed by text.

While our method assumes user-provided masks by default, it can also leverage Infinity's cross-attention maps [5] for automatic mask generation without modifying the framework. Specifically, we automatically align the differing words $x$ between the source and target prompts. After completing inversion, we input the source or target prompt containing $x$ into Infinity and extract the cross-attention map corresponding to $x$. A threshold is then applied: values above the threshold are set to 0 (mask foreground), and others to 1 (background). Table 3 shows that our method is not highly sensitive to the source of the mask: strong performance is achieved in both cases. We will include further discussion on this aspect in the final version.
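As an illustration of the thresholding step described above, a minimal sketch is given below (not the authors' code); `cross_attn_map` is assumed to be a 2D array for the differing word $x$, aligned with the token grid, and the threshold value is a placeholder.

```python
import numpy as np

def build_edit_mask(cross_attn_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn a cross-attention map into a binary mask following the convention above:
    values above the threshold become 0 (mask foreground, i.e., the editing region),
    all other values become 1 (background)."""
    a_min, a_max = cross_attn_map.min(), cross_attn_map.max()
    attn = (cross_attn_map - a_min) / (a_max - a_min + 1e-8)  # normalize to [0, 1]
    return np.where(attn > threshold, 0.0, 1.0)
```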

Table 3: Quantitative results on the random class of PIE-Bench. Ours-u denotes user-provided masks; Ours-c denotes cross-attention masks. Best and second-best results are shown in bold and italics.

| Method | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ | IR(×10)↑ |
|---|---|---|---|---|---|---|---|
| P2P | 18.81 | 197.11 | 197.69 | 73.68 | 25.10 | 22.98 | 0.29 |
| MasaCtrl | 23.36 | 95.45 | 77.63 | 81.88 | 23.30 | 20.92 | -3.82 |
| P2P-Zero | 20.92 | 161.28 | 137.64 | 77.02 | 22.89 | 21.09 | -5.71 |
| NTI | 28.08 | 57.94 | 36.10 | 85.17 | 24.71 | 22.51 | 3.63 |
| PnP-Inv | 23.60 | 103.12 | 72.77 | 81.11 | 25.05 | 22.94 | 3.34 |
| NP | 27.24 | 62.40 | 37.79 | 84.92 | 24.89 | 22.67 | 2.92 |
| StableFlow | 23.68 | 72.77 | 78.61 | 88.11 | 23.17 | 21.21 | 0.76 |
| RF-Edit | 27.26 | 92.27 | 34.46 | 86.67 | 24.65 | 22.03 | 0.61 |
| Ours-c | 27.47 | 44.97 | 46.91 | 90.30 | 25.71 | 23.22 | 5.40 |
| Ours-u | 28.50 | 31.58 | 22.94 | 92.36 | 26.22 | 23.99 | 5.39 |

Note: Due to time constraints, experiments are conducted on the “random class” of PIE-Bench, which includes all editing types and supports efficient and unbiased evaluation of model components.

[1] Nitzan Y, et al. Lazy diffusion transformer for interactive image editing[C]//ECCV. 2024.

[2] Zhuang J, et al. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting[C]//ECCV. 2024.

[3] Zhang L, et al. Adding conditional control to text-to-image diffusion models[C]//ICCV. 2023.

[4] Ruiz N, Li Y, et al. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation[C]//CVPR. 2023.

[5] Amir Hertz, et al. Prompt-to-Prompt Image Editing with Cross-Attention Control[C]//ICLR. 2023.

Comment

Thank you again for your valuable and constructive feedback. There is still some time left in the rebuttal phase. We sincerely want to know if our response has addressed your concerns. If you are willing, we would be happy to continue the discussion to understand your ideas better.

Review
Rating: 4

This paper introduces EditInfinity, which is a text-based image editing method based on the autoregressive text-to-image generation model Infinity, which achieves high-quality generation among available vector-quantized models. The authors motivate the paper from the existing inversion-then-editing approach, where instead of inverting into a noisy latent (like in diffusion models), they initially invert the input text prompt into an extended embedding representation, which is followed by LoRA-based adaptation for the reconstruction of precise details. Given a target text prompt, an input image, and a region of interest specified by a mask, EditInfinity performs editing by manipulating the multi-scale quantized feature maps with a smoothed mask $\mathcal{G}$, then reconstructs the image using these manipulated feature maps. The authors demonstrate the effectiveness of the approach using PIE-Bench, where they compare their approach with diffusion/flow-matching based editing methods. The provided quantitative analyses and qualitative results show the effectiveness of the method.

Strengths and Weaknesses

Strengths

  • The paper tackles the image editing problem using autoregressive image editing methods with a new inversion scheme which mainly depends on approximating the ideal generation prompt, and minor adjustments on the autoregressive model with LoRA-based optimization.
  • The proposed method succeeds in background preservation compared to existing approaches.
  • The proposed editing pipeline provides satisfactory results on edits that can be constrained to a region identified by a mask.
  • Authors provide ablations for the design decisions made for each of the components of the method.

Weaknesses

  • The limitations of the method are discussed very briefly; the authors should provide more qualitative examples of failure cases.
  • The edits that can be performed with the proposed approach seem limited to those constrained by an input mask; providing the whole image as the editing region of interest seems to disturb the background preservation properties and the preservation of the input pose and structure (see Fig. 6, row 9). If there is such a limitation, the authors should explicitly discuss it for transparency.
  • There are missing details in the quantitative evaluation setup and method details. Regarding the method, the proposed approach involves two-stage training per image. To fairly assess whether this approach is feasible as an editing method, the authors should provide a time comparison with the competing approaches in terms of the overhead on top of the base model. In addition, the evaluation requires an input mask for EditInfinity, but the method used to obtain these masks is not described in detail. The authors should provide such details on the evaluation setup.
  • Given the short description of the limitations from the authors, the proposed method succeeds in masked edits. While this is a valid use case for editing, it does not cover a majority of the edits to be performed. As an example, editing facial attributes is not always maskable (e.g., adding a smile). It is not clear whether the method can perform such edits; if it can, there should be examples of such detailed cases. In addition, the editing examples provided are rather simplistic cases where there is a single subject that is centered in the image. This makes the performance on in-the-wild images questionable. The authors should provide examples for such use cases.

Questions

  • How are the masks used in the editing process obtained in the evaluations? Does the method assume that these masks are provided by the user or is there an automated pipeline to achieve this?
  • Is the method able to achieve disentanglement for un-maskable edits such as facial attribute change?
  • What are the failure cases of the method in detail?
  • Is there a timing constraint on the method, as a two-stage optimization is required for editing?

Limitations

  • The limitations are addressed very briefly in a small discussion. I believe that further details should be included, such as failure cases, as discussed in the Questions and Strengths and Weaknesses sections. In addition, the authors are encouraged to include a timing-based comparison due to the two-stage optimization that the method requires (for each editing example). For further comments on suggested improvements over the limitations, please refer to the Questions and Strengths and Weaknesses sections.

Final Justification

The rebuttal addressed my questions regarding the mask sensitivity, identity preservation, performance on in-the-wild images, and evaluation details. After reviewing the author rebuttal and the other reviews, I am leaning towards positive.

Formatting Issues

The paper is adequately formatted.

Author Response

Thanks for your careful and valuable comments. We address your concerns point by point below.

Main Comment

Q1: How are the masks used in the editing process obtained in the evaluations? Does the method assume that these masks are provided by the user or is there an automated pipeline to achieve this?

A1: We appreciate this insightful question. Our method assumes the user provides masks. Indeed, this is a well‑established task setting in image editing [1] [2], especially when text alone is insufficient for the precise localization of the user-desired editing region. This challenge of accurately conveying user intent has long been recognized in controllable image generation. To enhance controllability, ControlNet [3] leverages visual priors such as edge maps, while DreamBooth [4] utilizes user-provided images to capture detailed features not easily conveyed by text.

While our method assumes user-provided masks by default, it can also leverage Infinity's cross-attention maps [5] for automatic mask generation without modifying the framework. Specifically, we automatically align the differing words $x$ between the source and target prompts. After completing inversion, we input the source or target prompt containing $x$ into Infinity and extract the cross-attention map corresponding to $x$. A threshold is then applied: values above the threshold are set to 0 (mask foreground), and others to 1 (background). Table 1 shows that our method is not highly sensitive to the source of the mask: strong performance is achieved in both cases. We will include further discussion on this aspect in the final version.

Table 1: Quantitative results on the random class of PIE-Bench. Ours-u denotes user-provided masks; Ours-c denotes cross-attention masks. Best and second-best results are shown in bold and italics.

| Method | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ | IR(×10)↑ |
|---|---|---|---|---|---|---|---|
| P2P | 18.81 | 197.11 | 197.69 | 73.68 | 25.10 | 22.98 | 0.29 |
| MasaCtrl | 23.36 | 95.45 | 77.63 | 81.88 | 23.30 | 20.92 | -3.82 |
| P2P-Zero | 20.92 | 161.28 | 137.64 | 77.02 | 22.89 | 21.09 | -5.71 |
| NTI | 28.08 | 57.94 | 36.10 | 85.17 | 24.71 | 22.51 | 3.63 |
| PnP-Inv | 23.60 | 103.12 | 72.77 | 81.11 | 25.05 | 22.94 | 3.34 |
| NP | 27.24 | 62.40 | 37.79 | 84.92 | 24.89 | 22.67 | 2.92 |
| StableFlow | 23.68 | 72.77 | 78.61 | 88.11 | 23.17 | 21.21 | 0.76 |
| RF-Edit | 27.26 | 92.27 | 34.46 | 86.67 | 24.65 | 22.03 | 0.61 |
| Ours-c | 27.47 | 44.97 | 46.91 | 90.30 | 25.71 | 23.22 | 5.40 |
| Ours-u | 28.50 | 31.58 | 22.94 | 92.36 | 26.22 | 23.99 | 5.39 |

Note: Due to time constraints, experiments are conducted on the “random class” of PIE-Bench, which includes all editing types and supports efficient and unbiased evaluation of model components.

[1] Nitzan Y, et al. Lazy diffusion transformer for interactive image editing[C]//ECCV. 2024.

[2] Zhuang J, et al. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting[C]//ECCV. 2024.

[3] Zhang L, et al. Adding conditional control to text-to-image diffusion models[C]//ICCV. 2023.

[4] Ruiz N, Li Y, et al. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation[C]//CVPR. 2023.

[5] Amir Hertz, et al. Prompt-to-Prompt Image Editing with Cross-Attention Control[C]//ICLR. 2023.

Q2: Is the method able to achieve disentanglement for un-maskable edits such as facial attribute change?

A2: We thank the reviewer for this insightful question.

(a) Facial attribute change. Yes, our method can realize facial attribute changes. PIE-Bench, used for evaluation in the main paper, is a comprehensive image editing benchmark covering natural human-centric scenarios. We use GPT-4o to filter editing instructions, resulting in 16 representative examples such as 'change the expression from laughing to angry' and 'add the age of the woman'. Our method performs favorably on these cases. Please refer to Table 1 in the main paper for qualitative results, given image upload constraints.

|-- random
|-- change_object
     |-- natural*
          |-- animal
          |-- human*
          |-- indoor
          |-- outdoor
     |-- artificial
          |-- ...
|-- add_object
     |-- ...
|-- delete_object
     |-- ...
|-- change_attribute_content
     |-- ...
|-- change_attribute_pose
     |-- ...
|-- change_attribute_color
     |-- ...
|-- change_attribute_material
     |-- ...
|-- change_background
     |-- ...
|-- change_style
     |-- ...

Furthermore, we randomly select 20 face images from FFHQ (due to time constraints) to perform unmasked facial attribute changes, including age, expression, skin tone, and more. Since the setting is unmasked, there is no background to preserve, and thus standard metrics that rely on background consistency are not applicable. In addition to retaining the Whole metric (CLIP score between the entire edited image and the target prompt), we introduce ArcFace [6] for evaluating identity preservation, and CLIP-I for measuring similarity between the source and edited images. The results are reported in Table 2 below, which shows that our method achieves the best performance compared to others. Thanks to our proposed image inversion algorithm, which effectively learns the generative trajectory of the source image, our method can accomplish the intended edits.
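For clarity on how these scores can be computed, below is a minimal sketch (shown before Table 2) assuming the identity embeddings (e.g., from an ArcFace model) and CLIP image embeddings have already been extracted for the source and edited images; the function names are placeholders of ours.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def face_edit_scores(src_id_emb, edit_id_emb, src_clip_emb, edit_clip_emb):
    """ArcFace: identity similarity between source and edited faces.
    CLIP-I: image similarity between source and edited images."""
    return {
        "ArcFace": cosine_sim(src_id_emb, edit_id_emb),
        "CLIP-I": cosine_sim(src_clip_emb, edit_clip_emb),
    }
```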

Table 2: Quantitative results on facial images from FFHQ.

| Method | ArcFace↑ | CLIP-I↑ | Whole↑ |
|---|---|---|---|
| NTI | 0.56 | 0.83 | 23.67 |
| RF-Edit | 0.61 | 0.79 | 23.54 |
| Ours | 0.63 | 0.86 | 24.82 |

Note: Due to time constraints, we select representative methods with strong performance from both diffusion U-Net-based models and diffusion transformer-based models for comparison.

[6] Jiankang Deng, et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition[C]//CVPR. 2019.

(b) In-the-wild images. Given that PIE-Bench contains nearly 50% natural images, it serves as a comprehensive benchmark for assessing our method. However, to specifically evaluate complex scenes, we select 20 MagicBrush images (due to time constraints) filtered by GPT-4o, comprising five samples each with 2, 3, 4, and 5 primary objects. Table 3 demonstrates the superiority of our method in handling complex scenes.

Table 3: Quantitative results on the complex scene images from MagicBrush.

| Method | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ |
|---|---|---|---|---|---|---|
| NTI | 8.81 | 452.03 | 1380.48 | 39.63 | 19.90 | 16.83 |
| RF-Edit | 26.00 | 121.84 | 33.98 | 84.73 | 24.13 | 18.29 |
| Ours | 31.23 | 24.30 | 9.90 | 91.70 | 24.19 | 20.07 |

Q3: What are the failure cases of the method in detail?

A3: We appreciate your valuable suggestion. In the final version, we will provide additional failure cases and elaborate on the limitation as follows.

Limitation. While our method demonstrates strong performance across diverse editing tasks, it shows limitations in extreme cases such as style change, where no background needs to be preserved, and the image contains detailed structural patterns. In such cases, the blending between source and target tokens is constrained, which may lead to suboptimal preservation of structural fidelity from the original image. Nonetheless, thanks to our image inversion strategy, which effectively learns the generative trajectory of the source image, our method can still accomplish the intended edits, despite slight structural degradation. In contrast, other methods often fail in such challenging scenarios. For example, as shown in Figure 6, row 9 of the main paper, while other methods are unable to convert the painted bird into a realistic one, our method successfully achieves the style change, with only minor deviation in the bird's head pose.

Q4: Is there a timing constraint on the method, as a two-stage optimization is required for editing?

A4: We conduct a runtime comparison of our method and other methods on a single NVIDIA L20 GPU, measuring both inversion and editing time, as shown in Table 4 below. A key advantage lies in its efficiency during multiple edits on one image—a common real-world scenario. Once the inversion for a given image is completed, subsequent edits can be performed within 3.64 seconds—over 7× faster than other methods on average (initial inversion time only 4× longer than other methods on average). This design effectively front-loads the computational cost.

Table 4: Runtime comparison of inversion and editing.

| Method | Inversion | Once Editing | Twice Editing |
|---|---|---|---|
| P2P | 14.40s | 10.28s | 10.28s |
| MasaCtrl | 5.19s | 17.45s | 17.45s |
| P2P-Zero | 13.31s | 62.29s | 62.29s |
| NTI | 95.54s | 10.32s | 10.32s |
| PnP-Inv | 8.32s | 9.54s | 9.54s |
| NP | 9.00s | 10.37s | 10.37s |
| StableFlow | 13.85s | 27.20s | 27.20s |
| RF-Edit | 55.48s | 54.07s | 54.07s |
| Ours | 107.06s | 3.64s | 3.64s |
Comment

Thanks to the authors for the rebuttal and the experiments reported during the limited time allocated for rebuttals. Overall, my concerns regarding the sensitivity of the mask (with alternative extraction methods), facial attribute preservation during editing, timing constraints, and in-the-wild images are addressed. I appreciate the authors reporting metrics such as ID preservation in this regard. Considering the reviews of the other reviewers and the fact that the paper makes a contribution on a new type of generative model, I am increasing my score.

Review
Rating: 4

This paper proposes a method to adapt a type of pretrained text-conditioned image generative model to the task of image editing. The model in question is Infinity, an autoregressive model on binary residual-quantized tokens in the image scale space. The proposed method first conducts text inversion with learnable language style tokens and then performs editing with soft masking in the generation trajectory.

The proposed method is tested on PIE benchmark with add, delete, and change operations. It shows better image editing capability than other training-free baselines.

Strengths and Weaknesses

Strengths:

  • The idea of using the Infinity model, which provides exact decomposition of images, as a trajectory to reference seems interesting. The proposed masking technique appears to provide good background preservation.
  • Based on the provided visual results, the proposed method can accurately achieve addition, deletion, and change.
  • The cost of adapting the pretrained text to image model is minimal thanks to the use of LoRA.

Weaknesses

  • It is unclear how the text style token t_l is optimized or learned.
  • The blending operation is not described clearly. Since the Infinity model uses quantized tokens, it is unclear how linear blending happens between two token indices. The method seems tailored to the Infinity model; it is unclear whether it can be generalized to other types of image generative models.

Questions

I would like to encourage the authors to provide a more detailed explanation of the optimization of t_l and the blending of quantized tokens.

Also, I would like to see clarification on the experimental results. Are the other baseline methods also using Infinity as the base model? If not, could the worse results for these methods be due to their worse base models?

Limitations

Yes

Final Justification

The authors have addressed my concerns in the response. The method described with the added context seems reasonable and achieves good editing quality. I am recommending accept.

Formatting Issues

N/A

Author Response

Thanks for your comments. We address your concerns point by point below.

Main Comment

Q1: I would like to encourage the authors to provide a more detailed explanation of the optimization of t_l and the blending of quantized tokens.

A1: We appreciate the reviewer's helpful suggestion. We now explain in detail (i) how our method optimizes $t_l$, and (ii) how the blending of quantized tokens is performed. We have also refined the narrative in the main paper to make these components clearer and better integrated.

(a) The optimization of $t_l$:

First, we introduce learnable prompt embeddings $t_l$, whose size is fixed to 20 tokens. Second, we pass both the source prompt $t_{sou}$ and the instruction prompt $t_{ins}$ (i.e., "the language style of this prompt is") through the text encoder of Infinity, $\Psi(\cdot)$, to obtain the text embeddings $\Psi(t_{sou}, t_{ins})$. We then concatenate these embeddings with $t_l$ to form the textual conditioning input for Infinity. Finally, while keeping Infinity's weights frozen, we optimize only $t_l$ with a cross-entropy loss, where the supervision signal comes from the tokens $R_{1...K}^{sou}$ derived from the source image.
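For illustration, a minimal prompt-tuning sketch in the spirit of this description is given below. This is not the authors' code: `infinity`, `text_encoder`, `predict_token_logits`, the embedding dimension, and the optimization hyperparameters are assumptions; the point is that only $t_l$ is trainable while the backbone stays frozen and supervision comes from the source-image tokens.

```python
import torch
import torch.nn.functional as F

def optimize_t_l(infinity, text_encoder, t_sou, t_ins, R_sou_indices,
                 num_tokens=20, dim=2048, steps=200, lr=1e-2):
    """Optimize the learnable prompt embeddings t_l against a frozen backbone."""
    t_l = torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02)   # 20 learnable tokens
    opt = torch.optim.Adam([t_l], lr=lr)

    with torch.no_grad():
        prompt_emb = text_encoder(t_sou, t_ins)   # embeddings of source + instruction prompt

    for _ in range(steps):
        cond = torch.cat([prompt_emb, t_l], dim=0)        # textual conditioning for Infinity
        logits = infinity.predict_token_logits(cond)      # (num_positions, vocab_size)
        loss = F.cross_entropy(logits, R_sou_indices)     # supervise with source-image tokens
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t_l.detach()
```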

(b) The blending of quantized tokens:

(i) The blending operation is performed on the quantized features $\{R_{k}\}_{k=1}^K$, $R_k \in \mathbb{R}^{h_k \times w_k \times d}$, rather than on indices, and is formulated as $E_k^{tar} = \mathrm{Upsample}(R^{tar}_k, (h_K, w_K)) \odot (1-\mathcal{G}) + \mathrm{Upsample}(R^{sou}_k, (h_K, w_K)) \odot \mathcal{G}$.

At each step $k$ of the autoregressive generation, the target token $R_k^{tar}$ is generated conditioned on the concatenation of the target prompt embedding and instruction embedding $\Psi(t_{tar}, t_{ins})$, and the optimized $t_l$. It is then blended with the source tokens $R_k^{sou}$ of the source image under the guidance of the piecewise linear smoothing kernel $\mathcal{G}$. Although this formulation is already provided in Algorithm 1 of the main paper, we appreciate the reviewer's suggestion and will enhance the textual explanation in the final version to improve clarity and emphasis.

(ii) The feasibility of performing linear blending at the quantized feature level is grounded in the fact that the Infinity decoder takes quantized features as input to generate images. As such, performing interpolation in the feature space is both reasonable and commonly adopted in generative modeling [1-3], as it supports smooth and semantically meaningful transitions.
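To make the blending step concrete, here is a minimal sketch of the formula from (i) (not the authors' code). The piecewise linear kernel construction is our own simplified stand-in, a linear ramp obtained by iteratively dilating the binary edit region; only the blending equation follows the formulation above.

```python
import torch
import torch.nn.functional as F

def piecewise_linear_kernel(mask: torch.Tensor, ramp: int = 8) -> torch.Tensor:
    """Toy smoothing kernel G: 0 inside the edit region (mask == 0), 1 far outside,
    with an approximately linear ramp of width `ramp` pixels in between."""
    inside = (1.0 - mask.float())[None, None]                # 1 inside the edit region
    dist = torch.zeros_like(inside)
    grown = inside.clone()
    for _ in range(ramp):
        grown = F.max_pool2d(grown, 3, stride=1, padding=1)  # dilate region by one pixel
        dist = dist + grown
    return (1.0 - dist[0, 0] / ramp).clamp(0.0, 1.0)

def blend_scale(R_tar_k: torch.Tensor, R_sou_k: torch.Tensor,
                G: torch.Tensor, size_K: tuple) -> torch.Tensor:
    """E_k^tar = Upsample(R_tar_k) * (1 - G) + Upsample(R_sou_k) * G, with features of
    shape (d, h_k, w_k) and G of shape (h_K, w_K); G = 1 marks background to keep."""
    up_tar = F.interpolate(R_tar_k[None], size=size_K, mode="bilinear", align_corners=False)[0]
    up_sou = F.interpolate(R_sou_k[None], size=size_K, mode="bilinear", align_corners=False)[0]
    return up_tar * (1.0 - G) + up_sou * G
```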

[1] Alec Radford, et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks[C]//ICLR. 2016.

[2] Kingma D P, Welling M. Auto-encoding variational bayes[C]//ICLR. 2014.

[3] Zhu J Y, et al. Toward multimodal image-to-image translation[C]//NeurIPS. 2017.

Q2: Also, I would like to see clarification on the experimental results. Are the other baseline methods also using Infinity as the base model? If not, could the worse results for these methods be due to their worse base models?

A2: This is an insightful question. Our method is a lightweight framework tailored to Infinity and indeed relies on the generative capacity of the base model. However, this dependency is not unique to our approach. We explicitly present the base models used by all methods in Table 1 below and evaluate their generative performance on the GenEval benchmark [4], one of the most widely used benchmarks for assessing text-to-image generation quality. As shown in Table 2, Infinity achieves overall generative performance comparable to the recently popular FLUX [5], and even underperforms on certain tasks, such as two-object and counting. Nevertheless, our Infinity-based image editing method significantly outperforms FLUX-based approaches like StableFlow and RF-Edit, demonstrating the effectiveness of our method despite the base model not having a clear advantage.

Table 1: Quantitative results on PIE-Bench.

| Method | Base Model | PSNR↑ | LPIPS(×10³)↓ | MSE(×10⁴)↓ | SSIM(×10²)↑ | Whole↑ | Edited↑ | IR(×10)↑ |
|---|---|---|---|---|---|---|---|---|
| P2P | Stable Diffusion v1.4 | 17.87 | 208.80 | 219.88 | 71.14 | 25.01 | 22.44 | 0.017 |
| MasaCtrl | Stable Diffusion v1.4 | 22.17 | 106.62 | 86.97 | 79.67 | 23.96 | 21.16 | -1.66 |
| P2P-Zero | Stable Diffusion v1.4 | 20.44 | 172.22 | 144.12 | 74.67 | 22.80 | 20.54 | -6.59 |
| NTI | Stable Diffusion v1.4 | 27.03 | 60.67 | 35.86 | 84.11 | 24.75 | 21.86 | 2.77 |
| PnP-Inv | Stable Diffusion v1.5 | 22.46 | 106.06 | 80.45 | 79.68 | 25.41 | 22.62 | 4.17 |
| NP | Stable Diffusion v1.4 | 26.21 | 69.01 | 39.73 | 83.40 | 24.61 | 21.87 | 2.42 |
| StableFlow | FLUX.1-dev | 21.64 | 92.28 | 115.21 | 84.94 | 24.65 | 21.70 | 1.88 |
| RF-Edit | FLUX.1-dev | 23.22 | 131.18 | 75.00 | 81.44 | 25.22 | 22.40 | 5.18 |
| EditInfinity (Ours) | Infinity | 27.95 | 33.08 | 24.27 | 92.12 | 26.41 | 23.47 | 5.88 |

Table 2: Evaluation on the GenEval benchmark.

| Base Model | Overall | Single Object | Two Object | Counting | Colors | Position | Attribute Binding |
|---|---|---|---|---|---|---|---|
| Stable Diffusion v1.4 | 0.42 | 0.97 | 0.39 | 0.33 | 0.73 | 0.03 | 0.05 |
| Stable Diffusion v1.5 | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 |
| FLUX.1-dev | 0.66 | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 |
| Infinity | 0.66 | 0.98 | 0.78 | 0.63 | 0.83 | 0.25 | 0.53 |

Note: When evaluating Infinity, we adopt the same evaluation protocol as used for Stable Diffusion v1.4, v1.5, and FLUX.1-dev, i.e., without prompt rewriting.

[4] Dhruba Ghosh, et al. Geneval: An object-focused framework for evaluating text-to-image alignment[C]//NeurIPS. 2024.

[5] Black Forest Labs. Flux. 2024.

Comment

I would like to thank the authors for the detailed responses. The response has addressed my question regarding the method description. The added table for editing results seems to be crucial. Please make sure to include it in the manuscript.

For the prompt embedding, it is crucial to describe the prompt tuning approach as described in the response; otherwise, the audience may not understand what is happening.

Given that my concerns are addressed, I am recommending borderline accept.

Final Decision

This paper introduces EditInfinity, which is a text-based image editing method based on the autoregressive text-to-image generation model Infinity, which achieves high-quality generation among available vector-quantized models. The authors motivate the paper from the existing inversion-then-editing approach, where instead of inverting into a noisy latent (like in diffusion models), they initially invert the input text prompt into an extended embedding representation, which is followed by LoRA-based adaptation for the reconstruction of precise details. Given a target text prompt, an input image, and a region of interest specified by a mask, EditInfinity performs editing by manipulating the multi-scale quantized feature maps with a smoothed mask, then reconstructs the image using these manipulated feature maps. The authors demonstrate the effectiveness of the approach using PIE-Bench, where they compare their approach with diffusion/flow-matching based editing methods. The provided quantitative analyses and qualitative results show the effectiveness of the method.

The reviewers found the work somewhat incremental (the paper extends the Infinity model to editing tasks) but considered that there are some benefits to autoregressive image generation (such as speed) that merit further investigation in this area. Three of the reviewers were initially negative. They raised many concerns, which the authors addressed successfully. After discussion, there were still some concerns about the method's reliance on user-provided masks, a limitation noted by multiple reviewers. This requirement reduces its generalizability. However, since VQ-based approaches remain relatively underexplored, the reviewers believed the paper is an overall positive contribution. In the end, the reviewers were mildly positive towards publication of the paper.