PaperHub
Overall: 5.5 / 10 · Poster · 4 reviewers
Ratings: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference

OpenReview · PDF
Submitted: 2024-05-11 · Updated: 2024-11-06
TL;DR

This paper makes the first attempt to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework.

Abstract

Keywords
diffusion model · image inpainting · human feedback reinforcement learning

Reviews and Discussion

Review
Rating: 5

This paper tries to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework.

Strengths

  1. This paper proposes aligning diffusion models for image inpainting with human preferences by integrating human feedback through reinforcement learning, which improves the quality of generated images.
  2. This paper presents a dataset containing 51,000 inpainted images annotated with human preferences, addressing the lack of evaluation datasets for image inpainting.
  3. The proposed method can be applied to various applications, such as image extension and novel view synthesis, providing visually impressive results.

Weaknesses

  1. There has been a lot of work introducing human preferences into diffusion models, such as Human Preference Score (ICCV 2023), ImageReward (NeurIPS 2023), DPOK (NeurIPS 2023), and D3PO (CVPR 2024).

    a. This paper does not fully elaborate on the work related to diffusion models with human preferences.

    b. It is necessary to supplement the comparison with these methods, including differences in methodology and advantages in experimental results.

  2. A user study is necessary to evaluate the quality of the generated results.

Questions

See Weaknesses.

Limitations

The authors outlined potential future directions in the conclusion.

Author Response

Reviewer G8Cx

Q1. Differences from Other Diffusion Alignment

Our work differs significantly from existing diffusion alignment work. While related works exist in the area of text-to-image generation with human preference, ours is the first to align diffusion-based image inpainting with human preference through reinforcement learning (RL). First, our proposed human-preference inpainting dataset makes this task feasible; without such a dataset with high-quality human-preference labels, undertaking inpainting alignment would be impossible. Technically, we introduce a reward-accuracy-aware weighting strategy into the RL process to accelerate training and boost performance, which has also not been investigated in the mentioned works. We will add further discussion of work related to aligning diffusion models with human preference.

Q2. Experimental Comparisons with your Mentioned Methods

Moreover, to further address the reviewer's concern, we have experimentally compared our method with the mentioned methods. We briefly summarize the implementation of each method below.

(1) Human Preference Score (ICCV 2023) [1] learns a negative prompt that maps the diffusion process to low-quality samples. During inference, the negative prompt is used in classifier-free guidance (CFG) to push the generation trajectory away from low-quality samples.

(2) ImageReward (NeurIPS 2023) [2] trains a reward model and then uses it as a loss to optimize the diffusion model end-to-end together with a reconstruction loss. We also conduct an ablation study on the reward training strategy in Table 3 of our paper: our method employs a regression-driven training strategy, while ImageReward [2] uses a classification-driven one (a minimal illustrative sketch of these two objectives follows this list).

(3) DPOK (NeurIPS 2023) [3] optimizes the whole trajectory of the reverse diffusion process and uses a KL-divergence penalty as regularization to avoid a large distribution shift.

(4) D3PO (CVPR 2024) [4] adopts the RL strategy of direct preference optimization (DPO) [5], directly optimizing the model on reward-labeled data to decrease the probability of low-quality samples and increase that of high-quality samples.
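For clarity, the sketch below illustrates the distinction mentioned in (2) between a regression-driven and a classification-driven (pairwise, Bradley-Terry-style) reward training objective. This is a generic illustration under our own assumptions (a scalar-output `reward_model` in PyTorch), not the exact training code of either method.

```python
# Illustrative sketch only: two common ways to train a reward model.
import torch
import torch.nn.functional as F

def regression_reward_loss(reward_model, images, human_scores):
    # Regression-driven: fit the predicted reward to the annotated scalar score.
    pred = reward_model(images).squeeze(-1)      # (B,)
    return F.mse_loss(pred, human_scores)

def bradley_terry_reward_loss(reward_model, preferred, rejected):
    # Classification-driven: the preferred sample should receive a higher reward.
    r_pos = reward_model(preferred).squeeze(-1)  # (B,)
    r_neg = reward_model(rejected).squeeze(-1)   # (B,)
    return -F.logsigmoid(r_pos - r_neg).mean()   # pairwise ranking loss
```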

The experimental results shown in the table below further validate the advantage of the proposed method.

| Methods (all metrics: larger is better) | WinRate | T2I | Reward | CLIP | BLIP | CA |
|---|---|---|---|---|---|---|
| (1) Human Preference Score | 58.03% | -16.67 | 0.26 | 0.20 | 0.47 | 0.40 |
| (2) ImageReward | 65.10% | 13.12 | 0.29 | 0.22 | 0.48 | 0.44 |
| (3) DPOK (KL weight=0.1) | 64.59% | 11.43 | 0.32 | 0.21 | 0.48 | 0.43 |
| (3) DPOK (KL weight=1.0) | 62.67% | 9.36 | 0.30 | 0.21 | 0.48 | 0.43 |
| (4) D3PO | 59.74% | -19.20 | 0.26 | 0.21 | 0.46 | 0.41 |
| PrefPaint (Ours) | 71.27% | 21.53 | 0.37 | 0.23 | 0.49 | 0.45 |

Moreover, the authors will provide further discussion in the final version of the paper.

User Study

First, we clarify that the labeling of our dataset was provided by a professional data labeling company, with all annotators being professionals trained on similar tasks. The reward scoring is highly accurate and can assess inpainting results under criteria based on human preference. We also provide some examples scored by our reward model in Fig. S4 of the uploaded one-page PDF file; the assessment of inpainted images is closely aligned with human preference. Moreover, to alleviate your concerns, we have also carried out a user study to evaluate the superiority of our method. We randomly selected about 130 groups of results and conducted a user study involving 10 users, as detailed below. The WinRate map and a diagram of the user-study platform can be found in Fig. S3 and Fig. S5 of the uploaded one-page PDF file.

| Methods | Kandinsky | Palette | SDv2.1 | Runway (BaseModel) | PrefPaint (Ours) |
|---|---|---|---|---|---|
| Rank | 4.74 | 3.79 | 3.11 | 2.26 | 1.10 |
| Var | 0.35 | 0.49 | 0.60 | 0.48 | 0.24 |
Comment

Dear Reviewer G8Cx,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. We understand that you may be reviewing multiple papers and have a busy schedule. In our previous response, we made sure to address your remaining concerns directly and thoroughly. We eagerly await your further feedback on our responses.

Best regards,

The Authors

Review
Rating: 6

This paper is the first to use reinforcement learning in diffusion-based image synthesis. This significantly improves the quality since image synthesis is usually a one-to-many mapping, which may not be suitable for conventional learning methods. To generate reward functions for RL, this paper also gathers and releases a new dataset on image synthesis with human preference annotation. Additionally, the paper computes the theoretical upper bound on the error of the reward for more efficient RL training.

Strengths

  1. Release a new dataset for image inpainting and outpainting benchmarks with human preference annotation. The generation process of the dataset is well-documented.
  2. First incorporate RL with diffusion-based method on image inpainting and outpainting.
  3. The authors provide extensive experiments, supplementary materials with results on multiple datasets, and a project page to demonstrate their results.

Weaknesses

  1. Minor question on the motivation of amplification factor.
  2. Possible missing references.

Questions

  1. In L152, how often do these "relatively large errors" occur? Will the relatively smaller errors balance out these large errors? Since the amplification $\gamma$ greatly improves the performance, does this suggest that the errors occur frequently, so that a static $\gamma$ does not handle them well? From Fig. 2, it is shown that the range of $\|z\|_{V^{-1}}$ is [0, 1]. Does the amplification $\gamma$ for $k=0.05$ and $b=0.7$ have a range of [1.9, 2.0] (which seems narrow and thus static)?

  2. How does this paper compare to [1]? [1] leverages the BERT model for reward scoring. How do you justify the choice of human evaluation instead of foundation models for the reward function? Do you have ablation experiments to back up the decision? [1] Black, Kevin, et al. "Training diffusion models with reinforcement learning." arXiv preprint arXiv:2305.13301 (2023).

Limitations

N/A

Author Response

Reviewer yfx6

Q1. Error of Reward Model & Amplification Factor

We computed statistics of the reward estimation errors; the results are shown in Fig. S1 of the uploaded one-page PDF file and briefly summarized in the table below. Although the proportion of very-large-error samples is not large, the incremental performance of our method may lie in a more suitable choice of the amplification function, as evidenced by the table in our response to Q2 from Reviewer Hftn.

We apologize for the confusion about the reward range. The x-axis of Fig. 2 of the manuscript is normalized; the range of $\|z\|_{\mathbf{V}^{-1}}$ actually lies approximately between 1.5 and 14. Thus, the amplification factor ranges approximately between $e^{0}=1$ and $e^{0.625}\approx 1.87$ (a worked check follows the table below).

| Reward Error | [0, 0.25) | [0.25, 0.5) | [0.5, 0.75) | [1.0, 1.25) | [1.25, 1.5) | [1.5, 1.75) | [1.75, +inf) |
|---|---|---|---|---|---|---|---|
| Percentage | 43.99% | 30.57% | 16.24% | 6.53% | 1.97% | 0.59% | 0.06% |
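For completeness, the quoted amplification-factor range follows directly from the endpoint values of $\|z\|_{\mathbf{V}^{-1}}$ stated above:

```latex
% Endpoints quoted above: x = \|z\|_{V^{-1}} \in [1.5, 14], with k = 0.05, b = 0.7
\gamma_{\min} = e^{-0.05 \cdot 14 + 0.7} = e^{0} = 1,
\qquad
\gamma_{\max} = e^{-0.05 \cdot 1.5 + 0.7} = e^{0.625} \approx 1.87 .
```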

Q2. Reward Model

We have experimentally examined the consistency between the proposed reward model and the BERT score [1]. As shown in Fig. S2 of the uploaded one-page PDF file, a deeper color indicates a larger error in the BERT score. We can see that there are plenty of samples whose BERT scores are unrelated to, or even contradictory with, the human labels, shown in the upper-left and lower-right regions. We also provide some examples scored by our reward model in Fig. S4 of the uploaded one-page PDF file; the assessment of inpainted images is closely aligned with human preference. Thus, our proposed human-labeled dataset is necessary for aligning inpainting models. Note that our dataset is labeled by a professional data annotation company. We will add more discussion of [1] in the final version. Thanks for the advice.

[1] Black, Kevin, et al. "Training diffusion models with reinforcement learning." arXiv preprint arXiv:2305.13301 (2023).

Comment

Dear Reviewer yfx6,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript and your favorable recommendation. We understand that you may be reviewing multiple papers and have a busy schedule. In our previous response, we made sure to address your remaining concerns directly and thoroughly. We eagerly await your further feedback on our responses.

Best regards,

The Authors

Comment

The authors have addressed all of my concerns.

Comment

The authors sincerely appreciate your feedback.

Review
Rating: 6

This paper makes the first attempt to align diffusion models for image inpainting with human preferences by integrating human feedback through reinforcement learning. The authors theoretically deduce the accuracy bound of the reward model, modulating the refinement process of the diffusion model to robustly improve both efficacy and efficiency. Additionally, they construct a dataset containing 51,000 inpainted images annotated with human preferences.

Strengths

  1. This paper provides theoretical insights into the upper bound of the error of the reward model, ensuring the reliability and accuracy of the reinforcement learning process.
  2. This paper presents a dataset for image inpainting tasks that incorporates human aesthetic preferences. This dataset will facilitate further research into the evaluation of image inpainting models and aid in generating high-quality images that better align with human aesthetics.

Weaknesses

  1. Since reward models and feedback learning trained based on human preferences have been introduced in Text-to-Image Generation (T2I), it seems they are simply applying the same process from T2I to Image Inpainting. This work should clarify the distinctions from T2I to highlight its innovation.
  2. The core contribution of this paper is Equation 11; however, its rationale, implementation details, and effectiveness lack explanation and validation. First, the selection of the hyperparameters $k$ and $b$ is not addressed, despite their importance. Second, there is a lack of detailed explanation from Equation 11 to its implementation in the model. Third, the choice of the exponential function $e$ instead of other functions needs to be validated. Finally, the description of the ablation study on $k$ in Table 3 is unclear, making it difficult to assess its effectiveness. There is also a lack of ablation analysis for $b$ and $e$.
  3. The weights assigned to the three scores in the manuscript are [0.15, 0.15, 0.7]. Why are these scores combined into a weighted score instead of being used individually? Furthermore, how were these weights determined? The basis for this weighting is unclear.
  4. What does the "Rank" metric in Table 2 signify? It lacks explanation.
  5. There are details missing regarding the dataset annotation, such as the number of annotators employed, whether training was required for them, and the amount of time each individual spent on the annotation task.

Questions

Please refer to the weaknesses.

Limitations

The authors discuss future work in the last section.

Author Response

Reviewer Hftn

Q1. Differences from T2I Methods

We confirm that our method is NOT simply an application of the text-to-image alignment scheme to image inpainting. The technical novelty of the proposed method primarily lies in modeling the reward accuracy and adaptively controlling the reward regularization strength, which has not been investigated by T2I methods before. Moreover, the ablation studies presented in the right part of Table 3 in the main paper validate the superiority of our design, which achieves a +106% speed-up while maintaining high performance (even increasing WinRate by up to 1.7%). We will add further discussion comparing our method with current T2I methods. More importantly, this is the first work to explore the human preference alignment problem of diffusion models on the task of image inpainting. To make this task feasible, we built the first dataset in which inpainted images are labeled with human preference through a professional data annotation company; without such a dataset, it would be almost impossible to conduct this work, and we believe it will advance this field. Our dataset also holds significant research value and application potential in several related areas, including image quality assessment and other tasks involving human preference. We also provide some examples scored by our reward model in Fig. S4 of the uploaded one-page PDF file; the assessment of inpainted images is closely aligned with human preference.

Q2. More Explanations of Eq. (11)

Selection of Hyper-parameters. The critical aspect of the parameterization is the range of the weight factor $\gamma$. In the left part of Table 3 (e) and (f) of our manuscript, we experimentally validate the two settings $k=0.05, b=0.7$ and $k=0.065, b=0.9$; the results show that our selection is better than the alternative. Moreover, please refer to the third part of this response, where we investigate different hyper-parameters and weighting functions in detail.

Implementation details. We first prepare the matrix $\mathbf{V}^{-1}$ by computing $\mathbf{V} = \mathbf{Z}^T\mathbf{Z} + \lambda \mathbf{I}$, where $\mathbf{Z}$ is the concatenation of the feature embeddings before the last MLP layer of the reward model over all training samples. During diffusion training, for each sample we obtain a feature $\boldsymbol{z}$ from the reward model and then compute $\gamma$, which serves as the weight factor of the final RL loss and adaptively adjusts the magnitude of the gradient. Thanks for the comment; we will include the corresponding content in the final version.
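For concreteness, a minimal sketch of this weighting step is given below. This is our own illustrative code under the notation above (feature matrix $\mathbf{Z}$, per-sample feature $\boldsymbol{z}$, hyper-parameters $k$, $b$, $\lambda$), not the released implementation; the RL loss itself is left abstract.

```python
# Minimal sketch of the reward-accuracy-aware weight gamma = exp(-k * ||z||_{V^-1} + b).
import torch

def build_V_inv(Z: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Z: (N, d) feature embeddings taken before the reward model's last MLP layer,
    # collected over the reward-model training set.
    d = Z.shape[1]
    V = Z.T @ Z + lam * torch.eye(d, device=Z.device)   # V = Z^T Z + lambda * I
    return torch.linalg.inv(V)

def reward_confidence_weight(z: torch.Tensor, V_inv: torch.Tensor,
                             k: float = 0.05, b: float = 0.7) -> torch.Tensor:
    # ||z||_{V^{-1}} = sqrt(z^T V^{-1} z); a larger value indicates a sample farther
    # from the reward model's training distribution (a less reliable reward estimate).
    x = torch.sqrt(torch.einsum('bd,de,be->b', z, V_inv, z))
    return torch.exp(-k * x + b)                         # gamma, one value per sample

# During RL fine-tuning, gamma scales the per-sample policy-gradient loss, e.g.:
#   loss = (reward_confidence_weight(z, V_inv) * rl_loss_per_sample).mean()
```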

Other Functions & Ablation Study. To further resolve your concern, we have also parameterized $\gamma$ with a linear function, as shown in the table below; specifically, $\gamma = -1.9\,\|\boldsymbol{z}\|_{\mathbf{V}^{-1}} + 0.06$. The experimental results indicate that the exponential function provides the best regularization effect, whereas the linear function and a static constant do not fully exploit the regularization effect of the reward upper bound.

| Function ($x = \|\boldsymbol{z}\|_{\mathbf{V}^{-1}}$) | Range | k | b | WinRate | Reward |
|---|---|---|---|---|---|
| $\gamma = e^{-kx+b}$ (Ours) | [1.00, 1.87] | 0.05 | 0.7 | 71.27% | 0.37 |
| $\gamma = e^{-kx+b}$ | [1.00, 2.23] | 0.065 | 0.9 | 70.47% | 0.36 |
| $\gamma = e^{-kx} + b_1/x + b_2$ | [1.10, 1.78] | 0.10 | [0.1, 0.85] | 70.07% | 0.37 |
| $\gamma = e^{-kx} + b_1/x + b_2$ | [1.10, 2.22] | 0.12 | [0.8, 0.85] | 69.95% | 0.36 |
| $\gamma = -kx + b$ | [1.10, 1.81] | 1.9 | 0.06 | 60.28% | 0.28 |
| $\gamma = b$ | - | - | 1.43 | 65.95% | 0.34 |

Q3. Weight Range

Since RL requires a single reward value to guide the training direction, we must combine the scores to decide whether to push the diffusion model away from a reconstruction sample or pull it closer. The first two scores focus on partial aspects, such as structure and texture, respectively, whereas the third score reflects the overall impression and is more comprehensive; it is therefore reasonable to assign lower weights to the first two metrics and a higher weight to the last one. Additionally, to further address your concern, we asked the labeling supplier company to directly rank the samples, in order to validate the consistency between our weighted combination of scores and the human ranking. Specifically, several reconstructions were ranked both by human experts and by the weighted scoring schemes, and we then calculated the consistency between the human ranking and the weighted-score ranking. The experimental results are shown in the following table (a small illustrative sketch of such a consistency computation follows the table). The proposed weighting scheme is validated by its high consistency with the human ranking. The scoring system is shown in Fig. S6 of the uploaded one-page PDF file.

| Weighted Score | Rank1 (%) | Rank2 (%) | Rank3 (%) | All (%) |
|---|---|---|---|---|
| 0.15, 0.15, 0.7 (Ours) | 0.93 | 0.92 | 0.93 | 0.93 |
| 0.10, 0.10, 0.8 | 0.93 | 0.92 | 0.93 | 0.93 |
| 0.20, 0.20, 0.60 | 0.92 | 0.92 | 0.92 | 0.92 |
| 0.30, 0.30, 0.40 | 0.87 | 0.86 | 0.88 | 0.87 |
| 0.50, 0.40, 0.10 | 0.85 | 0.87 | 0.84 | 0.85 |
| 0.80, 0.10, 0.10 | 0.84 | 0.85 | 0.85 | 0.85 |
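As an illustration only: the exact consistency metric is not spelled out above, so the sketch below assumes pairwise ordering agreement between the human ranking and the weighted-score ranking; the weight vector and score names are the ones discussed in this response.

```python
# Illustrative sketch: weighted preference score and a rank-consistency check.
import numpy as np

def weighted_score(s1, s2, s3, w=(0.15, 0.15, 0.7)):
    # s1, s2: partial-aspect scores (e.g., structure, texture); s3: overall impression.
    return w[0] * np.asarray(s1) + w[1] * np.asarray(s2) + w[2] * np.asarray(s3)

def pairwise_consistency(human_rank, scores):
    # Fraction of sample pairs ordered identically by the human ranking
    # (lower rank = better) and by the weighted score (higher = better).
    human_rank, scores = np.asarray(human_rank), np.asarray(scores)
    agree, total = 0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if human_rank[i] == human_rank[j]:
                continue
            total += 1
            agree += (human_rank[i] < human_rank[j]) == (scores[i] > scores[j])
    return agree / max(total, 1)
```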

Q4. Meaning of the Metric "Rank"

“Rank” in Table 2 indicates an average order (from the best to the worst) of all metrics. Thus, a lower rank indicates better performance.

Q5. Dataset Details

Note that the labeling is provided by a professional data annotation company. All annotators are professionals trained in similar tasks. The annotation time for each pair (3 samples) averages about 2 minutes. A total of 24 annotators were employed. We will provide more details in the final version.

Q6. Limitations

We have indeed discussed some limitations in the final paragraph of the conclusion section.

Comment

Dear Reviewer Hftn,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript and your favorable recommendation. We understand that you may be reviewing multiple papers and have a busy schedule. In our previous response, we made sure to address your remaining concerns directly and thoroughly. We eagerly await your further feedback on our responses.

Best regards,

The Authors

Comment

By combining the other reviews and responses, I consider that this work makes contributions in leveraging human preference for diffusion-based image inpainting through reinforcement learning, and the provided dataset also helps related research. Thus I would like to upgrade my rating.

Comment

The authors sincerely appreciate your feedback.

Review
Rating: 5

This paper attempts to align diffusion models for image inpainting with human aesthetic standards through a reinforcement learning framework. To train the model, the paper constructs a dataset containing 51,000 inpainted images annotated with human preferences. Extensive experiments on inpainting comparisons and downstream tasks, such as image extension and 3D reconstruction, are provided.

Strengths

  1. This paper is well presented and easy to read.
  2. This paper provides detailed experiments to validate the effectiveness of the proposed method.
  3. This paper has collected a small-scale dataset of human preferences for image inpainting results, which can be useful for the research field.

Weaknesses

  1. This paper seems to use a common reinforcement learning practice on text-to-image diffusion models. What is the novelty of the proposed method? Why would this practice be better than simply fine-tuning the model on high-quality (human-preferred) inpainting data?

  2. There are missing citations and comparisons with the following methods:

    • A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
    • BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
    • Hd-painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models
  3. Can you provide an ablation of how the training data quantity influences the final results? The training dataset only contains 17,000 images and 51,000 inpainted samples, which is quite small for diffusion model training.

Questions

See weaknesses.

Limitations

The authors have included limitations in the paper, but I would recommend that they discuss the limitations and negative societal impact in more depth.

Author Response

Reviewer 85gD

Q1. Difference between the Proposed Method, Common Reinforcement Learning, and Supervised Fine-tuning

In the following, we first clarify that our method is NOT a straightforward application of common reinforcement learning to diffusion model alignment. We then address your question about supervised fine-tuning (i.e., directly tuning on high-quality data) versus reinforcement-learning-based tuning.

(1) Not the same task as the text-alignment RL task. While related works exist in the area of text-to-image tasks involving human preference, our work is the first to introduce reinforcement learning (RL)-based alignment into the task of image inpainting. First, our proposed human-preference inpainting dataset makes this task feasible; without such a dataset with high-quality human-preference labels, undertaking inpainting alignment would be impossible. Note that our dataset is annotated by a professional data annotation company. The technical novelty of the proposed method, compared to conventional RL, primarily lies in our modeling of the reward accuracy and the adaptive control of the reward regularization strength using the reward error upper bound, as detailed in Secs. 3.2 and 3.3 of our paper.

(2) Supervised fine-tuning cannot address our task. We first clarify that our baseline model (the pretrained diffusion model) has already been trained on plenty of GT samples for image inpainting, which are of even higher quality than our high-preference samples. However, the limited performance of the pretrained model, as seen in the comparison of Runway and Ours in Tables 1 and 2 of our paper, indicates that supervised fine-tuning can hardly modify the model's generation accurately and effectively from the perspective of preference alignment.

Moreover, to further address your concerns, we have experimentally evaluated the performance of fine-tuning-based methods.

| No. | Method | Train Dataset | Val Dataset | T2I | BLIP | CLIP | CA | Reward |
|---|---|---|---|---|---|---|---|---|
| (a) | Original (Best Samples) | - | Training Prompt | 3.61 | 0.49 | 0.22 | 0.45 | 0.19 |
| (b) | FineTuning-Model | Training Prompt | Training Prompt | -4.89 | 0.49 | 0.22 | 0.44 | 0.12 |
| (c) | FineTuning-Model | Training Prompt | Val Prompt | -13.61 | 0.48 | 0.21 | 0.40 | 0.02 |
| (d) | RL-Model (Ours) | Training Prompt | Val Prompt | 14.90 | 0.49 | 0.23 | 0.45 | 0.37 |

where (a) denotes the high-quality subset selected from the original samples using our reward labels; (b) the images reconstructed by the fine-tuned model on the training prompts; (c) the fine-tuned model's performance on the validation prompts; and (d) the samples generated by the RL-based model evaluated on the validation prompts. We observe that the fine-tuned model does not perform well on the training set and performs even worse on the validation set. In contrast, the proposed method accurately aligns with human preferences.

Q2. Comparisons with Additional inpainting Methods

We have experimentally compared our method with the papers you mentioned. For all these methods, we assessed performance using their publicly released models, evaluated on our test dataset. As shown in the table below, our method significantly outperforms all the compared methods.

| Metrics (larger is better) | T2I | BLIP | CLIP | CA (Incep.) | Reward | # Param (M) | Infer. Time (s) |
|---|---|---|---|---|---|---|---|
| PowerPaint (v-1) | -4.44 | 0.46 | 0.21 | 0.42 | -0.057 | 819.72 | 5 |
| PowerPaint (v-BrushNet) | -3.84 | 0.46 | 0.20 | 0.42 | -0.036 | 1409.88 | 16 |
| BrushNet (realistic-V15VAE) | 1.26 | 0.46 | 0.22 | 0.43 | 0.137 | 1409.86 | 16 |
| HdPaint (ds8) | -4.57 | 0.47 | 0.21 | 0.44 | -0.059 | 451.47 | 60 |
| PrefPaint (Ours) | 11.60 | 0.49 | 0.23 | 0.45 | 0.374 | 819.72 | 5 |

Moreover, we also tested the WinRate against our baseline model (Runway). Our method greatly surpasses all compared methods, which validates the effectiveness of the proposed RL-based alignment scheme.

| WinRate vs. BaseModel (larger is better) | S=1 | S=2 | S=3 |
|---|---|---|---|
| PowerPaint (v-1) [ECCV 2024] | 27.06% | 39.92% | 47.38% |
| PowerPaint (v-BrushNet) [ECCV 2024] | 29.86% | 43.12% | 52.01% |
| BrushNet (realistic-V15VAE) [ECCV 2024] | 49.49% | 62.83% | 69.22% |
| HdPaint (ds8) [arXiv 2023] | 33.37% | 43.41% | 49.03% |
| PrefPaint (Ours) | 71.27% | 85.88% | 93.50% |

Q3. Size of Training Dataset

We would like to clarify a possible misunderstanding here: the dataset is primarily used to train the reward model rather than the diffusion model. With an accurate reward model, we can easily train diffusion models on larger datasets by allocating more prompts. We empirically validated that the dataset size is sufficient for training the reward model; specifically, as shown below, we train the reward model with 10K, 20K, 30K, and 50K data samples, respectively.

| Size | 10K | 20K | 30K | 50K |
|---|---|---|---|---|
| Reward Accuracy ↑ | 72.5% | 74.0% | 75.3% | 75.9% |

We observe that the accuracy of the reward model saturates as the size of the training dataset increases. Therefore, we believe that 50K data samples are sufficient for training an accurate reward model. Additionally, the proposed method already outperforms the SOTA, validating its effectiveness.

Comment

Dear Reviewer 85gD,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. We understand that you may be reviewing multiple papers and have a busy schedule. In our previous response, we made sure to address your remaining concerns directly and thoroughly. We eagerly await your further feedback on our responses.

Best regards,

The Authors

Comment

I change my score to 5 since the authors' feedback addressed most of my concerns. I strongly recommend the authors add the new experiment results into their final version.

Comment

The authors appreciate your feedback. We promise that the comparisons with the additional methods listed in the rebuttal will be incorporated into the final version or the supplementary materials.

Author Response

General Response

We thank all reviewers for their time and constructive comments. We sincerely thank Reviewer yfx6 for affirming the motivation and novelty of our task, as evidenced by comments such as "Release a new dataset..." and "First incorporate RL with image inpainting ...", as well as for acknowledging the thoroughness of our experimental evaluations, as highlighted by comments like "The authors provide extensive experiments, ...".

Furthermore, we are thankful to Reviewer Hftn for acknowledging the theoretical contributions outlined in our paper, as illustrated by comments such as "This paper provides theoretical insights into ...", and for recognizing the significance of our dataset in addressing this emerging task, as indicated by the comment "This paper presents a dataset for image in...". We value the insightful feedback provided by all reviewers, which has greatly enriched our work.

We believe we have clearly and directly addressed all concerns. Here, we would like to summarize a few key clarifications regarding the contributions of our work.

(1) Our method makes the first exploration of the human preference alignment problem of diffusion models on the task of image inpainting and proposes a novel benchmark with preference scores labeled by human experts. While many T2I models use RL to enhance the consistency between text meaning and image content, our method specifically focuses on the task of image inpainting. As far as we know, the alignment of diffusion models in this context has not been explored before.

(2) We propose a human-preference-aware inpainting dataset that enables RL-based image inpainting alignment. Without such a dataset with high-quality human-preference labels, undertaking inpainting alignment would be impossible. Note that our dataset is annotated by a professional data annotation company.

(3) The technical novelty of the proposed method mainly lies in modeling the upper bound of reward estimation error and using it to adaptively control the reinforcement learning regularization strength, an approach that, to the best of our knowledge, has not been previously investigated.

(4) The reason for using RL for diffusion alignment lies in the mismatch between the training and inference processes of diffusion models. Specifically, the diffusion model is trained on individual steps of a decoupled probabilistic flow, while inference runs the entire trajectory and projects random noise onto data samples. Since the score function (noise) at each step cannot be estimated exactly, the accumulated errors over the reverse steps may cause the reconstructions to drift away from the targets. RL can optimize the entire trajectory (Markov chain) and account for the accumulated error of each step, as the reward is measured on the final reconstruction.
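As a generic illustration of this point (a textbook-style trajectory-level policy gradient, not necessarily the exact objective used in the paper), the reward of the final reconstruction $x_0$ is propagated to every reverse step of the denoising chain $p_\theta(x_{t-1}\mid x_t, c)$:

```latex
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{x_{0:T} \sim p_{\theta}(\,\cdot\mid c)}
    \!\left[ r(x_{0}, c) \sum_{t=1}^{T}
      \nabla_{\theta} \log p_{\theta}\!\left(x_{t-1} \mid x_{t}, c\right) \right]
```

In the proposed method, this per-sample term is further scaled by the confidence weight $\gamma$ described in the responses above.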

Thanks again for the time and effort. We appreciate any further questions and discussions.

Last but not least, we will make the reviews and author discussion public regardless of the final decision. Besides, we will include the newly added experiments and analysis in the final manuscript/supplementary material.

Final Decision
  • Reviewers acknowledged the contributions of applying RL to align the diffusion-based image inpainting/outpainting models with human preferences and of the newly introduced dataset.
  • Reviewers also raised many technical questions: 1) the motivation and the implementation details of the reward weighting factor (yfx6, Hftn), 2) comparing with direct fine-tuning models and recent image inpainting models (85gD), 3) comparing with RL-based methods for T2I diffusion models (Hftn, G8Cx), 4) the details of data annotation (Hftn) and some ablation studies.
  • In the rebuttal, the authors did an excellent job of providing thorough and direct responses and successfully addressed most of the reviewers' questions.
  • The AC recommends acceptance after reading all the reviews, rebuttals, and discussions. It is a solid work, although its contributions may be limited by the chosen reward model and the specific task of non-guided image inpainting. The AC strongly requests the authors to incorporate the tables and results from the responses into the final version, and to consider comparing with Diffusion-DPO (https://arxiv.org/abs/2311.12908), as noted in the limitation section.