Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
We present the $\lambda$-Harmonic reward function and Reward Preference Optimization (RPO) for the subject-driven text-to-image generation task.
Abstract
Reviews and Discussion
Summary: This paper proposes a method that uses the ALIGN model to compute the $\lambda$-Harmonic reward function and combines binary cross-entropy and DPO as the preference loss to supervise image generation models.
Strengths:
- The method is computation-efficient while achieving the highest scores on CLIP-I and CLIP-T.
- Successfully extends preference optimization methods to subject-driven image generation.
Weaknesses:
- I think there should be more visualization results demonstrating the superiority of the proposed method. Even considering the images in the Appendix, the number of generated images is still not sufficient.
- RPO does not achieve satisfactory performance on DINO scores. The authors have explained the reason, but I still wonder whether there is any method that could prevent the loss of detailed pixel-level information.
Questions
No questions. I just hope the authors can provide more visualization examples and discuss the DINO scores further.
Limitations
Yes
Thank you for the detailed review and thoughtful feedback. Please see our responses to your specific questions below.
Q: More visual results
We have added 32 images generated by RPO to the PDF attached to the global response, specifically focusing on prompts that are both unseen in the training data and highly imaginative.
Q: Improve DINO scores
To preserve the details of the reference images, we may need to use ControlNet [1] or incorporate an additional cross-attention layer for the reference images within the U-Net component [2]. Mathematically, these methods modify the learned distribution from $p_\theta(x_0 \mid c)$ to $p_\theta(x_0 \mid c, x^{\mathrm{ref}})$, allowing the model to capture extra information from the reference images during the inference phase. However, RPO does not make any assumptions about the model architecture. Therefore, in future work, we will integrate the ControlNet or cross-attention approach with RPO to improve DINO scores.
[1] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Chen, Wenhu, et al. "Subject-driven text-to-image generation via apprenticeship learning." Advances in Neural Information Processing Systems 36 (2024).
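To make the cross-attention idea mentioned above concrete, below is a minimal PyTorch sketch of an extra attention block in which U-Net feature tokens attend to encoded reference-image tokens; the module, its placement, and its dimensions are illustrative assumptions rather than the architecture of the cited works.

```python
import torch
import torch.nn as nn

class ReferenceCrossAttention(nn.Module):
    """Extra cross-attention block: U-Net tokens attend to reference-image tokens.

    A sketch only; dimensions, placement, and naming are assumptions, not the
    design of the cited methods.
    """
    def __init__(self, dim: int, ref_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=ref_dim,
                                          vdim=ref_dim, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, ref_tokens: torch.Tensor):
        # hidden_states: (B, N, dim) U-Net feature tokens
        # ref_tokens:    (B, M, ref_dim) encoded reference-image tokens
        attended, _ = self.attn(self.norm(hidden_states), ref_tokens, ref_tokens)
        return hidden_states + attended  # residual connection

# Quick shape check
block = ReferenceCrossAttention(dim=320, ref_dim=768)
out = block(torch.randn(2, 64, 320), torch.randn(2, 16, 768))
assert out.shape == (2, 64, 320)
```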
Dear Reviewer u7k9,
Thank you again for reviewing this paper. Since the reviewer-author discussion phase is closing soon, could you please respond to the authors' comments?
Best,
AC
This paper presents a method for generating personalized images from text using the λ-Harmonic reward function in a system called Reward Preference Optimization (RPO). Specifically, RPO fine-tunes a pre-trained text-to-image diffusion model using a reinforcement-learning-based objective built on the harmonic mean of text similarity and image similarity. This objective is used in a reinforcement learning framework for diffusion models. The results demonstrate that the proposed techniques improve the performance of the DreamBooth baseline and alleviate the image overfitting problem.
Strengths
- The paper is well-structured and easy to follow.
- The proposed λ-Harmonic reward function is reasonable and has achieved better results than the baseline.
Weaknesses
- The authors could briefly explain how the method supports adjusting λ at test time. How does the method compare with adding classifier-based guidance?
- The wall-clock time analysis of the proposed method against the baseline DreamBooth in training and sampling is missing.
- The experimental analysis of why the harmonic mean is used instead of the arithmetic mean is missing.
- Comparisons could be made more extensive. There are more recent state-of-the-art personalized generation methods, such as [a, b].
- The authors may want to compare with more baselines that also support adjusting the text-image trade-off at test time [c, d, e].
- The motivation of alleviating image overfitting and enhancing prompt adherence has been explored in [f]. Comparisons, or at least some discussion, are necessary.
[a] Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, Chen et al., ICLR 2024.
[b] Multi-concept customization of text-to-image diffusion, Kumari et al., CVPR 2023.
[c] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation, Wei et al., ICCV 2023.
[d] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, Ye et al..
[e] SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation, Zhang et al., CVPR 2024.
[f] PALP: Prompt Aligned Personalization of Text-to-Image Models, Tewel et al..
Questions
- The metric evaluations in Tables 1 and 2 appear inconsistent, despite seemingly identical parameters ($\lambda_{\text{val}} = 0.3$).
- Additionally, in the metrics ALIGN-I and ALIGN-T, I am unclear about the rationale behind adding "+1" to the similarity calculations.
Limitations
This paper has discussed its limitations.
Thank you for the detailed review and thoughtful feedback. Please see our responses to your specific questions below.
Q: How does the method allow for adjusting $\lambda$ during inference, and how does it compare to classifier-guidance inference?
Our contribution is the fine-tuning step of a pretrained diffusion model. Currently, our method (RPO) only provides reward or preference signals during the fine-tuning phase. $\lambda$ allows flexibility in the model-selection process, trading off image-image alignment against text-image alignment for subject-driven tasks (details can be seen in Table 3). The reward function selects the optimal checkpoint during fine-tuning, and we do not modify the diffusion reverse process for the inference phase. Therefore, our inference is the same as in SD, using classifier-free guidance.
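As an illustration of this reward-based model selection, the following sketch picks the checkpoint with the highest average reward on validation prompts; `generate` and `reward_fn` are hypothetical callables for illustration, not the paper's API.

```python
from statistics import mean

def select_checkpoint(checkpoints, val_prompts, generate, reward_fn):
    """Return the checkpoint whose generations score highest under the reward.

    checkpoints: iterable of model states saved during fine-tuning.
    generate(ckpt, prompt) -> image and reward_fn(image, prompt) -> float
    are assumed callables (hypothetical names, illustration only).
    """
    best_ckpt, best_score = None, float("-inf")
    for ckpt in checkpoints:
        score = mean(reward_fn(generate(ckpt, p), p) for p in val_prompts)
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt
```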
Q: Wall-clock comparison with DreamBooth. We define the fine-tuning time as the sum of the preparation time and training time. The preparation time refers to the time spent generating negative samples. We use RPO and DreamBooth to fine-tune SD on 4 TPU v4 nodes and report the wall-clock fine-tuning time for these two methods in the following table.
| Method | Wall-clock Fine-tuning Time |
|---|---|
| DreamBooth | 28 min 38.87 sec |
| Ours: RPO | 7 min 59.78 sec |
Q: Experimental results for the arithmetic mean are missing.
We replace the harmonic mean with the arithmetic mean and evaluate the same $\lambda$ settings. We report the results of the arithmetic mean reward function in the following table (a small numerical illustration of the difference follows the table). We refer readers to Table 3 in our paper for comparison. We discussed the difference between the harmonic mean and the arithmetic mean in lines 139 to 143. The arithmetic mean is not very sensitive to smaller values; it tends to maximize the higher value to achieve a better final score. In practice, ALIGN-I receives the higher value (this effect can be seen from CLIP-I and CLIP-T in Tables 1 to 3). Thus, the model tends to optimize image-to-image alignment and achieves good results on DINO and CLIP-I but a lower score for text-to-image alignment.
| $\lambda$ | DINO | CLIP-I | CLIP-T |
|---|---|---|---|
| (arithmetic) | 0.638 ± 0.083 | 0.823 ± 0.037 | 0.318 ± 0.027 |
| (arithmetic) | 0.702 ± 0.078 | 0.857 ± 0.047 | 0.295 ± 0.017 |
| (arithmetic) | 0.678 ± 0.085 | 0.851 ± 0.041 | 0.299 ± 0.026 |
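As an illustrative (made-up) numerical example of this sensitivity difference: with an image-alignment score of 0.9 and a text-alignment score of 0.3,

$$
\mathrm{arith}(0.9, 0.3) = \frac{0.9 + 0.3}{2} = 0.60,
\qquad
\mathrm{harm}(0.9, 0.3) = \frac{2 \cdot 0.9 \cdot 0.3}{0.9 + 0.3} = 0.45,
$$

so the harmonic mean is pulled toward the weaker (text-alignment) score, discouraging the model from ignoring it.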
Q: Extensive comparisons with other baseline methods.
We encourage readers to refer to the Extensive comparisons with other baseline methods section in the global response.
Q: Discussion of PALP
We encourage readers to refer to the Discussion of PALP section in the global response.
Q: Table values seem to be inconsistent
The Table 1 results use a different $\lambda_{\text{val}}$ setting (details shown in Table 3), while the Table 2 results use our default $\lambda_{\text{val}}$ setting.
Q: Unclear about the rationale behind adding "+1" to the similarity calculations.
Firstly, the harmonic mean only supports non-negative elements [1]. Secondly, our alignment measure is a cosine similarity, which lies in $[-1, 1]$. Therefore, we add 1 and then divide by 2 to ensure each element lies between 0 and 1.
[1] Scipy Library. https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_stats_py.py#L215-L307
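For concreteness, a minimal sketch of this normalization followed by a weighted harmonic mean is shown below; the convention that $\lambda$ weights the image-alignment term is our assumption here, and the exact definition is the one given in the paper.

```python
def lambda_harmonic_reward(sim_img: float, sim_txt: float, lam: float = 0.5) -> float:
    """Weighted harmonic mean of image- and text-alignment scores (sketch).

    sim_img, sim_txt: cosine similarities in [-1, 1] (e.g., from ALIGN).
    lam: weight on image-image alignment (assumed convention; see the paper).
    """
    eps = 1e-8  # guard against division by zero when a similarity is exactly -1
    # Shift cosine similarities from [-1, 1] to [0, 1]; the harmonic mean
    # requires non-negative inputs.
    a = (sim_img + 1.0) / 2.0 + eps
    b = (sim_txt + 1.0) / 2.0 + eps
    # Weighted harmonic mean: small values dominate, so text alignment
    # cannot be ignored even when image alignment is high.
    return 1.0 / (lam / a + (1.0 - lam) / b)

print(lambda_harmonic_reward(0.8, -0.4))  # toy example
```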
Thanks for the response, especially the additional results, which alleviate some of my concerns. I have increased my score to reflect this. Nevertheless, I still think the insight of this paper is not so strong given the existence of [f].
Thank you again for taking the time to review our paper and for your valuable feedback. We are pleased to hear that the additional results addressed some of your concerns, and we appreciate your recognition of the improvements.
This paper proposes a Reward Preference Optimization (RPO) method, introducing the 𝜆-harmonic reward function to address overfitting and accelerate the fine-tuning process in subject-driven text-to-image generation. Experimental results demonstrate the effectiveness of the proposed approach.
Strengths
- By introducing the λ-harmonic reward function, this model achieves simple and efficient subject-driven text-to-image generation.
- Quantitative comparisons demonstrate that this model achieves superior text-alignment results.
Weaknesses
- The novelty of the proposed method is limited for the following reasons. First, although this paper uses the weighted harmonic mean [ref-1] as the reward function, it is a classical Pythagorean mean. Second, the loss $L_{per}$ is a simple variation of binary cross-entropy and DPO.
- The experimental results do not effectively demonstrate the advantages of the proposed method. Although the proposed RPO method performs better on text alignment in quantitative comparisons, this improvement is not evident in Figure 3. Specifically, in the first and second rows, the terms "eating" and "on table" do not appear to be closely aligned.
- The comparisons are unfair since this paper does not compare the proposed method with any reinforcement-learning methods, such as DPO.
- The ablation studies show the effectiveness of the loss $L_{ref}$. However, it is not defined in this paper.
- More details about the CLIP and DINO models need to be provided, as stronger models may yield more detailed results, both in your findings and in the compared results.
[ref-1] Weighted Harmonic Means, Complex Anal. Oper. Theory (2017) 11:1715–1728.
Questions
Please see weaknesses.
Limitations
No.
Thank you for the detailed review and thoughtful feedback. Please see our responses to your specific questions below.
Q: Limited novelty of reward function and loss.
Yes, the $\lambda$-harmonic reward function is indeed a weighted classical Pythagorean mean; with equal weights it reduces exactly to the classical (unweighted) harmonic mean. The novelty of our contribution comes from combining this reward function with a reinforcement-learning-based fine-tuning approach, not from the reward function alone. There are two aspects to consider in the design of the $\lambda$-harmonic reward function: $\lambda$ provides flexibility to the user in skewing the relative importance of image-to-image alignment and text-to-image alignment. Also, as mentioned in lines 139 to 143, the harmonic variant is preferred over its geometric and arithmetic counterparts because we do not want the fine-tuning process to ignore text-to-image alignment, which tends to yield lower reward values.
While $L_{per}$ is a variation of binary cross-entropy (BCE) and DPO, it has an important difference despite the seemingly simple formulation. In DPO, validation is done using an external model or human evaluation [1], while the model is trained on raw preference data. Our approach fine-tunes using binary labels (hence the BCE loss) sampled from the reward model, and the validation signal comes from the same reward model. This reduces distributional shift between training and validation.
Using the $\lambda$-Harmonic function and our simple loss function (Equation 9), RPO allows for early stopping and requires only 3% of the negative samples used by DreamBooth, as well as fewer gradient steps, making it highly efficient.
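To illustrate the general shape of such a BCE-based preference objective (a schematic sketch only, not the paper's exact Equation 9):

```python
import torch
import torch.nn.functional as F

def bce_preference_loss(pref_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over preference logits (schematic).

    pref_logits: per-pair scalars, e.g. a scaled difference of denoising
        errors between the fine-tuned and the frozen pretrained model
        (this construction is an assumption, not Equation 9 of the paper).
    labels: binary preferences sampled from the lambda-Harmonic reward model.
    """
    return F.binary_cross_entropy_with_logits(pref_logits, labels.float())

# Example usage with dummy values
loss = bce_preference_loss(torch.randn(4), torch.tensor([1, 0, 1, 1]))
```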
Q: The qualitative comparisons in Figure 3 do not show the improvement.
For Figure 3, as we discussed in our paper (Lines 234 to 236), the first prompt is not grammatically correct: "A dog eating a cherry bowl." The correct prompt should be "A dog eating a cherry from a bowl." The trained model was confused by the original incorrect prompt and generated this image. As for the concern over the misalignment with "table" in the second row, the image generated by our model clearly shows the object standing on a table (at the bottom of the image). Furthermore, we provide an additional comparison in Appendix A.3: RPO can address some of the failure cases of DreamBooth and SuTI.
Q: Comparison to DPO.
The original DPO is not suitable for subject-driven tasks because the datasets do not contain preference labels. We introduce the $\lambda$-harmonic function and design a variant of DPO for this task. To address your concern, we implemented a pure diffusion DPO [2] (without the image similarity loss), using preference labels derived from image-to-image similarity and text-to-image alignment. We chose a $\lambda$ that assigns equal weight to image-to-image similarity and text-to-image alignment. For a fair comparison, we also report the results of RPO with the same $\lambda$. The results on DreamBench for these two methods are shown in the following table.
| Method | DINO | CLIP-I | CLIP-T |
|---|---|---|---|
| DPO | 0.338 | 0.702 | 0.334 |
| Ours: RPO | 0.649 | 0.819 | 0.314 |
These results show that DPO can capture the text-to-image alignment from the preference labels. However, without the image similarity loss, DPO faces a significant overfitting problem; i.e., it achieves high text-to-image alignment but cannot preserve the subject's unique features.
Q: Undefined $L_{ref}$ loss.
Thank you for pointing out this typo in the notation; we will correct it so that the loss used in the ablation study matches the loss defined in the paper.
Q: Details on CLIP and DINO.
CLIP consists of two encoders: one for the image and one for the text [3]. Suppose a dataset contains pairs of images and captions. The objective is to ensure that the encoding of the image and the encoding of the text (caption) are aligned in terms of cosine similarity. This is achieved through contrastive learning with positive and negative pairs, which naturally proves useful in our case since this cosine similarity can be used as a reward signal for both image-image and image-text alignment. DINO, on the other hand, encodes only images, and in a significantly different manner [4]. Image augmentations are created with random perturbations and cropping and are passed through student and teacher models to produce representations that summarize the image. This gives DINO a more fine-grained understanding of images than CLIP in terms of producing accurate rewards, but it cannot provide signals for text-image alignment.
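For reference, one common way to compute CLIP-I/CLIP-T-style scores with a public checkpoint is sketched below; the model choice and the helper function are illustrative assumptions, not the exact evaluation code used in the paper.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: Image.Image, reference: Image.Image, prompt: str):
    """Return (CLIP-I, CLIP-T) cosine similarities for one generated image."""
    inputs = processor(text=[prompt], images=[generated, reference],
                       return_tensors="pt", padding=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    clip_i = (img[0] @ img[1]).item()  # generated vs. reference image
    clip_t = (img[0] @ txt[0]).item()  # generated image vs. prompt
    return clip_i, clip_t
```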
[1] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural Information Processing Systems 36. 2024.
[2] Wallace, Bram, et al. "Diffusion model alignment using direct preference optimization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning, PMLR. 2021.
[4] Caron, Mathilde, et al. "Emerging properties in self-supervised vision transformers." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
Thank you for your response. Most of my concerns have been addressed. While the novelty is somewhat incremental, I believe this is good work. Therefore, I’ve decided to raise my score to borderline accept in support of your efforts. Thank you for your hard work.
Thank you for your thoughtful feedback and for raising your score. We are glad that our response addressed most of your concerns, and we appreciate your recognition of our work.
The paper introduces a novel method for generating text-to-image outputs that incorporate specific subjects from reference images. The authors propose a λ-Harmonic reward function and a Reward Preference Optimization (RPO) algorithm, which simplifies the training process and enhances the model's ability to maintain subject fidelity while generating diverse and contextually accurate images. The approach outperforms existing methods like DreamBooth and SuTI in various metrics, demonstrating significant improvements in training efficiency and image quality.
Strengths
- Innovative Reward Function: The introduction of the λ-Harmonic reward function is a notable advancement, providing a robust reward signal that facilitates early stopping and regularization, thus accelerating the training process.
- Efficient Training Process: The proposed RPO method significantly reduces the computational resources required for training by only needing 3% of the negative samples used by DreamBooth and fewer gradient steps, making it highly efficient.
- Empirical Performance: The method achieves state-of-the-art performance on the DreamBench dataset, with superior CLIP-I and CLIP-T scores, indicating strong text-to-image alignment and subject fidelity.
- Simplicity: The approach simplifies the training pipeline by fine-tuning only the U-Net component of the diffusion model, avoiding the need for optimizing text embeddings or training a text encoder.
Weaknesses
- Limited Evaluation Metrics: While the paper uses DINO and CLIP-I/CLIP-T scores, it would benefit from additional evaluation metrics that capture other aspects of image quality, such as perceptual quality or user satisfaction.
- Overfitting Risk: Although the λ-Harmonic reward function helps in regularization, there is still a noted risk of overfitting, particularly in generating images with high text-to-image alignment but lower uniqueness in certain features.
Questions
How does the proposed method perform with completely unseen or highly imaginative prompts that deviate significantly from the training data?
Limitations
N/A
Thank you for the detailed review and thoughtful feedback. Please see our responses to your specific questions below.
Q: Limited evaluation metrics.
Limited evaluation metrics are a common issue in subject-driven tasks. We appreciate you raising this issue and suggesting additional aspects such as perceptual quality. In the table below, we report the average aesthetic score [1] of the real reference images in DreamBench and the average aesthetic score obtained by RPO with the $\lambda$ configuration that achieves the best CLIP-I/T scores on DreamBench. We observe that RPO does not decrease image quality; instead, the generated images achieve slightly better quality than the reference images.
| Method | Aesthetic Score |
|---|---|
| Real images | 5.145 ± 0.312 |
| Ours: RPO | 5.208 ± 0.327 |
Q: Overfitting risk.
We observe this problem and propose a hypothesis in lines 228 to 231. To address this overfitting issue, we may need to use ControlNet [2] or add an additional cross-attention layer for the reference images within the U-Net component [3]. Mathematically, these methods change the learned distribution from $p_\theta(x_0 \mid c)$ to $p_\theta(x_0 \mid c, x^{\mathrm{ref}})$, which can capture information from the reference images during the inference phase. However, RPO makes no assumptions about the model architecture. Therefore, in future work, we will combine the ControlNet or cross-attention approach with RPO to alleviate this overfitting risk.
Q: Unseen prompts visualization results.
We have added 32 images generated from highly imaginative prompts to the PDF attached to the global response.
[1] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. "Laion-5b: An open large-scale dataset for training next generation image-text models", 2022.
[2] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[3] Chen, Wenhu, et al. "Subject-driven text-to-image generation via apprenticeship learning." Advances in Neural Information Processing Systems 36 (2024).
Thank you for addressing most of my concerns. I have no major issues with the methods and experiments presented in the paper. As a result, I will maintain my current score.
Dear Reviewer 5xjH,
Thank you again for reviewing this paper. Since the reviewer-author discussion phase is closing soon, could you please respond to the authors' comments?
Best,
AC
This paper proposes a harmonic reward function for text-to-image personalization. Specifically, it uses this reward function to perform early stopping and regularization. Experimental results demonstrate the effectiveness of the proposed method.
Strengths
- Their proposed reward-based model selection and regularization are reasonable and compelling.
- The generated images presented in the paper demonstrate superior text fidelity of the proposed method.
Weaknesses
- More baselines are needed. Specifically, since this paper is similar to DCO [1], a comparison with this method is needed.
[1] Lee, Kyungmin, et al. "Direct consistency optimization for compositional text-to-image personalization." arXiv preprint arXiv:2402.12004 (2024).
Questions
What are the results when the lambda value of the harmonic reward function is not set to 0 in the preference loss during training?
Limitations
Yes.
Thank you for the detailed review and thoughtful feedback. Please see our responses to your specific questions below.
Q: Comparison to DCO.
Lee et al. [1] use SDXL [2] as the backbone model and apply LoRA to fine-tune the pretrained model. For a fair comparison, we only use LoRA to fine-tune the U-Net component for both methods (DCO and RPO). Furthermore, we implement RPO with SDXL as the backbone model, training the LoRA parameters with 1000 gradient steps and saving the checkpoint from the last gradient step, i.e., without early stopping. RPO achieves better image-to-image similarity results than DCO (higher DINO and CLIP-I). Even without early stopping, the CLIP-T score is only slightly lower (by 0.016) than DCO.
| Method | Backbone | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|
| DCO | SDXL | 0.593 | 0.792 | 0.343 |
| RPO (w/o early stopping) | SDXL | 0.644 | 0.823 | 0.327 |
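For context, a typical LoRA configuration for the U-Net attention projections looks like the sketch below; the rank, alpha, and target modules are illustrative assumptions, not necessarily the settings used in this comparison.

```python
from peft import LoraConfig

# Low-rank adapters on the attention projections of the U-Net.
# These hyperparameters are placeholders for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
# In recent diffusers versions the adapter can then be attached with
# `unet.add_adapter(lora_config)`, and only the LoRA parameters are trained.
```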
Q: What happens when $\lambda \neq 0$ in the preference loss during training?
We test four different values of $\lambda$ with the default $\lambda_{\text{val}}$ and report their results on DreamBench. We observe that image-to-image alignment increases with larger $\lambda$, but text-to-image alignment decreases because the preference model tends to favor alignment with the reference images and ignores prompt alignment.
| $\lambda$ | DINO | CLIP-I | CLIP-T |
|---|---|---|---|
|  | 0.581 ± 0.113 | 0.798 ± 0.039 | 0.329 ± 0.021 |
|  | 0.646 ± 0.083 | 0.815 ± 0.037 | 0.315 ± 0.026 |
|  | 0.649 ± 0.080 | 0.829 ± 0.039 | 0.314 ± 0.026 |
|  | 0.651 ± 0.088 | 0.831 ± 0.033 | 0.314 ± 0.026 |
[1]: Lee, Kyungmin, et al. "Direct consistency optimization for compositional text-to-image personalization." arXiv preprint arXiv:2402.12004 (2024).
[2]: Podell, Dustin, et al. "Sdxl: Improving latent diffusion models for high-resolution image synthesis." arXiv preprint arXiv:2307.01952 (2023).
Dear Reviewer tGfs,
Thank you again for reviewing this paper. Since the reviewer-author discussion phase is closing soon, could you please respond to the authors' comments?
Best,
AC
Global response
We would like to thank all the reviewers for providing high-quality reviews and constructive feedback. We are encouraged that the reviewers find that the "generated images presented in the paper demonstrate superior text fidelity of the proposed method" (Reviewer tGfs), that the "proposed reward-based model selection and regularization are reasonable and compelling" (Reviewer tGfs), that our method "significantly reduces the computational resources" (Reviewer 5xjH), and that we "successfully extend preference optimization methods to subject-driven image generation" (Reviewer u7k9).
Below, we summarize our major work for this rebuttal.
- Extensive baseline comparison. In response to Reviewers tGfs, Kj9e, and BTXt, we compare RPO to additional baseline methods.
- Additional visual results. We generated 32 additional images for unseen and highly imaginative prompts to address the concerns of Reviewers 5xjH and u7k9.
Additional Response to Reviewer Kj9e
Due to the character limit for the rebuttal, we provide extensive comparisons with additional baseline methods and discuss the differences between our approach and another method, PALP [8], which focuses on balancing text-to-image alignment.
Q: Extensive comparisons with other baseline methods.
We compared RPO against all of the personalization and text-image trade-off baselines mentioned by Reviewer Kj9e on the DreamBench dataset, and the results are shown in the following table. We highlight that our method, RPO, still achieves the highest CLIP-I and CLIP-T results across these new baselines.
| Method | DINO | CLIP-I | CLIP-T |
|---|---|---|---|
| DisenBooth [1] | 0.574 | 0.755 | 0.255 |
| Multi-concept [2] | 0.695 | 0.801 | 0.245 |
| ELITE [3] | 0.652 | 0.762 | 0.255 |
| IP-Adapter [4] | 0.608 | 0.809 | 0.274 |
| SSR-encoder [5] | 0.612 | 0.821 | 0.308 |
| Ours: RPO | 0.652 | 0.833 | 0.314 |
Q: Discussion of PALP
The PALP method [6] is designed to balance the trade-off between personalization and text-to-image alignment. PALP involves two phases: a personalization phase and a prompt-aligned phase. The first phase minimizes the image similarity loss. During the second phase, PALP optimizes the model by taking the gradient given in Equation 6 of [6]. In the following, $\theta$ denotes the fine-tuned parameters, $\theta_{\text{pre}}$ the parameters of the pre-trained model, $\hat{x}_0$ the prediction of $x_0$ (Equation 15 in [7]), $c_{\text{personal}}$ the personalized prompt containing an identity token, and $c$ the clean prompt (i.e., the same prompt without the identity token).
Following the derivation in Appendix A.4 of [8], updating the parameters with this gradient is equivalent to minimizing the KL divergence between $p_{\theta}(x_0 \mid c)$ and $p_{\theta_{\text{pre}}}(x_0 \mid c)$. PALP therefore improves text-to-image alignment by restricting the learned model to stay close to the pretrained model. In contrast, RPO optimizes the lower bound of the RL objective function (Equation 3 in our paper).
RPO not only includes a penalty on the KL divergence between the learned distribution and the pretrained distribution, but also utilizes reward signals. Compared to PALP, our method should be more flexible, since we make no assumptions about the reward signals, and they can also be adapted to other objectives, e.g., an aesthetic score.
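Schematically (in our own notation, not a verbatim copy of Equation 3), the objective that RPO lower-bounds has the KL-regularized form

$$
\max_{\theta}\;
\mathbb{E}_{c,\; x_0 \sim p_{\theta}(\cdot \mid c)}\big[\, r_{\lambda}(x_0, c) \,\big]
\;-\; \beta\, \mathrm{KL}\!\big( p_{\theta}(x_0 \mid c) \,\|\, p_{\theta_{\mathrm{pre}}}(x_0 \mid c) \big),
$$

which makes explicit that RPO keeps the KL regularizer toward the pretrained model while additionally steering generation with the reward $r_{\lambda}$.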
[1] Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, Chen et al., ICLR 2024.
[2] Multi-concept customization of text-to-image diffusion, Kumari et al., CVPR 2023.
[3] ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation, Wei et al., ICCV 2023.
[4] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, Ye et al.
[5] SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation, Zhang et al., CVPR 2024.
[6] Arar, Moab, et al. PALP: Prompt Aligned Personalization of Text-to-Image Models. arXiv preprint arXiv:2401.06105 (2024).
[7] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020): 6840-6851.
[8] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
Dear Reviewers,
Thank you for your comments. Please read through all the reviews and the rebuttal and see if authors' responses have addressed your and others' concerns.
Best,
AC
This paper proposes the λ-harmonic reward function and the Reward Preference Optimization (RPO) algorithm for subject-driven text-to-image generation. Initially, reviewers raised concerns such as the lack of comparisons with more baselines and related works, the lack of evaluation metrics, insufficient visualization results and experimental analysis, limited novelty, marginal improvement, and overfitting risk. The rebuttal successfully addressed most of these concerns and all reviewers recommended positive final ratings. Considering that the proposed method is reasonable and the experimental results show its effectiveness and efficiency, as appreciated by the reviewers, the AC follows this unanimous recommendation. Reviewers did raise valuable concerns that should be addressed. The authors should ensure that all new experiments and discussions provided during the rebuttal phase are included in the camera-ready version.