Preventing Shortcuts in Adapter Training via Providing the Shortcuts
We find that rerouting spurious shortcuts in adapter training enables robust disentanglement for text-to-image generation with adapters.
Abstract
Reviews and Discussion
Aiming at the "shortcut learning" problem in adapter training for text-to-image (T2I) generation, this paper proposes a shortcut-rerouted adapter training method. Conventional adapter training easily entangles the target attribute (such as identity) with confounding factors such as pose and expression, so the generated results fail to respond to the text prompt. By introducing auxiliary modules (such as ControlNet or LoRA) to explicitly divert the confounding factors during training, the authors force the adapter to learn only the target attribute. Experiments show that this method significantly improves pose and expression controllability and generation quality in face and full-body personalized generation tasks, outperforming baseline methods such as InfU and PuLID.
Strengths and Weaknesses
Strengths:
- A counterintuitive but effective solution is proposed: actively providing a "shortcut" eliminates the adapter's dependence on confounding factors. The theoretical framework is clear (a formal probabilistic modeling of the problem), and no modification of the base model is required.
- Shortcut rerouting is realized by combining two mature tools, LoRA and ControlNet, making the method both practical and scalable.
Weaknesses:
- The method relies on specific tools such as ControlNet or LoRA, and the paper does not verify applicability to other confounding factors (such as background style or lighting) or to more complex scenes.
- Judging from the qualitative experiments shown, the improvement appears limited.
Questions
- Does the SR-IPA mentioned in this paper have this decoupling ability for any input image after training, or does it simply need to be trained for each case, which means that each case corresponds to a specific SR-IPA weight?
- How to define these confounding attributes? So far, this paper focuses on the decoupling of face identity information and its pose or background. Is there any other decoupling ability? For example, for face only, can we decouple makeup, hair color/style, glasses, and other attributes?
- This paper uses a text-to-image generation model. How are the prior examples given in this paper provided as input? Are existing tools, such as IP-Adapter, used directly to encode and input the images, or is the text prompt of the corresponding image entered? If a text prompt is entered, how can consistency with the prior image be ensured?
- In the experiment section, it is declared that a dataset has been collected. Will the dataset be open-accessed?
Limitations
yes
Justification for Final Rating
Thank you very much for the detailed response, which effectively resolved my questions. I am very pleased with the contribution of this work to the community and look forward to the authors' future work with stronger attribute-control abilities. Considering the authors' response and the opinions of the other reviewers, I have decided to increase my rating.
Formatting Issues
On page 2, line 44, there is a stray 'fig'; I believe this is a mistake.
We thank reviewer fa9d for their constructive comments and feedback. We are delighted that they found our solution "counterintuitive but effective," which precisely reflects the sentiment we aimed to convey with our title.
Weakness 1: Other confounding factors like “background” and “lighting”, or “complex scenes”
Motivated by the suggestion to test SR on additional attributes and the practical observation that lighting often "leaks" from the input image, we conducted a new experiment. When prompted to generate an image with artistic lighting, this effect is frequently diminished in the personalized model, as the input image typically features more ordinary, everyday lighting. Based on this, we experimented with shortcut rerouting for lighting.
To establish a lighting shortcut, we trained an IPA where the input image was the target image with the human masked out. This adapter is designed to encode all target image information except the human. We then used this adapter as a shortcut to train our personalization adapter (IPA w/ SR-LG). Our preliminary results indicate that this mechanism effectively factors out lighting and improves adherence to stylistic prompts during generation. Due to the inability to include qualitative results in the rebuttal, we used the LPIPS between the generated and prior images as a proxy for lighting preservation. Additionally, we conducted a user study across all testing prompts where raters viewed the prompt, prior image, and a generated image, and answered: “Is the lighting harmonized with the prompt?”
| Methods | LPIPS ↓ | User Ratings ↑ |
|---|---|---|
| IPA | 0.433 | 33.6% |
| IPA w/ SR-LG | 0.344 | 93.2% |
These results demonstrate that SR effectively factors out attributes such as "lighting" and improves adherence to the prior. We will include this new experiment and its qualitative results in the revision.
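For reproducibility, the LPIPS proxy above can be computed with the off-the-shelf `lpips` package; the snippet below is a minimal sketch under assumed file naming (paired `generated/` and `prior/` folders), not our exact evaluation script.

```python
# Minimal sketch: mean LPIPS between each generated image and its "prior"
# counterpart. Assumes paired files generated/<name>.png and prior/<name>.png.
import os
import torch
import lpips
from PIL import Image
from torchvision import transforms

loss_fn = lpips.LPIPS(net="alex")  # standard AlexNet-based LPIPS
to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # map to [-1, 1]
])

def load(path):
    return to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)

scores = []
for name in os.listdir("generated"):
    with torch.no_grad():
        scores.append(loss_fn(load(f"generated/{name}"), load(f"prior/{name}")).item())

print(f"mean LPIPS: {sum(scores) / len(scores):.3f}")  # lower = closer to the prior
```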
Weakness 2: Limited improvements in qualitative results
Our primary contribution is "shortcut rerouting training." We aim to validate its effectiveness in eliminating shortcuts during inference, specifically "expression," "pose," and "distribution shift." Figures 5 and 6 qualitatively demonstrate these improvements, showing that SR-CN results more closely resemble "Prior" generations than baselines. "Prior" generations were produced by the same T2I model without the adapter, using identical prompts and seeds. An ideal adapter, free of shortcuts, should ideally preserve the pose and structure of this "prior" image. We will make this point more prominent in the revised writing.
Q1: Does the SR-IPA mentioned in this paper have this decoupling ability for any input image after training, or does it simply need to be trained for each case, which means that each case corresponds to a specific SR-IPA weight?
Once trained, the SR-IPA operates in a feedforward manner and generalizes to unseen input identities without requiring retraining. The adapter is trained to factor out specific shortcut cues (e.g., lighting, background, pose, expression), and this decoupling behavior holds for new test images as long as the shortcut definition remains the same. If users wish to factor out a different type of shortcut, a new adapter must be trained accordingly. In other words, the decoupling generalizes across identities, but not across shortcut types.
Q2: How to define these confounding attributes? So far, this paper focuses on the decoupling of face identity information and its pose or background. Is there any other decoupling ability? For example, for face only, can we decouple makeup, hair color/style, glasses, and other attributes?
We arrived at the choice of these confounding attributes ("expression", "pose", and "distribution shift") through practical attempts to build a useful appearance adapter. We seek a "face" (or "full-body") adapter that can generate anybody in any imaginary scene. Yet, in practice, we found that existing adapters failed to control the character's expression or pose through the prompt. This severely limited the kinds of images we could generate.
Your suggestion of decoupling makeup, hair, and accessories sounds interesting! “Hair” and “accessories” can likely be masked out by segmentation masks to a large extent. “Makeup” sounds like a good attribute to factor out given a dataset that isolates “makeup”.
Q3: This paper uses a text-to-image generation model. How are the prior examples given in this paper provided as input? Are existing tools, such as IP-Adapter, used directly to encode and input the images, or is the text prompt of the corresponding image entered? If a text prompt is entered, how can consistency with the prior image be ensured?
The prior images in the paper are generated using the same base text-to-image model (FLUX.1 Dev) with the same prompt and a fixed random seed, just without any image encoder or IP-Adapter. These "prior images" are not used as inputs to our method. Instead, they serve as a reference to visualize what the model would produce without any adapter conditioning. Their purpose is to illustrate how much the adapter alters the layout, lighting, pose, or expression. An ideal "appearance" adapter should preserve these attributes (e.g., the layout of the IPA generation should remain the same as in the prior image). As shown in Figures 5 and 6, our method produces images that stay more faithful to the prior compared to baseline methods.
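For clarity, a prior image can be reproduced with standard `diffusers` usage as sketched below; the sampler settings and prompt are assumptions for illustration rather than our exact generation script. The adapter-conditioned counterpart uses the same prompt and the same seed, with the adapter attached.

```python
# Minimal sketch: generate a "prior" image with the base model only
# (no image encoder / IP-Adapter), using a fixed seed for later comparison.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a person laughing on a rainy street at night"  # example prompt
seed = 42  # reused for the adapter-conditioned generation

prior_image = pipe(
    prompt,
    generator=torch.Generator("cuda").manual_seed(seed),
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
prior_image.save("prior.png")
```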
Q4: In the experiment section, it is declared that a dataset has been collected. Will the dataset be open-accessed?
We are currently exploring options for responsibly sharing the dataset. In the meantime, we emphasize that our contribution, Shortcut Rerouting training, is simple in both concept and implementation. To support reproducibility, we provide detailed methodological descriptions and include additional experiments that demonstrate the generality of our approach. Given the minimal requirements of our method, we anticipate that it can be readily reproduced using similar data.
We believe our responses have addressed your concerns and clarified any confusion. We respectfully request you to consider updating your rating, as the additional analyses and clarifications strengthen the case for acceptance. We are available for further dialogue and to refine our submission should any questions arise during the discussion period.
Thank you for your detailed response, which effectively resolved my doubts. Although this paper does have certain limitations, such as potential flaws in control granularity, the authors honestly and explicitly point them out. According to the authors' answers, it appears that with a corresponding dataset the attributes controlled by this method can be changed accordingly. I am looking forward to the authors' future work. I believe this is a method that may be widely applied in the future, and I hope the authors open-source it as soon as possible to promote community development. Overall, this is a solid and novel paper; I will correspondingly increase my rating.
This paper proposes a new training method for adapters that aims to disentangle confounding factors present in the input by explicitly providing shortcuts for them. For example, if one had to train an adapter for facial identity transfer only, this paper proposes adding additional modules to the base model (via LoRA or a ControlNet) to regress features corresponding to other factors such as pose, background, and facial expression. The paper claims that providing such "shortcuts" helps the target adapter focus only on the desired attribute (in this case facial identity) and ignore the confounding factors (the aforementioned attributes). As a consequence, during inference all other confounding adapters can be dropped and only the trained adapter is used to transfer the desired attribute. The paper tests the proposed method on two target attributes: facial identity transfer and body transfer. Both qualitative and quantitative results demonstrate the efficacy of the method. However, more attributes should be tested as well, for example body pose, where the person identity is the shortcut.
Strengths and Weaknesses
Strengths:
- The paper is well written and easy to follow.
- The idea is very intuitive and well motivated.
- For two specific attributes, the paper presents ample evidence.
Weaknesses:
- Only two attributes are tested.
Questions
I would strongly encourage the authors to include an ablation of the results with respect to the relative representational strength allocated to the shortcut.
Limitations
Test cases are too limited
Justification for Final Rating
The authors addressed my concerns clearly with the lighting shortcut experiment. Looking at the other reviews and the rebuttal, I am convinced that this method provides useful insights into conditional large-scale foundation models in a modality-specific manner and will be very valuable to the research community in general.
Formatting Issues
N/A
We thank the reviewer for the encouraging feedback and thoughtful comments. Below, we address the main concern raised:
Limited number of tested attributes
We agree that expanding the range of evaluated attributes would further strengthen our claims.
We investigated specific shortcuts due to their practical implications. During IPA development and testing, we observed that generated images often failed to align with the prompt, as pose and expression from the input image "leaked" through (i.e., pose/expression shortcuts). Additionally, the aesthetic quality of the trained model deteriorated compared to the base FLUX.1-Dev model (i.e., distribution shift). These observations motivated our study of these attributes.
Motivated by the suggestion to test SR on additional attributes and the practical observation that lighting often "leaks" from the input image, we conducted a new experiment. When prompted to generate an image with artistic lighting, this effect is frequently diminished in the personalized model, as the input image typically features more ordinary, everyday lighting. Based on this, we experimented with shortcut rerouting for lighting.
To establish a lighting shortcut, we trained an IPA where the input image was the target image with the human masked out. This adapter is designed to encode all target image information except the human. We then used this adapter as a shortcut to train our personalization adapter (IPA w/ SR-LG). Our preliminary results indicate that this mechanism effectively factors out lighting and improves adherence to stylistic prompts during generation. Due to the inability to include qualitative results in the rebuttal, we used the LPIPS between the generated and prior images as a proxy for lighting preservation. Additionally, we conducted a user study across all testing prompts where raters viewed the prompt, prior image, and a generated image, and answered: “Is the lighting harmonized with the prompt?”
| Methods | LPIPS ↓ | User Ratings ↑ |
|---|---|---|
| IPA | 0.433 | 33.6% |
| IPA w/ SR-LG | 0.344 | 93.2% |
These results demonstrate that SR effectively factors out attributes such as "lighting" and improves adherence to the prior. We will include this new experiment and its qualitative results in the revision.
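For concreteness, the masked-input construction used for the lighting shortcut can be sketched as follows; this is illustrative preprocessing (assuming a binary person mask from any off-the-shelf segmentation model), not our exact pipeline.

```python
# Illustrative sketch: build the lighting-shortcut input by masking the person
# out of the target image, given a binary person mask (white = person).
import numpy as np
from PIL import Image

def mask_out_person(target_path: str, person_mask_path: str, out_path: str) -> None:
    image = np.asarray(Image.open(target_path).convert("RGB"), dtype=np.uint8).copy()
    person = np.asarray(Image.open(person_mask_path).convert("L")) > 127
    image[person] = 127  # fill the person region with neutral gray
    Image.fromarray(image).save(out_path)

# The lighting-shortcut IPA is trained with this masked image as input, so it
# encodes background and lighting but not the person.
mask_out_person("target.png", "person_mask.png", "shortcut_input.png")
```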
In conclusion, we hope our responses have sufficiently addressed concerns. We respectfully request you to consider updating your rating, as we believe the additional analyses and clarifications strengthen the case for acceptance. Should any further questions arise during the discussion period, we would be happy to engage in continued dialogue and refine our submission accordingly.
I would like to thank the authors for this additional experiment. It is indeed nice to see that the proposed shortcut also works for lighting, and I strongly encourage the authors to include the qualitative results in the final version of the paper. An additional experiment that I think would strengthen the paper is an analysis of the relative representational strength allocated to the shortcut; this would essentially amount to an ablation of the relative position where the conditioning and the shortcut are introduced (earlier layers would potentially correspond to larger representational strength for the condition, though this would be architecture-dependent). Looking at the other reviews and the rebuttal, I have decided to raise my score.
Thank you for your thoughtful follow-up and for raising your score. We sincerely appreciate your support.
We're glad to hear the additional experiment on lighting was helpful. As suggested, we will include the corresponding qualitative results in the final version of the paper.
We also appreciate your suggestion to explore the relative representational strength by injecting the shortcut at different layers. We agree it could offer valuable insights, and we will include it in a revision.
Thank you again for your encouraging words and constructive suggestions.
In adapter-based training for image generation models (e.g., LoRA, IP-Adapter), the single-image reconstruction objective causes adapters to overfit to confounding factors (e.g., pose, expression, lighting) in input images rather than the target attribute (e.g., subject identity). This entanglement compromises model modularity, preventing response to textual prompts (e.g., outputs retain training-image expressions despite requests to "change the subject's expression").
Strengths and Weaknesses
Weaknesses:
- First of all, the title is very difficult to understand and fails to convey the core idea of the paper. It is unclear what “..well..” is supposed to mean.
- The use of line breaks in the abstract is quite unconventional—I have never seen this formatting style before.
- The introduction fails to provide a clear problem definition. The writing is semantically confusing and includes some bizarre expressions, such as the word “fig” in line 44.
- The mathematical formulations are unnecessarily complex and lack clear meaning. Most of them are quite naïve and could have been simplified. Moreover, the proposed method does not effectively leverage or align with these definitions.
- The methodological novelty is weak. The approach appears to be a straightforward combination of existing techniques, namely LoRA and ControlNet.
I have serious concerns that this paper may have been generated entirely by a large AI model, starting with its problematic title, which is fundamentally nonsensical. Furthermore, the entire manuscript exhibits disorganized descriptions and is riddled with errors, rendering it a substandard manuscript that is unacceptable. Consequently, I strongly recommend rejection of this submission.
Questions
Please see above
Limitations
Please see above
Justification for Final Rating
Thank you for your reply. I have decided to increase my score, but at the same time I hope you will carefully address the remaining issues in the final version.
Formatting Issues
Please see above
We respectfully address the criticism raised by the reviewer.
- “Title” – We acknowledge that the title can appear informal. It was a very deliberate decision, and understandably risky. Our intent was to use a conversational tone to highlight the counterintuitive insight that “providing the shortcuts” can prevent shortcut learning. We are happy to discuss the possibility of revising the title, and removing the “ .. well ..”.
- “Format error and typo” – Thanks for pointing out the two formatting errors. We will remove the line break in the abstract and the stray “fig” on page 2.
- “Math” – Our formalism aims to outline the core idea of shortcut rerouting in adapter training – by explicitly routing confounders through separate modules, we change the optimization path and effectively guide the adapter toward disentangled representations. A schematic sketch of this objective is provided at the end of this response.
- “.. approach appears to be a straightforward combination of existing techniques, namely LoRA and ControlNet” – We would like to highlight that our contribution is the “shortcut rerouting” training scheme; LoRA and ControlNet are used only to provide the shortcut. We use these existing tools in unintended (and novel) ways. In fact, we argue that using well-established tools as part of our method adds to, rather than diminishes, its usability. We respectfully refer the reviewer to Strength 2 highlighted by reviewer fa9d, which emphasizes the advantage of leveraging mature tools to enable shortcut rerouting.
- “have been generated entirely by an AI large model” – We are disheartened to see such a comment directed towards our project. First, as indicated in our authors’ checklist, we acknowledge using LLMs for editing our paper. Second, if the paper were entirely generated using an LLM, it would likely pick a much more conventional title and not cause the concern you had.
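To make the rerouting idea concrete for readers of this thread, a simplified, schematic version of the training objective is sketched below; the notation is illustrative and may differ from the formalism in the paper.

```latex
% Schematic sketch only; notation is illustrative, not the paper's exact formalism.
% A_{\theta}(a): adapter conditioned on the target attribute a (e.g., identity);
% S(c): auxiliary shortcut module conditioned on the confounder c (e.g., keypoints);
% \epsilon_{\phi}: frozen base denoiser, y: text prompt, x_t: noisy latent at step t.
\[
  \mathcal{L}_{\mathrm{SR}}(\theta)
  = \mathbb{E}_{x,\,a,\,c,\,t,\,\epsilon}
    \Big[\, \big\| \epsilon - \epsilon_{\phi}\big(x_t,\, t,\, y,\, A_{\theta}(a),\, S(c)\big) \big\|_2^2 \,\Big]
\]
% Because S(c) already explains the confounder c during training, minimizing the loss
% no longer pushes A toward encoding c; at inference S(c) is dropped and only A is used.
```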
This paper concerns the training of "adapters" for text-to-image models. Specifically, the paper discusses the cases where the adapter is supposed to capture the identity or unique look of a reference input image. The proposed SR method (Shortcut-Rerouting) suggests that existing adapter training methods result in confounding factors being transferred from the reference input image to the generated image and that to avoid this, extraneous factors should simply be provided directly as a condition (provide the shortcut). The authors evaluate identity consistency, expression alignment, head/body-pose alignment and other metrics in comparison to existing adapter methods such as IPA and show that their proposed modification (SR-LoRA or SR-CN) result in improvements across the board. The qualitative results show better adherence to both the identity contained in the reference input image as well as better adherence to the content of the prior image (e.g. lighting, pose, expression).
Strengths and Weaknesses
The paper is written well with clear figures that are easy to understand. The intuition for learning shortcuts (or confounding factors directly) is explained very well in Section 3.1. The overall concept can be easily understood and the method is presented with sufficient details to allow reproduction. The qualitative results are particularly impressive, showing that the proposed modifications can drastically improve identity consistency and adherence to details of the prior image such as pose/expression/etc.
A few questions do arise, however. For instance, the authors provide keypoints as explicit "shortcut" signals during training, but do not discuss the possibility of using other specialist encoders such as those for expression, body shape, attributes (e.g. CelebA). This leaves the user feeling that the overall "shortcuts" and "shortcut-rerouting" idea sounds great, but may only apply to specific cases such as pose / landmarks, without being very general. Furthermore, how can this idea of shortcuts be applied to T2I tasks for non-human subjects?
In addition, the metrics used in Sec. 4.1 are mostly sensible, but the Expr. metric in particular is using an LLM-as-a-judge even though encoders such as EMOCA exist for directly predicting expression parameters. The models used are generally dated, especially because identity is determined using FaceNet instead of the more widely used ArcFace or AdaFace models.
In the quantitative evaluations, the authors do not mention which dataset they use. It appears as though they are using a single internal dataset, which may not allow generalizable insights. As a potential improvement, the authors could consider evaluating on datasets of sufficiently different characteristics, e.g. studio images, in-the-wild, cartoon/drawings.
Questions
- This paper only discusses the potential of learning pose/keypoint related shortcuts using LoRA/ControlNet. Did you consider learning other shortcuts using existing encoders? E.g. expression parameters from EMOCA, CelebA attributes, head-pose from HopeNet.
- Did you consider using an ArcFace/AdaFace model to compute identity consistency as is common in face avatar research?
- Is an LLM a good judge of identity or facial expressions?
- Qualitative results for the SR-LoRA method are not shown in either the main paper or the supplementary materials. Is there any particular reason for this?
Limitations
The authors briefly discuss the limitation of their work (adapter-focused training, not LoRA), but do not discuss any potential negative societal impact.
Justification for Final Rating
I have gone through my peers' reviews and have reviewed my own review and prior assessment.
The authors' answers have satisfied my questions and doubts. Especially, I now understand (also based on the supplementary materials) the suitability of LLM-as-a-judge for aspects that ArcFace or EMOCA cannot cover/assess. I also appreciate the EMOCA similarity evaluations and the evaluation on cartoons/drawings. I believe that the qualitative improvements in the supplementary materials are impressive as there is better adherence to both prior as well as input condition when compared to baselines. I raise my rating accordingly.
Formatting Issues
None in particular.
We thank reviewer wPqP for the constructive and thoughtful feedback on our submission. We appreciate your recognition of the clarity of our presentation, the intuition behind shortcut rerouting, and the strong qualitative results demonstrating improved identity and attribute preservation. Your comments have been very helpful, especially the suggestion to use the EMOCA model for expression evaluation. We address the concerns raised below.
Q1: “Did you consider learning other shortcuts using existing encoders? E.g. expression parameters from EMOCA, CelebA attributes, head-pose from HopeNet ...”
We are grateful for the suggestions to explore shortcuts using existing encoders like EMOCA or HopeNet. Our method establishes pose and expression shortcuts via ControlNet. Achieving a similar outcome with EMOCA/HopeNet would require learning an adapter compatible with their outputs, which would be less user-friendly than the presented approach.
We believe our extensive experiments sufficiently demonstrate the effectiveness of SR training in human-centric generation tasks. While our SR training strategy is inherently task-agnostic and broadly applicable, extending it to non-human subjects is beyond the scope of this work. As a team focused on human-centric personalization, we currently lack the resources (e.g., data and compute) and domain expertise to train adapters for general objects. Nevertheless, we hope our findings can inspire researchers in related areas to explore this direction further. For instance, training a background adapter (SR-BG) and incorporating it into object adapter training may help mitigate background and lighting leakage—potentially enhancing generalization in non-human scenarios.
Q2: Usage of FaceNet vs. ArcFace/AdaFace
We appreciate the suggestion to use ArcFace for identity evaluation. However, we note that competing baselines such as InfiniteYou and PuLID use ArcFace as their image encoder. To avoid coupling and biasing the evaluation, we chose FaceNet (i.e., an encoder not used by any of the compared methods) to ensure a neutral comparison. FaceNet was also used as a standard metric in a recent study on face swapping [Huang et al., 2024] (where face similarity is the focus).
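For reference, the FaceNet-based identity similarity can be computed with the `facenet-pytorch` package as sketched below; this is an assumed minimal setup, not our exact metric script.

```python
# Minimal sketch: cosine similarity between FaceNet embeddings of the input
# face and the face detected in a generated image.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                                # face detection + alignment
embedder = InceptionResnetV1(pretrained="vggface2").eval()   # FaceNet embedder

def face_embedding(path: str) -> torch.Tensor:
    face = mtcnn(Image.open(path).convert("RGB"))            # aligned 3x160x160 crop, or None
    if face is None:
        raise ValueError(f"no face detected in {path}")
    with torch.no_grad():
        return embedder(face.unsqueeze(0))[0]                # 512-dim embedding

sim = torch.nn.functional.cosine_similarity(
    face_embedding("input_face.png"), face_embedding("generated.png"), dim=0
)
print(f"identity similarity: {sim.item():.3f}")              # higher = better preservation
```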
Q3: Evaluation for identity preservation and expression control
We included LM-as-a-judge for several reasons.
Existing encoder-based face identity metrics (e.g., FaceNet, AdaFace, ArcFace) are limited to evaluating only facial features. They fail to assess other critical aspects of appearance, such as head shape and hair. To overcome this limitation and achieve a more comprehensive identity evaluation, we adopted LM-as-a-judge.
Furthermore, LM-as-a-judge has previously been employed in papers requiring custom metrics for controlled generation, as exemplified by Stylus [Luo et al., 2024], a work from last year's NeurIPS cited in our submission.
For evaluating expression, using LM-as-a-judge followed a similar thought process. Still, we really appreciate the suggestion of using existing blendshape prediction models like EMOCA for evaluation. We evaluated the similarity of expressions by inputting both prior and generated images into the EMOCA model and computing the cosine similarity of their respective EMOCA outputs. Prior images were created using the base FLUX model, without personalization. A higher cosine similarity indicates a greater resemblance in expression. Ideally, a personalization adapter should achieve a high value, demonstrating that it does not leak expression from the input image. The results are shown in the table below:
| Methods | InfU | PuLID | IPA | SR-LoRA | SR-CN |
|---|---|---|---|---|---|
| EMOCA Sim.↑ | 0.542 | 0.489 | 0.347 | 0.458 | 0.580 |
The EMOCA similarity results show that the IPA trained with SR-CN (i.e., factoring out the 2D keypoints) indeed achieves higher similarity with the expression in the prior generation. In terms of downstream impact, this indicates that such a model is more effective at adhering to the expression prompt, allowing users to create more diverse generations without being limited by a personalization adapter with expression leakage.
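For reference, the EMOCA-based expression similarity reported above boils down to a cosine similarity between predicted expression codes; the sketch below assumes a hypothetical `encode_expression` wrapper around the EMOCA repository (https://github.com/radekd91/emoca) that returns an expression parameter vector for a face image.

```python
# Illustrative sketch: expression similarity between the prior and generated
# images. `encode_expression` is a hypothetical wrapper around EMOCA that
# returns a 1D tensor of expression parameters for one image.
import torch.nn.functional as F

def emoca_expression_similarity(prior_path, generated_path, encode_expression):
    expr_prior = encode_expression(prior_path)
    expr_generated = encode_expression(generated_path)
    return F.cosine_similarity(expr_prior, expr_generated, dim=0).item()

# Averaged over all (prior, generated) pairs in the test set;
# higher similarity indicates less expression leakage from the input image.
```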
Q4: "Evaluation set".
Our evaluation set includes 32 identities from real-world photos, encompassing both studio images and in-the-wild selfies, as well as 146 diverse prompts covering 5 different test cases (see Appendix G for details). We thank the reviewer for the suggestion to evaluate on cartoon or drawing-style inputs, which is out of our training distribution—we view this as a valuable and interesting direction for future work (to include stylized inputs in training).
That said, we have included 10 stylized prompts with cartoon/drawing aesthetics to validate the capability of generating stylized images from photorealistic input faces. In these cases, our method (IPA with SR-CN) significantly outperforms both the baseline IPA (without SR-CN) and other state-of-the-art methods, InfU and PuLID. The baseline IPA consistently fails to produce stylized outputs, generating photorealistic images regardless of the prompt. While InfU and PuLID are capable of generating stylized images, they struggle to preserve identity. In contrast, our IPA with SR-CN successfully follows the stylized prompts and produces cartoonized images while maintaining identity fidelity. As images are not allowed in the rebuttal, we conducted a user study in which participants were asked to select their favorite generation among all methods:
| Methods | InfU | PuLID | IPA | IPA w/ SR-CN |
|---|---|---|---|---|
| User Preference ↑ | 6.7% | 16.7% | 6.7% | 73.3% |
Q5: “Qualitative results for the SR-LoRA method are not shown in either the main paper or the supplementary materials”
We would like to clarify that SR-LoRA qualitative results are included in the supplementary material (see Appendix Fig. 3, 4, and 5 with the column name “IPA w/ LoRA”). We apologize if this was not sufficiently highlighted in the main text. We will rename it as “IPA w/ SR-LoRA (Ours)” and make clearer references in the revision to guide readers to these results.
Additional Comment: Potential negative societal impact
We thank the reviewer for pointing out that there was no explicit discussion regarding the potential negative societal impact. We will add the following paragraph in the revision:
As our method improves identity preservation and expression control in personalized generation, it naturally raises concerns about misuse, particularly in the creation of hyper-realistic synthetic identities or deepfakes. We acknowledge that enhanced controllability and realism may lower the barrier for malicious use. To mitigate such risks, we advocate for responsible deployment practices, such as model watermarking, usage restrictions, and alignment with ethical frameworks for generative media.
In conclusion, we hope our responses have sufficiently addressed your questions and clarified the contributions and design choices of our work. We respectfully request you to consider updating your rating, as we believe the additional analyses and clarifications strengthen the case for acceptance. Should any further questions arise during the discussion period, we would be happy to engage in continued dialogue and refine our submission accordingly.
References
- [Luo et al., 2024] Luo, Michael, et al. "Stylus: Automatic adapter selection for diffusion models." Advances in Neural Information Processing Systems 37 (2024): 32888-32915.
- [EMOCA] https://github.com/radekd91/emoca
- [Huang et al. 2024] Huang, Ziyao, Fan Tang, Yong Zhang, Juan Cao, Chengyu Li, Sheng Tang, Jintao Li, and Tong-Yee Lee. "Identity-preserving face swapping via dual surrogate generative models." ACM Transactions on Graphics 43, no. 5 (2024): 1-19.
- [InfU code reference of using ArcFace] https://github.com/bytedance/InfiniteYou/blob/5fff05b3cba6c119857d62d9cc27a90bcc501187/pipelines/pipeline_infu_flux.py#L269
Thank you for the thoughtful and detailed response to my review. I have gone through my peers' reviews and have reviewed my own review and prior assessment.
Your answers have satisfied my questions and doubts. Especially, I now understand (also based on the supplementary materials) the suitability of LLM-as-a-judge for aspects that ArcFace or EMOCA cannot cover/assess. I also appreciate that you did the EMOCA similarity evaluations anyway and show better performance than the baselines. I believe that the qualitative improvements that you show are impressive as there is better adherence to both prior as well as input condition when compared to baselines.
I also appreciate that you followed-up on my off-hand comment on cartoons/drawings going as far as performing a user study to show the improvement of your method and shared a draft of a "potential negative societal impact" section with me.
Thank you again for your efforts and contributions.
Thank you very much for your thoughtful follow-up and kind words. We're really glad to hear that our responses and additional materials helped address your concerns, and we truly appreciate the time you took to re-evaluate our work during the discussion phase.
Your suggestion to incorporate EMOCA for evaluation was especially helpful. It not only strengthened our rebuttal but will also improve the final version of the paper.
If you feel that our clarifications and new results merit it, we politely invite you to consider updating your overall score in the final review. Thank you again for your thoughtful engagement and support!
This paper received overall positive reviews. Reviewers appreciated the interesting idea, solid results, and the enhanced clarity provided in the rebuttal. The AC concurred with the reviewers' consensus and considers this a solid contribution worthy of acceptance. Please integrate the promised improvements into the camera-ready version, including the title.
However, there is room for improvement. The paper would benefit from deeper analysis to draw more insightful conclusions and from broadening its scope beyond facial-related text-to-image generation. Currently, the method for decoupling attributes in generic images (beyond faces) remains unclear. Additional research and development are needed to strengthen the impact of this work, potentially making it suitable for spotlight or oral presentations.