FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
Abstract
Reviews and Discussion
The paper tackles the problem of single-shot fine-tuning of text-to-image models for diverse subject-driven renditions. It first discusses the problem of image-text alignment present in the few-shot fine-tuning paradigm for text-to-image models. It then presents FineStyle. More specifically, it introduces
- a novel data augmentation technique to synthetically increase the number of image-text pairs from just a single pair
- concept-oriented masking during the (parameter-efficient) fine-tuning phase
Strengths
- The paper identifies the problems present in the existing few-shot fine-tuning frameworks for text-to-image models.
- The proposed method is simple and is well demonstrated.
- The derivation of segmentation maps from cross-attention maps for concept-oriented masking is beautiful.
Weaknesses
- Minimal details are available on the pre-trained model being used. The paper just mentions MUSE without stating its capacity, and it similarly provides no details on the VQGAN being used.
- The paper lacks references to works that leverage parameter-efficient fine-tuning for controlled generation in the domain of text-to-image models. Some examples include [1], [2], and [3]. I believe this is relevant since the authors use parameter-efficient fine-tuning as well.
- FineStyle was demonstrated for masked models like MUSE. But the image generation community doesn't use MUSE that much. So, I am a little concerned about its adoption at scale. It would be very nice if the authors could also showcase some results obtained from applying FineStyle to open text-to-image models such as [4] and [5].
- Timing information would have been nice to include as this study aims to avoid the limitations of the iterative fine-tuning scheme introduced in StyleDrop.
References
[1] cloneofsimo/lora (GitHub repository), https://github.com/cloneofsimo/lora
[2] Using LoRA for Efficient Stable Diffusion Fine-Tuning, https://huggingface.co/blog/lora
[3] Implicit Style-Content Separation using B-LoRA, https://arxiv.org/abs/2403.14572
[4] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, https://arxiv.org/abs/2307.01952
[5] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, https://arxiv.org/abs/2403.04692
Questions
Suggestions
- Could Figure 1 be modified as follows? Each row would have a starting image (the style image can be overlaid on it, as it is currently), and its variants would be obtained by testing different aspects such as color, compositionality, etc.
- Figure 1 could also hint at the augmentation scheme used to scale up the dataset. Writing just a single image-text pair (as is the case right now) doesn't sound technically right, especially after reading about the nice data augmentation introduced in the paper.
- In Figure 2, for the bottom row, the content leakage isn't very evident to my eyes. Is it possible to include a stronger example that immediately establishes the point, like the former row?
- I included this point in the "Weaknesses" too but for clarity, I will include it here as well. The parameter-efficient fine-tuning scheme sounds extremely similar to LoRA [1]. It might be worth clarifying the differences if there are any. Additionally, I think it would be sensible to include the concurrent works that make use of parameter-efficient fine-tuning in the context of text-to-image generation (provided some references in "Weaknesses").
Questions
- There is human evaluation involved, yet checklist points 14 and 15 are marked as NA. Is this expected?
- Is using the pre-trained MUSE model sufficient to extract the segmentation maps from images coming from non-natural domains? If so, it might be worth mentioning it with a few examples.
- Could the image-text misalignment problem be mitigated if the text encoder were also fine-tuned? Since MUSE uses T5-XXL and already supports longer prompts, I believe this is worth comparing. Additionally, T5-XXL displays a fine-grained understanding of text, as shown in Imagen [2].
- Is FineStyle particularly effective for style images with multiple concepts? How about simple style images?
- 145 - 151: I like the approach. However, it appears tedious. Have the authors explored automating this using an LLM? If so, I would appreciate some results.
- Have the authors tried using/re-purposing the prior-preservation loss introduced in DreamBooth [3] in eqn. 4?
- As per equation 4, it seems like we need to keep two models in memory for sampling. This appears to be memory-expensive. Or do the authors just use one base model and enable and disable the adapter layers when needed (reference [4])?
- In Figure 4, in the last row, "Christmas decoration" is still present in the first example. Why is that? Is that a failure case?
- How is the notion of "unwanted concepts" implemented in practice with FineStyle? Is it similar to negative prompting (implemented through classifier-free guidance)?
References
[1] LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685.
[2] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, https://arxiv.org/abs/2205.11487.
[3] DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, https://arxiv.org/abs/2208.12242.
[4] StackLLaMA: A hands-on guide to train LLaMA with RLHF, https://huggingface.co/blog/stackllama#reinforcement-learning-from-human-feedback.
Limitations
No comment here.
Thank you for the insightful comments and suggestions! We appreciate them and will make the necessary revisions to ensure our work is presented with better accuracy and completeness.
W1: As detailed in Section 3 of our paper, Muse employs a cascaded design of generative modules. FineStyle only adapts the lowest-resolution model, which operates at a resolution of 256x256 and comprises 3B parameters. This model features a masked generative image transformer that serves as the image decoder and a VQGAN that converts images between pixel and latent space. The VQGAN encodes a 256x256 image into a sequence of 16x16 latent visual tokens for the image decoder to consume.
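For quick reference, here is a shape-level sketch of the stage FineStyle adapts; the dictionary below is purely illustrative and is not the actual Muse API or configuration format.

```python
# Shape-level sketch of the stage FineStyle adapts, as described above.
# The dictionary is illustrative only; it is not the Muse API or config.
muse_base_stage = {
    "pixel_resolution": (256, 256),   # lowest-resolution model of the cascade
    "latent_grid": (16, 16),          # VQGAN output consumed by the image decoder
    "image_decoder": "masked generative image transformer",
    "latent_codec": "VQGAN (pixel space <-> discrete latent tokens)",
}
# The 256x256 image becomes a 16x16 grid, i.e. 256 visual tokens per image.
assert muse_base_stage["latent_grid"][0] * muse_base_stage["latent_grid"][1] == 256
```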
W2: Thank you for your valuable feedback regarding the citation of related works. We acknowledge the oversight of not sufficiently referencing prior studies that utilize parameter-efficient fine-tuning in the domain of text-to-image models. Unlike the typical application of LoRA layers across each transformer block of the image decoder, our method employs a single main LoRA layer but modifies it with distinct biases for each transformer block. This adaptation not only reduces the number of trainable parameters but also aims to mitigate potential overfitting, a critical aspect of maintaining model generalizability. We will carefully revise the relevant sections of our paper to include these references and articulate how our adapter's design choices differ from LoRA's.
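To make the adapter design above concrete, here is a minimal PyTorch sketch of a single shared low-rank (LoRA-style) update combined with a distinct bias per transformer block. The class name, rank, and dimensions are illustrative assumptions and do not reproduce our exact kv-adapter implementation.

```python
import torch
import torch.nn as nn

class SharedLoRAWithBlockBias(nn.Module):
    """One shared low-rank update reused by all blocks, plus a per-block bias."""
    def __init__(self, dim: int, rank: int, num_blocks: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # shared across blocks
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)                  # start as a no-op update
        # A distinct bias per transformer block keeps the parameter count small
        # while still letting each block specialize.
        self.block_bias = nn.Parameter(torch.zeros(num_blocks, dim))

    def forward(self, h: torch.Tensor, block_idx: int) -> torch.Tensor:
        return h + self.up(self.down(h)) + self.block_bias[block_idx]

adapter = SharedLoRAWithBlockBias(dim=1024, rank=4, num_blocks=24)
h = torch.randn(2, 256, 1024)      # (batch, 16x16 visual tokens, hidden dim)
out = adapter(h, block_idx=0)      # adapted hidden states for block 0
```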
W3: See Figure 2 of the attached PDF. We showcase some results from SD3.
W4: Although iterative human feedback can enhance style fine-tuning performance by expanding the training dataset with human-annotated synthetic images, this approach is cumbersome, risky, and time-consuming. The limitations of human feedback include:
- Human intervention during the training process is necessary to curate a set of synthetic images, which incurs significant labor costs.
- The effectiveness of using synthetic data for iterative human feedback depends critically on the quality of the synthetic images produced by the style fine-tuned model. If the images do not meet a certain standard, performance may deteriorate. In practice, even the most advanced text-to-image models can struggle to accurately replicate the compositions of certain reference style images.
- The time required for labeling through human feedback is contingent on the quality of the synthetic training data; poorer quality requires more time.
Regarding timing, StyleDrop with one round of iterative human feedback takes at least three times as long as FineStyle, with additional labor costs.
Q1: The answer to checklist point 14 should have been Yes. We apologize for the mistake. For checklist point 15, given that the paper was approved for submission, we assume that the inclusion of the human preference study was also approved.
Q2: As demonstrated in Figure 3(d), using the pre-trained model's cross-attention weights as a segmentation map is not always accurate, as shown by the outliers in the bottom-right corner. We do notice that it can get worse for non-natural domains, possibly resulting in reduced fine-grained controllability. As future work, we plan to explore fine-tuning a pre-trained model to improve the cross-attention alignment between visual and textual tokens of the same concept, which could lay a better foundation for answering this question.
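For illustration, the following is a minimal sketch of how cross-attention weights could be turned into a binary concept mask by averaging over a concept's text tokens and thresholding; the shapes, normalization, and threshold are assumptions for this example rather than our exact procedure.

```python
import torch

def concept_mask(attn: torch.Tensor, concept_token_ids: list,
                 grid: int = 16, threshold: float = 0.5) -> torch.Tensor:
    """attn: (num_visual_tokens, num_text_tokens) cross-attention weights."""
    # Attention mass each visual token places on the concept's text tokens.
    score = attn[:, concept_token_ids].mean(dim=-1)                 # (256,)
    score = (score - score.min()) / (score.max() - score.min() + 1e-8)
    return (score > threshold).float().view(grid, grid)             # 16x16 binary mask

attn = torch.rand(256, 77)                        # dummy attention weights
mask = concept_mask(attn, concept_token_ids=[5, 6])
```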
Q3: This is a valid point, since the outputs of T5-XXL are fed directly into the cross-attention layers modified by our kv adapter. However, the downside is that T5-XXL contains more parameters than the Muse image decoder (4.6B vs. 3B), so tuning the full text encoder is computationally prohibitive for most users. We would like to study lightweight adapters for text encoders in future work.
Q4: FineStyle can effectively process simple style images based on straightforward descriptions. For example, the second and third rows of Figure 12 show the styles “rainbow color flowing design” and “bookshelf in watercolor painting”, which are simple concept compositions.
Q5: See Author Rebuttal and Figure 3 of the attached PDF.
Q6: We tried using a style descriptor, e.g. "in watercolor painting style.", as a prior-preservation anchor. It mitigates concept collapse to some extent, but we did not observe concrete evidence of improved fine-grained style controllability. Unlike the prior-preservation loss, which requires auxiliary training data, FineStyle requires no image other than the single style reference image.
Q7: Correct, the FineStyle adapter operates as a standalone layer with its own parameters. It is designed to be dynamically attached to or detached from the base model as needed, so only one copy of the base model needs to be kept in memory.
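As an illustration of this attach/detach behavior, here is a minimal sketch of wrapping a frozen base layer with a toggleable adapter; the wrapper class and the plain linear adapter are placeholders for this example, not our actual module.

```python
import torch
import torch.nn as nn

class ToggleableAdapter(nn.Module):
    def __init__(self, base_layer: nn.Module, adapter: nn.Module):
        super().__init__()
        self.base_layer = base_layer   # frozen pre-trained projection
        self.adapter = adapter         # fine-tuned adapter parameters
        self.enabled = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base_layer(x)
        if self.enabled:               # disable to recover base-model behavior
            out = out + self.adapter(x)
        return out

layer = ToggleableAdapter(nn.Linear(1024, 1024), nn.Linear(1024, 1024, bias=False))
layer.enabled = False                  # base model only; no second model in memory
_ = layer(torch.randn(1, 256, 1024))
```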
Q8: We apologize for the confusion caused by the double negative in listing "tree WITHOUT christmas decoration" under unwanted concepts. We will rephrase this and list "bare tree" under unwanted concepts instead. The first example is NOT a failure: based on the synthesis prompt, we expect the generated trees to contain some form of Christmas decoration. In contrast, StyleDrop mode-collapses to the exact pine tree of the reference image.
Q9: We don't use negative prompting during inference. The "Unwanted concept" column in Figure 4 is designed to demonstrate concepts that are implicitly inferred from the synthesis prompt. Our preliminary experimental findings show that negative prompting at inference time does not effectively counter mode collapse, which is what prompted us to seek an efficient fine-tuning strategy that can better disentangle the style and visual elements of a subject.
Glad that LLMs helped! Looking forward to seeing this section in the main paper. Additionally, it might have been better to try it out with an open vLLM.
[W1]
I agree that Section 3 has details on MUSE, but I think there is room for improvement. For example, you could specify the MUSE variant you used in your work.
3B parameters for a model generating 256x256 images seems significantly high, given that other (diffusion) models such as SDXL [1], Stable Diffusion 3 [2], etc., can operate at 1024x1024 resolution with a significantly lower number of parameters. I think this questions the use of such a pre-trained backbone because it hinders applicability. So, it might have been prudent to consider a more realistic model that justifies its operating resolution and parameter count. Furthermore, I think this is a fair comparison with latent diffusion models because your work also features a VQGAN operating on the latent space.
(I stand with my general point here i.e., the use of a better backbone with better applicability compared to MUSE. I acknowledge that I have seen Figure 2 of the Rebuttal Document that shows SD3 [3] results.)
Was the original VQGAN from [4] used? If not, I think the differences deserve a place in the main text.
[W2]
Agreed with the overall point conveyed. Thank you.
[W3]
Could this be reflected more quantitatively in the paper? I think this aspect of your work is equally important to highlight more explicitly.
For the answers provided to my questions, I would like to suggest the authors to consider including some of those details in the paper. For example, Q7.
References
[1] Podell et al., 2023, SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
[2] Esser et al., 2024, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
[3] Esser et al., 2020, Taming Transformers for High-Resolution Image Synthesis
We thank the reviewer again for supporting this work and for their insightful suggestions in the first response, which helped us identify new areas of improvement: leveraging multi-modal LLMs such as GPT-4o for automated sub-prompt generation. We believe FineStyle will offer greater benefits to the AI/art community by integrating these additional technologies. We also thank the reviewer for the new comments and suggestions on the details and backbones; we answer them below and will also integrate them into our revision!
More details on Muse
Why did we use MUSE?
We used MUSE for this work because our major baseline, StyleDrop (published at NeurIPS 2023), was built on top of it. We leveraged the same backbone to ensure our study is comparable and thorough. The StyleDrop paper did not use other open-source text-to-image backbones, and there is no official codebase for us to compare with. A caveat about this line of research, which the reviewer may already be aware of, is that a series of influential works such as DreamBooth and StyleDrop were never open-sourced and were built entirely on closed-source models like Imagen and MUSE, which limits their impact and follow-up work in the community.
W1(1): I agree that Section 3 has details on MUSE, but I think there is room for improvement. For example, you could specify the MUSE variant you used in your work.
Since the official Muse model we obtained for this research is not openly accessible, it was our oversight not to clarify its details sufficiently in the submission and rebuttal. The Muse model variant we used is "tigg_v2_5_3b_512". It has several sub-models: a pair of low-res and high-res VQGANs operating at 256x256 and 512x512 resolution respectively, a base transformer for decoding low-res image tokens at 256x256, and a super-resolution transformer for translating low-res image tokens to high-res tokens at 512x512. Together, the sub-models comprise the 3B parameters, with the low-res base transformer containing the bulk of them. FineStyle only adds adapters to the low-res base transformer, and we ensured the exact same configuration in our StyleDrop-related experiments for a fair comparison and a better understanding of the effects of our proposed method.
W1(3): Was the original VQGAN from [4] used? If not, I think the differences deserve a place in the main text.
We follow the exact setup of StyleDrop/MUSE and both VQGAN models used were trained on internal text-to-image datasets, which differ from those used in [3]. We will explicitly note this difference in the main text.
Overall, we will definitely include more relevant details from the answers to the review questions in the revised version of this paper or its supplementary material.
More on Backbones and Our Commitment to Open-Source
W1(2, modified): add comparisons with latent diffusion models
Yes, we agree! While our baseline method StyleDrop did not compare with latent diffusion models, we strongly agree with the reviewer's suggestion and plan to include such a comparison in the revised version. We are actively working on implementing FineStyle with the latest open text-to-image models, including SD3 [2] and Flux [1]; for SD3, we were able to show initial results in the rebuttal PDF. In future revisions, we will open-source our implementation and include additional quantitative results, such as CLIP scores and human evaluations.
Lastly, we want to emphasize our commitment to open source, which distinguishes this work from some related prior works, including StyleDrop. We are committed to open-sourcing this project to contribute to the AI community, and that is exactly why we spent a significant amount of extra time and effort in the summer after the NeurIPS submission deadline to rewrite the entire codebase and run many experiments on the most recent state-of-the-art open models such as SD3 [2] (released in July and published at ICML 2024 over the summer, though not fully open on training code). We are also exploring the most recent powerful Flux series of models [1] (released August 1, no training code), and we hope to incorporate these open models in the open-source project to contribute to the AI + art/design community. To this end, we fully agree with the reviewer on the importance of utilizing more applicable and openly accessible backbones for implementing and evaluating FineStyle, and we hope our efforts here are evident and will gain the reviewers' support.
(In any case, even if this work is unfortunately rejected by NeurIPS, we may still wrap up and open-source this project before moving on to other research topics, since we believe in its value to the community.)
Thank you!
[1] Flux, https://github.com/black-forest-labs/flux
[2] Esser et al., 2024, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
[3] Esser et al., 2020, Taming Transformers for High-Resolution Image Synthesis
I acknowledge your reply and truly appreciate the effort you took to explain each and every point in detail. I truly respect your willingness to make your work open so that the entire community can benefit from it. I think it's safe to say that it was because of open-source codebases like LDM [1] that the diffusion community, in particular, progressed very rapidly over the past few years. So, your commitment towards open source is quite endearing, and I am sure it will be helpful.
I have nothing further to add and I offer the authors my best wishes.
References
[1] Rombach et al., 2022, High-Resolution Image Synthesis with Latent Diffusion Models
This paper proposes a few-shot fine-tuning paradigm called FineStyle for controllability-enhanced style personalization that requires only a single reference image. A concept-oriented data scaling scheme and a parameter-efficient adapter are two key components of the proposed method to achieve this goal.
Strengths
- This paper is well-motivated and well-organized.
- This idea of scaling the number of training images by creating multiple sub-images and corresponding separate concepts is interesting and reasonable.
- The controllability of the proposed method is good.
Weaknesses
- Only one baseline model (i.e., StyleDrop) is adopted for comparison with the proposed method. Many highly related SOTA methods are not introduced or compared in this paper, such as DreamStyler [1*], ControlStyle [2*], StyleAligned [10], and IP-Adapter [40]. They can perform the same task as the proposed method.
[1*] DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models. AAAI 2024.
[2*] ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors. ACM MM 2023.
- The proposed method requires a very detailed and complex text description for each input style reference image, which is inconvenient and needs human intervention. Moreover, as the authors say, 'it is often challenging to faithfully describe the visual look of a style in pure text form.' In contrast, many other methods do not require an additional text description of the input style image, such as DreamStyler [1*], ControlStyle [2*], StyleAligned [10], and IP-Adapter [40].
- The proposed method is inferior to the previous method StyleDrop in style learning, which can be observed from both the qualitative and quantitative results.
- Detailed information about the human evaluation is not provided. How many image-text pairs and participants are involved in the conducted human evaluation? In addition, the sum of the user preference proportions reported in Table 2 is not 1.
- I am curious about the running time of the proposed method. Is it comparable or superior to previous methods in speed at inference?
Questions
Please see Weaknesses.
Limitations
The limitations and potential negative societal impact are discussed in the supplementary material.
Thank you for your comments, particularly regarding the need for comparison with extra baselines. We appreciate the opportunity to add new baseline results and clarify the advantage of our method.
W1: See Figure 1 of the attached PDF. We include results for DreamStyler, StyleAligned, and IP-Adapter. To the best of our knowledge, there is no openly available code for ControlStyle. We use SDXL as the base model for StyleAligned and IP-Adapter, while DreamStyler only supports SD 1.5. Both StyleAligned and IP-Adapter show symptoms of content and structure leakage. The bay and the mountains keep appearing in the first row, even when they are not in the prompts. For StyleAligned, the bay water is fixed at the bottom left of the generated images. In the second row, IP-Adapter hallucinates the mountains and the moon from the reference. In the third row, IP-Adapter almost copies the texture and shape of the house, which does not look like an office building. StyleAligned correctly follows the semantics in the third and fourth rows but does not quite follow the target styles. DreamStyler, on the other hand, fails in both semantic and style consistency and is of lower quality, which is partially due to the less powerful SD 1.5 base model.
W2: Our method utilizes detailed text descriptions to empower users to control fine-grained styles. This is a key distinction from methods like StyleAligned, IP-Adapter, etc., which interpret the style of an image as a whole. Although crafting detailed descriptions may require an extra step, it gives users additional fine-grained control over how much granularity they want in a prompt, rather than requiring them to exhaust every detail of a reference image. Thus, users can focus on the style aspects they wish to capture, making our method adaptable to different user needs. Furthermore, detailed descriptions can also benefit other methods requiring image inversion, such as StyleAligned. As shown in Figure 4 of the attached PDF, when the simple inversion prompt (a) is "a tall house in watercolor painting style.", the gable roofs of the reference leak into the synthesized images, which is uncommon for an office building. In contrast, when a detailed prompt (c) containing "gable roof," etc. is used, the problem is alleviated. We also use a control prompt (b) containing a "<random token sequence>" to make the length of the tokenized sequence the same as (c). This demonstrates that it is the "gable roof" phrase in the detailed description that improves the result, rather than any random, longer description.
W3: We acknowledge the discrepancies in certain results between StyleDrop and FineStyle in terms of style learning.
In the Eiffel Tower example of Figure 6, StyleDrop exhibits a darker blue tone, whereas FineStyle presents a lighter one. In fact, the reference image features a range of tones from dark to light, evident in various elements such as the bookshelf, plant leaves, pot, and laptop. This tone spectrum informs the lighter blue color in FineStyle. While StyleDrop tends to present a darker tone inherited from the laptop, it does not capture a legitimate composition of the reference, particularly in the Eiffel Tower example with its recurring leaves.
As shown in the quantitative human evaluation of Table 2, although FineStyle scores lower on Style than StyleDrop, it has a significant edge over StyleDrop in the Text and Structure/Common Sense scores. In our human study, we found that among the image pairs where StyleDrop was deemed to have superior style, raters found the FineStyle image to follow the prompt best 73% of the time, found both methods to follow the prompt equally well 27% of the time, and never (0% of the time) rated the StyleDrop image as better at following the prompt. The same holds along the axis of compositional structure and common sense: among the image pairs where StyleDrop was deemed to have superior style, raters preferred the FineStyle image for structure 81% of the time, found both methods equally good 19% of the time, and never (0% of the time) rated the StyleDrop image as better at structure and common sense.
We think that StyleDrop sacrifices composition for style, which can often produce images far from the intended prompt. These findings underscore that FineStyle is the better-balanced method, combining style learning and composition with fine-grained style controllability.
W4: In our study, we recruited 14 participants, each of whom evaluated 24 image-text pairs of distinct styles. This resulted in a total of 336 evaluations, ensuring a robust dataset for evaluation. Regarding your observation that the user preference proportions in Table 2 do not sum to 1, the proportions sum to 0.998, 0.999, and 0.999 across the different comparisons. This slight discrepancy arises from presenting the figures without rounding, to maintain numerical precision. We recognize the need for clarity and will address this in the revised version of our paper, ensuring all totals are rounded appropriately.
W5: Regarding the inference speed, it is true that our method incorporates an additional 0.15M adapter parameters compared to StyleDrop. However, when considered in the context of Muse's overall 3B parameters, this increase is relatively minor. Consequently, FineStyle maintains an inference time that is comparable to that of StyleDrop, ensuring that the enhanced control over style elements does not come at the cost of efficiency.
Existing style-tuning-based style transfer methods often result in content leakage because style and content are coupled. To address this, FineStyle proposes decomposing style and content concepts in images and fine-tunes a kv adapter in the cross-attention layers of MUSE. FineStyle demonstrates better fine-grained control in visual results compared to other methods.
Strengths
1. FineStyle focuses on fine-grained style transfer and achieves excellent control effects, showcasing the future potential of style transfer.
2. This paper is well written and easy to follow.
3. The experiments are comprehensive, and the appendix provides detailed supplementary information.
4. This paper shows that the kv adapter on cross-attention is better than the feat adapter on hidden states for more fine-grained style or content control.
Weaknesses
1. The authors mention in Section 4.1 that T2I models require a very large image-text dataset for concept learning. Therefore, some style-tuning methods use human feedback to scale up the dataset to achieve better learning outcomes. However, the differences between these two datasets are still substantial. For instance, human feedback often adds only a few images, whereas the dataset for a large T2I (text-to-image) model typically contains millions of image-text pairs. Although scaling up such a small dataset can theoretically reduce overfitting and enhance the style transfer effect, it is insufficient to achieve the concept learning emphasized by the authors.
2. The authors mention in line 50 that the adapter is fine-tuned using clearly defined pairs of content and style concepts, anticipating the learning of associations between text and image concepts. However, in the methods section, this is described merely as a data augmentation technique. The CLIP scores for variant (a) in Table 1 also suggest that this data augmentation technique does not significantly enhance performance relative to StyleDrop. Furthermore, the paper lacks a cost/time analysis for the concept division.
3. This paper demonstrates that the kv adapter is more effective than the feat adapter in providing fine-grained control. However, it lacks additional experimental analysis to substantiate this claim, such as visualizations of attention maps.
4. The concept-pair data scaling up seems unnecessary, which reduces the innovativeness of this paper.
Questions
1. I observed that StyleDrop, as shown in Table 1, did not employ data scaling. Was human feedback omitted in this case? It seems essential to include a comparison involving human feedback for both StyleDrop and FineStyle, along with the concept scaling up discussed in this paper, to evaluate the contribution of each component.
2. I observed that FineStyle does not align with the reference style as closely as StyleDrop does. For instance, regarding color, the Eiffel Tower in FineStyle depicted in Figure 6 is blue, whereas the reference is purple. Concerning the painting style, the oil painting by FineStyle in Figure 7 is distinctly different from the reference. What could be causing these discrepancies? Could it be due to improper concept-pair data scaling techniques or the kv adapter?
Limitations
The authors have already discussed limitations and societal impact in the appendix.
Thank you for your insightful and detailed reviews. We appreciate the opportunity to address the weaknesses and questions raised and to clarify aspects that may have been unclear.
W1: The pre-training and human feedback datasets exhibit significant size disparities, which leads to differing methodologies for concept learning. Pre-training with a large dataset requires a vast amount of time and resources for careful curation. In contrast, style fine-tuning with human feedback requires significantly less time and fewer resources: collecting a few images of the same style but with varied content is enough to learn a model with a custom style. Nevertheless, this requires human intervention for every new style, which can be cumbersome. We believe these two regimes (large, once-for-all training vs. few-shot, case-by-case fine-tuning) exhibit a trade-off. In this paper, we aim to match the training efficiency and remedy the limitations of the human feedback dataset with the proposed data-scaling and concept-weighted loss strategies.
W2: In our approach, we not only scale up the single text-image pair into multiple pairs, each with a concept-oriented text, but also re-weight the loss computation with concept-oriented masking, as detailed in line 152. This strategy boosts the learning of associations between text and image concepts through the gradient flows stemming from the concept-oriented texts and masking. Utilizing masks derived from cross-attention weights naturally complements the use of a kv adapter over a feature adapter. According to Table 1, the combination of data scaling and the kv adapter yields the highest CLIP text scores. Concept division is efficient in both time and computation. After writing a prompt, a user can quickly annotate the desired concept words within the text, typically in only a few seconds. The concept mask is then generated by a single, zero-mask pass through the Muse model, during which the attention weights corresponding to the annotated concepts are retrieved.
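As an illustration of the loss re-weighting described above, the sketch below weights a per-token loss with a binary concept mask; the shapes, codebook size, and background weight are illustrative assumptions, and this is not our exact objective.

```python
import torch
import torch.nn.functional as F

def concept_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                          mask: torch.Tensor, bg_weight: float = 0.1) -> torch.Tensor:
    """logits: (N, vocab), targets: (N,), mask: (N,) binary concept mask."""
    per_token = F.cross_entropy(logits, targets, reduction="none")   # (N,)
    weights = mask + bg_weight * (1.0 - mask)    # emphasize the concept's visual tokens
    return (weights * per_token).sum() / weights.sum()

logits = torch.randn(256, 8192)                  # 16x16 visual tokens, assumed codebook size
targets = torch.randint(0, 8192, (256,))
mask = (torch.rand(256) > 0.5).float()           # would come from cross-attention in practice
loss = concept_weighted_loss(logits, targets, mask)
```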
W3: Theoretically, data scaling with concept-oriented masking works better with the kv adapter because concepts are activated during cross-attention between visual and textual tokens. Empirically, as demonstrated in Table 1, the strategy achieves the best CLIP text scores.
W4: Concept data scaling and concept-oriented masking are designed to complement each other effectively, as discussed in the response to W2.
Q1: In practice, human feedback is undesirable due to human intervention requirements. While it can enhance style fine-tuning, it comes at a prohibitively high cost and a risk of performance deteriorating with low-quality synthetic data. For a fair comparison, we have excluded the human feedback component of StyleDrop, recognizing that including human feedback in our comparative analysis could obscure the individual contributions of each component. In this work, our goal is to develop an efficient method that facilitates improved concept learning without the need for costly human feedback.
Q2: For Figure 6, the reference image features a range of hues from blue to purple, evident in various elements such as the bookshelf, plant leaves, pot, and laptop. This spectrum informs the lighter blue color of the Eiffel Tower in the FineStyle result. While StyleDrop tends to present a darker tone, it does not capture a legitimate composition of the reference, particularly in the Eiffel Tower example with its recurring leaves.
Regarding Figure 7, a similar principle applies. The oil painting style produced by FineStyle, though different, draws from the diverse stylistic cues within the reference. Theoretically, enhancing our style dataset with a broader range of subjects and finely annotated concepts could allow for more precise control over such discrepancies through tailored prompts. This suggests a need for improved methods in explicitly regulating each concept component in our KV adapter, which could address the issues you highlighted.
For W3: I hope the authors provide visualizations to prove that the kv adapter truly focuses on the conceptual content (as claimed in their paper), rather than the gains coming from the data augmentation methods and concept masks. I noticed in Figure 6 that FineStyle generated an Eiffel Tower with a blue circle even without the prompt word 'circle'; and in Figure 4, parts of a mountain appeared. However, the authors have ignored this point and have not responded to my question.
For Q1: The authors claim that a 'fair comparison' is that StyleDrop does not use data scaling while FineStyle uses data scaling. I do not think such experimental results are convincing.
For Q2: The authors claim that the color differences in Figure 6 depend on the spectrum, but for the different styles in Figure 7, they state that a similar principle applies. How does the spectrum of color apply to clearly different styles? In my view, if the effectiveness of style transfer cannot be ensured, then the significance of the proposed method is limited. Theoretically, StyleDrop could achieve fine-grained control and maintain a better standard of style transfer by more precise and carefully designed human feedback and prompts.
For Q1: In fact, the Iterative Training with Feedback in StyleDrop is another form of data scaling. For the only baseline method compared in the paper, training it without feedback is highly unreasonable and leads to an unfair assessment. Moreover, incorporating iterative feedback training does not require much cost. I do not understand why the authors are unwilling to provide the corresponding experimental results.
For W3 & Q2: The authors also acknowledge their insufficient experiments (visual analysis of the kv adapter) and inferior style transfer results. Why do the authors claim in the rebuttal (Q2) that 'Theoretically, enhancing our style dataset with a broader range of subjects and finely annotated concepts could allow for more precise control over such discrepancies through tailored prompts,' but then not provide actual improved visual results? This also makes me doubt their responses.
In the rebuttal, the authors only repeatedly emphasize the contributions already stated in their paper, but they do not address the issues I raised, which would require very little cost to address and are important for this paper. Overall, I will decrease my rating.
We thank the reviewer for the additional clarifications and comments, and we address them as follows.
W3 (1): This work demonstrated that the KV adapter is more effective in performance but lacks visualization to prove that the KV adapter truly focuses on the conceptual content.
We apologize; it was indeed our oversight not to address this question as clearly as requested. Our previous answer mostly focused on emphasizing the performance improvement and our intuition. We do not claim in the paper that the performance enhancement from the kv adapter necessarily has a concrete causal relation with the attention-weight changes for conceptual content. Our current observations show that there could be a strong connection, but we cannot be certain this holds in all cases; we are, however, certain about the performance enhancement and our intuition. We thank the reviewer for this suggestion and will try to add further ablations and visualization comparisons in the revision to study and clarify whether our kv adapter is also a robust and visually interpretable approach, beyond its performance and intuition.
W3 (2): In Figure 6, FineStyle generated an Eiffel Tower with a blue circle even without the prompt word 'circle'; and in Figure 4, parts of a mountain appeared.
The comment regarding Figure 6 was a misunderstanding, likely due to our terse caption for Figure 6: the prompt word "circle" is actually part of the style descriptor, so the blue circle generated with the Eiffel Tower is expected. We assumed readers would refer to Figure 3 for the full prompt when checking Figure 6; the style description is "in flat cartoon vector art inside a light blue circle on a white background." Due to space limits, we only included incremental changes in the prompt, such as "....eiffel tower...". We will include the full prompts in Figure 6 in our revision to clarify this.
As for Figure 4, while there is a small green corner region that could be viewed either as trees or as a green forest/mountain, at the viewer's discretion, we believe it is very different from the light grayish-brown stony mountain in the original reference image, which StyleDrop reproduced. We hope the reviewer will agree with us that this could be considered a successful example of our method, or at least an imperfect but effective one :-)
Q1: The authors claim that a 'fair comparison' is that StyleDrop does not use data scaling while FineStyle uses data scaling. The reviewer does not think such experimental results are convincing.
We did mention a "fair comparison", but that referred only to our decision to exclude the human feedback component of StyleDrop so we could better understand the difference between the two methods. For context, when conducting this work, we discussed iterative human feedback with the authors of StyleDrop. They recognize its limitations in practice: expensive labor costs, extra annotation time, dependence on the quality of the synthetic images, and the risk of performance deterioration due to human selection bias. The last point is also noted at the end of Sec. 3.3 of the StyleDrop paper. These limitations are what motivated us to seek a way of improving fine-grained control without iterative human feedback.
As for whether it is fair to compare FineStyle (with data scaling) and StyleDrop (without data scaling), we beg to differ with the reviewer. Data scaling, along with concept-oriented masking, is part of the contribution of our proposed work; we compared against StyleDrop as a baseline and did not add data scaling to this baseline because it was not part of the original method. It is reasonable to assume that data scaling would also help StyleDrop, and combining our proposed work with StyleDrop could potentially produce a StyleDrop V2 with fine-grained control and no need for iterative human feedback. In theory, we could also add iterative human feedback to both FineStyle and StyleDrop in a controlled setting to further study the capacity of FineStyle, and we intend to explore this in future revisions.
(to be continued)
Q2: The authors claim that the color differences in Figure 6 depend on the spectrum, but for the different styles in Figure 7, they state that a similar principle applies. How does the spectrum of color apply to clearly different styles? In my view, if the effectiveness of style transfer cannot be ensured, then the significance of the proposed method is limited. Theoretically, StyleDrop could achieve fine-grained control and maintain a better standard of style transfer by more precise and carefully designed human feedback and prompts.
By "similar principle", we did not mean that the spectrum of color applies to clearly different styles. We simply meant that, just as a range of different colors contributed to the fine-grained control of FineStyle in the Figure 6 example, the reference image in Figure 7 features a range of fine-grained style elements (e.g., different colors, textures, shapes, and other stylistic cues), and the oil painting image produced by FineStyle also draws from these diverse stylistic cues within the reference.
We understand the reviewer's concern that it could appear to some viewers (perhaps including the reviewer) that the global style is better captured by StyleDrop than by FineStyle in the oil painting example, and that this could imply a limitation of FineStyle in terms of style transfer compared to StyleDrop. In our defense, the perception of what constitutes a good style transfer is quite subjective here, especially since FineStyle focuses on fine-grained style elements while StyleDrop focuses on the global style of the reference image. While our proposed method is not perfect, we believe it clearly achieves its goal of enabling fine-grained controllability with proper style preservation. However, we also acknowledge the reviewer's insight that we could potentially further improve our method to better preserve global styles if that is what users desire, and we leave this to future extensions of this work.
Finally, while we beg to differ that StyleDrop is stronger than FineStyle in style transfer for the reasons we mentioned above, we agree that both StyleDrop and our FineStyle could theoretically achieve better fine-grained control with additional iterative human feedback and carefully designed prompts. This, however, again goes back to part of the motivation of our proposed work that we want to gain fine-grained control without the need for iterative human feedback, which has a set of limitations we discussed above when addressing Q1.
Overall, we thank the reviewer for sharing their insight to help us to improve this work and we hope we have clarified some of the misunderstandings and we will adopt the valuable feedback from the reviewer in our revision.
We are extremely disappointed that reviewer VFWr has completely ignored our contributions and rebuttals, even though we politely and professionally pointed out the misunderstandings on the reviewer's side, including their misjudging of the prompt in Figure 6 and their reading of Figures 6 and 4 (which were not even that important to begin with: because our blue is lighter than StyleDrop's, is our generation worse? The prompt word "circle" was there, so generating it in Figure 6 is absolutely reasonable; the small green corner of trees in the background is not the brown rocky mountain in the reference, and our results are simply effective ...)
We also want to emphasize that we never acknowledged our style transfer to be inferior; which one is better is simply a difference in subjective evaluation between reviewer VFWr and the authors, which again supports our claim that iterative human feedback can be problematic, as pointed out by the original StyleDrop authors. We hope the other reviewers and the AC will examine this and provide their own subjective evaluations and preferences as well, whereas reviewer VFWr's subjective evaluation is clearly not in favor of this technical work and amounts more to nitpicking subtle visual differences from their own perspective, based on misunderstandings.
We also do not agree with the unreasonable standard set by reviewer VFWr that we must add these attention visualizations to prove that our method works: our method evidently works, showing performance improvements and effective visual output, and we carefully explained the intuition behind it. While, as we agreed, we could add visualizations to demonstrate this point in an additional ablation, they are neither strictly required to prove that our proposed method works (it works!) nor particularly rigorous (we believe the consensus in the ML community is that visualization is generally a nice tool to have, but it is often less robust, can be easily cherry-picked, and can carry strong bias).
Removing iterative human feedback training is a desideratum stated in the original StyleDrop work, and we proposed a valid approach to achieve it. We are extremely disappointed by the reviewer's response, which strongly downplays our contributions and holds our work to unreasonable standards despite their misunderstandings, while dismissing our improvements over a strong prior work, StyleDrop.
We have shown our best intentions in our last few responses to reviewer VFWr, but we also want to remind the reviewer that we believe they do not understand the NeurIPS review standards:
"3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and/or incompletely addressed ethical considerations." and "6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations."
We will stop engaging with reviewer VFWr. We have put a significant amount of time and effort into this work, and we are confident in its contributions and its applicability to users. We leave this indignant but frank feedback with the sincere hope that the other reviewers and ACs will review the conversation above and judge the paper fairly on its own merits.
Thank you all! (This includes reviewer VFWr for their reviewing effort. We provide this frank and fair feedback to everyone, instead of sending it only to the AC, in the hope of conveying our point in the spirit that conversations in the ML community should be unbiased, professional, and bi-directional. That is the only way our community can move forward.)
Reviewer PDcY, Q5
We agree that using an LLM can improve efficiency and reduce human work. We therefore used an internal multi-modal LLM, prompting it with the image from Figure 3 and a prompt outlined below. The output shows that with a fairly simple prompt to a multi-modal LLM, we can automate the original image description, including the entities and the style description. The procedure generates an overall caption for the image that includes all entities contained in the image along with the style descriptors, and the model then outputs captions for K=3 entities, each with the style descriptor as well. We plan to show the results of this procedure in the paper, as it would eliminate any work for a human user, and we thank the reviewer for the suggestion.
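Below is an illustrative sketch of how such a query could be assembled; the wording of the instruction, the value of K, and the call_vlm helper are assumptions for this example and do not reproduce our exact prompt or the internal model's API.

```python
K = 3  # number of per-entity captions requested, as in our experiment

instruction = (
    "Describe this reference image with one overall caption that lists the "
    "main entities and ends with a style descriptor. Then write "
    f"{K} short captions, one per entity, each ending with the same style descriptor."
)

def call_vlm(image_path: str, prompt: str) -> str:
    # Hypothetical helper: send the image and prompt to a multi-modal LLM
    # endpoint and return its text reply; the real endpoint is not shown here.
    raise NotImplementedError

# captions = call_vlm("style_reference.png", instruction)
# Each returned caption then forms one concept-oriented image-text pair for fine-tuning.
```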
This paper presents fine-grained controllable style generation for text-to-image models. The paper is well written, and the generated images are convincing. It is demonstrated that even the state-of-the-art method, StyleDrop, suffers from content leakage, as shown in Fig. 2. In a few-shot learning scenario, associating visual concepts with the corresponding text phrases is a difficult problem. To solve this problem, concept-oriented data scaling, a new synthetic data augmentation method, is proposed, as shown in Fig. 3.
Although one reviewer is not satisfied with the experimental validation, after taking all the reviewers’ comments and scores, generated images, and human evaluation into consideration, I would like to recommend acceptance of this paper.