Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models
Abstract
Reviews and Discussion
The authors propose an efficient personalization method using LoRA for on-device text-to-image diffusion models. The method relies on removing non-essential layers in the U-net, reducing GPU usage during fine-tuning.
Strengths
- The paper is well written.
- Discarding a few layers in the U-net copy architecture returns similar results in generation.
Weaknesses
- The whole paper rests on the assumption drawn from Fig. 2 that, after fine-tuning, the outer layers of the U-Net are more affected than the inner ones. I call it an assumption because I cannot find anywhere in the paper which data these statistics were computed on, nor any further explanation of them.
- Based on this, the authors optimise only the outer layers with LoRA and ignore the inner ones. The efficiency gain shown in Table 1 is relatively small with respect to standard LoRA over the whole model, mainly because they are excluding the less computationally costly layers from the fine-tuning.
- The idea is similar to ControlNet (https://arxiv.org/abs/2302.05543) combined with LoRA, with a few layers excluded from the cloned copy. I do not see much novelty in it.
Questions
Maybe the evaluation on 30 images is a bit too small, line 241. Can the authors evaluate on a larger dataset to compare the performance?
Limitations
The output quality might be affected by the reduction of the layers. Removing the middle-layers means that you are excluding long range image editing (e.g. the background editing might be affected). A further investigation on a larger dataset might be required.
Thank you for your thoughtful feedback. Please note our top-level comment with additional experimental results. Below we address specific questions.
- Clarification on assumption and experimental details
The experiment depicted in Figure 2 of the manuscript, as detailed in Section 4.1, was conducted to identify less significant layers within the U-Net by analyzing the changes in LoRA weights before and after personalization. Specifically, we measured the extent of weight changes across different blocks of the U-Net, where smaller weight changes in certain layers after personalization indicate that these layers are less important for the personalization process. This assumption is inspired by previous work, notably Li et al. [22], Kumari et al. [7], and Shah et al. [6], which demonstrated similar analyses of weight changes.
Figure 2 of the manuscript illustrates the results obtained when personalizing the "monster toy" subject from the DreamBooth dataset, starting from the pretrained weights of the Stable Diffusion V2.1 model. We injected LoRA into various layers of the U-Net's attention and ResNet blocks, and then fine-tuned the model for 1000 steps with a learning rate of 1e-4, using a LoRA rank of 128. We observed that the overall trends in weight changes remained consistent across different learning rates, steps, LoRA ranks, LoRA injection locations, and personalizing subjects.
Additionally, to substantiate our findings, we conducted further experiments, the results of which are described in Figure 1 of the attached document. We analyzed more diverse subjects (30 subjects from the DreamBooth dataset and 101 subjects from the CustomConcept101 dataset) and across different models (Stable Diffusion V2.1 and SDXL). The x-axis of each plot represents the specific U-Net blocks, while the y-axis shows the changes in LoRA weights before and after personalization. For each dataset, we averaged the weight changes across subjects and provided error bars to indicate the statistical variance within each dataset.
Similar to the original experiment in Figure 2 of the manuscript, these extended experiments revealed that weight changes around the central blocks tend to be minimal, confirming that these layers are less involved in personalization. This suggests that our assumption holds across various personalization tasks and models. We plan to include these additional findings in the supplementary materials.
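As a rough illustration of the kind of per-block analysis described above, the sketch below sums the Frobenius norm of each LoRA update `B @ A` by U-Net block. It is not the authors' script: the exact metric behind Figure 2 is not specified in this thread, and the layer names, the `"<block>.<rest>"` naming scheme, and the helper functions are hypothetical. It also assumes the common LoRA initialization in which `B` starts at zero, so that `B @ A` after training equals the full change in the effective weight.

```python
import math
from collections import defaultdict


def matmul(B, A):
    """Multiply B (d x r) by A (r x k), both given as nested lists."""
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]


def frobenius(M):
    """Frobenius norm of a matrix given as nested lists."""
    return math.sqrt(sum(v * v for row in M for v in row))


def block_weight_change(lora_layers):
    """Map {layer_name: (B, A)} to {block_name: summed ||B @ A||_F}.

    Assumes hypothetical layer names of the form "<block>.<rest>" and a
    zero-initialized B, so B @ A is the whole change caused by fine-tuning.
    """
    totals = defaultdict(float)
    for name, (B, A) in lora_layers.items():
        block = name.split(".", 1)[0]
        totals[block] += frobenius(matmul(B, A))
    return dict(totals)


# Toy rank-1 updates: the "mid" update is ten times smaller than "down_0".
layers = {
    "down_0.attn": ([[1.0], [0.0]], [[3.0, 4.0]]),
    "mid.attn": ([[0.1], [0.0]], [[3.0, 4.0]]),
}
changes = block_weight_change(layers)
for block, change in changes.items():
    print(f"{block}: {change:.3f}")
```

Blocks with consistently small totals across subjects would be the candidates for hollowing under the assumption discussed above.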
- Discussion on the efficiency of Hollowed Net
Training LoRA uses 5.226GB of memory, and more than 65% of it (3.404GB) comes from holding the entire model on the GPU. Our aim is to remove part of the model during training, reducing memory consumption without undermining performance.
In particular, we assume an extremely resource-constrained environment where even running an inference (costing approximately 3.49GB in our experiments) can exhaust most of the available memory. This assumption is reasonable given that the total memory capacity of recently released mobile phones (e.g., iPhone) ranges from 6 to 8GB, which must be shared with the operating system and other background apps. In such cases, a 1.35GB memory usage reduction can be substantial, directly affecting the feasibility of performing fine-tuning.
- Comparison with ControlNet
The main differences between ControlNet and Hollowed Net arise from their distinct goals. First, ControlNet focuses on adding spatial conditioning controls to diffusion models. It leverages a cloned copy to use the pre-trained knowledge of diffusion models as initialization. However, ControlNet does not address memory efficiency concerns, as maintaining both the original and cloned networks simultaneously for training and inference requires substantial GPU memory.
In contrast, Hollowed Net is designed to achieve personalized LoRA parameters in a memory-efficient manner. We are the first to introduce a two-step fine-tuning strategy that enables efficient personalization of diffusion models with memory demands as low as inference. Compared to ControlNet, Hollowed Net uses a cloned copy to ensure that the personalized LoRA parameters can be transferred to the original network. Unlike ControlNet, which employs a cloned copy to complement the original network after training, Hollowed Net utilizes the cloned copy as a proxy for training LoRA parameters, without intending for the cloned copy to be used concurrently with the original network.
- Experiments with large datasets
We agree that testing with larger datasets is necessary. Following your suggestion, we added an analysis on the CustomConcept101 dataset, which you can find in the top-level comment.
I would like to thank the authors for the effort in the reply and for the extra evaluation of their model. Still, the concept of discarding some of the less computationally expensive layers of the U-Net for faster inference/fine-tuning is, in my view, neither novel nor impactful enough for this venue. The advantage in computational resources is relatively small compared with standard LoRA fine-tuning, and even if they show similar results on a relatively small dataset, scalability is not guaranteed (the quality/performance trade-off introduced by discarding the bottleneck layers can be significant for foundation generative models).
Regarding the assumption, you said: "smaller weight changes in certain layers after personalisation indicate that these layers are less important for the personalisation process". The certain layers are limited to a chunk of layers in the bottleneck of the U-net, a sparse representation of the layers across the whole model would be more interesting. Indeed, it's intuitive that if you discard the bottleneck of the U-net, you would get similar results at the cost of quality. From my point of view, the whole work converges to a simple model weight pruning based on the assumptions described above.
A way to improve the work would be to elaborate the paper such that the authors would deeply investigate the role of the u-net architecture in personalisation instead of defining it as "Hollowed Net". From my point of view, I'm not seeing the community start defining "model pruning" as "hollowed net", since "model pruning" is already known.
Given all these points, I'm going to keep the same rating.
Thank you for your thoughtful comments and for taking the time to review our paper. We have carefully considered your feedback and would like to provide some clarifications regarding our work.
1. Computational Advantage
While a reduction of 1.35 GB in GPU memory may seem modest under standard workstation settings (e.g., A100; 80GB), it represents a significant advantage in the context of on-device training (e.g., iPhone 15; 6-8GB). For training large generative models with on-device chipsets, even a small reduction in GPU memory can have a substantial impact on feasibility. Even an increase in GPU memory of less than 1 GB can destabilize chipset functioning and make computation infeasible. In this context, our approach offers a crucial advantage in on-device fine-tuning of diffusion models by reducing training memory to a level as low as that required for inference, while standard LoRA consumes 35% more memory than Hollowed Net in proportional terms.
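The memory figures quoted in this thread can be checked with a few lines of arithmetic. The sketch below uses only the two peak training-memory numbers reported here (5.226GB for LoRA and 3.875GB for Hollowed Net); the variable names are illustrative. It shows that "35% more" (LoRA relative to Hollowed Net) and the absolute saving of about 1.35GB describe the same gap, which is a smaller fraction when expressed relative to LoRA.

```python
lora_gb = 5.226      # peak fine-tuning memory for LoRA, as reported in the thread
hollowed_gb = 3.875  # peak fine-tuning memory for Hollowed Net, as reported

saved_gb = lora_gb - hollowed_gb
# Fraction of LoRA's memory that Hollowed Net saves.
reduction_vs_lora = saved_gb / lora_gb
# How much more memory LoRA needs, relative to Hollowed Net.
lora_overhead = lora_gb / hollowed_gb - 1.0

print(f"saved: {saved_gb:.3f} GB")
print(f"reduction relative to LoRA: {reduction_vs_lora:.1%}")
print(f"LoRA overhead relative to Hollowed Net: {lora_overhead:.1%}")
```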
2. Novelty of the Work
We would like to emphasize that there is a clear distinction between our approach, Hollowed Net, and traditional layer pruning methods. As presented in recent work [1], accepted at ECCV’24, layer pruning involves completely removing selected layers, which necessitates extensive pre-training with large datasets to recover lost information and restore functionality. Diffusion models, however, often suffer from significant performance degradation post-pruning, as lost information may not be fully recoverable through pre-training.
In the table below, we present experiments with layer-pruned Stable Diffusion models by recent work (BK-SDM [1]), accepted at ECCV’24, using rank-128 LoRA. Compared to the results in Table 1 of our paper, these models achieve memory usage comparable to Hollowed Net but show significant performance degradation, generating malformed images for subjects and prompts outside the pre-training data. Despite the extensive pre-training, the performance of these models is compromised.
| Method | # of Parameters | Model Size | Training Memory | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|---|---|
| BK-SDM-Base | 595.7M | 2.272GB | 3.546GB | 0.629 ± 0.012 | 0.788 ± 0.007 | 0.300 ± 0.001 |
| BK-SDM-Small | 496.2M | 1.893GB | 3.133GB | 0.602 ± 0.013 | 0.774 ± 0.008 | 0.298 ± 0.001 |
In contrast, Hollowed Net does not completely remove the middle layers and does not require any extensive pre-training. We temporarily exclude the selected layers during fine-tuning, while preserving essential information from those excluded layers by incorporating an additional buffering stage. Despite this buffering, the computational load for the entire training process of Hollowed Net (2052 TFLOPs) remains more efficient than LoRA fine-tuning (2148 TFLOPs).
Our approach of excluding layers for fine-tuning is a novel technique beyond traditional layer pruning, and merits further discussion within the community. In fact, Meta AI released an approach for LLMs shortly before our submission [2], which suggests that when fine-tuning LLMs, one can remove up to 40% of the middle layers, perform fine-tuning, and still achieve comparable results. However, their approach differs from ours due to the characteristics of LLMs versus diffusion U-Nets. Meta AI's method involves complete removal of middle layers for both fine-tuning and inference, arguing that these layers do not store critical knowledge compared to shallow layers.
In contrast, our research shows that the deep layers of diffusion U-Nets contain crucial high-level image features, and their complete removal can lead to severe performance degradation, even with additional pre-training (as discussed for the case of [1] above). This underlines the significance of our two-stage fine-tuning strategy, which maintains performance while reducing memory usage without requiring additional pre-training.
[1] Kim, Bo-Kyeong, et al. "Bk-sdm: A lightweight, fast, and cheap version of stable diffusion." arXiv preprint arXiv:2305.15798 (2023).
[2] Gromov, Andrey, et al. "The unreasonable ineffectiveness of the deeper layers." arXiv preprint arXiv:2403.17887 (2024).
3. Discussion on Layer Selection
Regarding layer selection, we do not base our choices on computational expense of certain layers. Instead, we leverage the skip connections inherent in the U-Net architecture to determine which layers to remove. For our main results, we choose the third layer of the second down block, the entire third down block, the entire mid block, and the entire first up block to be hollowed during fine-tuning, which corresponds to 40% of the U-Net's parameters.
Our findings suggest that removing those layers during fine-tuning is feasible because these deep layers are more involved with high-level abstract features (e.g., the concept of a dog) rather than pixel-level personalization details of a specific instance (e.g., “a [V] dog”), which leads to relatively smaller change of weights in LoRA parameters for personalization in those layers (commonly observed for different models and datasets as presented in the pdf of the top-level comment). This characteristic allows us to use intermediate activations from the frozen original network, acquired with a class-level prompt (e.g., “a dog”) during the collection stage, to update the layers of Hollowed Net during the fine-tuning stage.
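The data flow of the two stages described above can be sketched with a toy model. This is only an illustration of the control flow under stated assumptions, not the authors' implementation: blocks are scalar functions rather than neural layers, there are no LoRA parameters or gradients, the same outer blocks are shared between stages, and all names (`make_block`, `collect`, `hollowed_forward`) are hypothetical. The point is that the middle blocks run once with the class-level prompt during collection, and the fine-tuning pass replaces them with the buffered activation while the outer blocks see the instance prompt.

```python
from typing import Callable, List, Tuple

# A toy "block" maps (activation, prompt) to a new activation.
Block = Callable[[float, str], float]


def make_block(scale: float) -> Block:
    # Stand-in for a U-Net block; the prompt only enters through its length.
    return lambda x, prompt: scale * x + 0.01 * len(prompt)


def run_down(down: List[Block], x: float, prompt: str) -> Tuple[float, List[float]]:
    """Run the outer down blocks; return the last activation and the skips."""
    skips = []
    for block in down:
        x = block(x, prompt)
        skips.append(x)
    return x, skips


def run_up(up: List[Block], h: float, skips: List[float], prompt: str) -> float:
    """Run the outer up blocks, consuming skip connections in reverse order."""
    skips = list(skips)
    for block in up:
        h = block(h + skips.pop(), prompt)
    return h


def collect(down: List[Block], middle: List[Block], x: float, class_prompt: str) -> float:
    """Collection stage: run down + middle with the class prompt, buffer the output."""
    h, _ = run_down(down, x, class_prompt)
    for block in middle:
        h = block(h, class_prompt)
    return h


def hollowed_forward(down, up, x, instance_prompt, buffered) -> float:
    """Fine-tuning forward pass: middle blocks are absent; the buffer stands in."""
    _, skips = run_down(down, x, instance_prompt)
    return run_up(up, buffered, skips, instance_prompt)


def full_forward(down, middle, up, x, prompt) -> float:
    """Reference pass through the complete (non-hollowed) toy network."""
    h, skips = run_down(down, x, prompt)
    for block in middle:
        h = block(h, prompt)
    return run_up(up, h, skips, prompt)


down = [make_block(0.9), make_block(0.8)]
middle = [make_block(0.5), make_block(0.5)]  # the part hollowed out during fine-tuning
up = [make_block(1.1), make_block(1.2)]

buffered = collect(down, middle, x=1.0, class_prompt="a dog")
y_hollowed = hollowed_forward(down, up, 1.0, "a [V] dog", buffered)
y_same_prompt = hollowed_forward(down, up, 1.0, "a dog", buffered)
y_full = full_forward(down, middle, up, 1.0, "a dog")
```

In this toy setting, feeding the buffer back with the same prompt reproduces the full forward pass, while the instance prompt changes only the outer path. During actual fine-tuning, the outer blocks would additionally carry trainable LoRA parameters.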
4. Scalability of the Work
To demonstrate the scalability of Hollowed Net, we present experimental results with CustomConcept101, the largest dataset available for T2I personalization, and results with SDXL, one of the largest publicly available diffusion models (included in the top-level comment and pdf), as well as user studies to validate the consistency of metrics with human evaluation (included in the comment for Reviewer SH8P). These extensive results consistently show that Hollowed Net does not require any trade-off between performance and computation efficiency compared to LoRA, for the removal of up to 40% of parameters. Our approach maintains the same or even slightly better performance compared to LoRA, while achieving significant computational advantages for on-device learning. Because our novel two-stage fine-tuning strategy preserves the knowledge of the (temporarily) hollowed layers, it enables effective personalization for over 100 subjects with various prompts, including long-range image editing such as background editing.
We hope this detailed explanation addresses your concerns regarding the novelty and effectiveness of our work. We are open to further discussion and clarifications if needed, and we sincerely hope these points will contribute to a reconsideration of the final rating.
Thank you for your time and consideration.
This paper proposes a novel framework named “Hollowed Net” for memory-efficient LoRA fine-tuning on the personalization task, showing its potential for on-device personalization tasks with limited resources. With the hollowed UNet structure, the training memory requirement is largely reduced while performance is maintained at the same level. Meanwhile, the memory cost during inference is unchanged, since the hollowed net is a sub-network of the original UNet and can be directly transferred back. Besides, the proposed method is flexible and scalable to different model architectures with controllable hollowed layers.
Strengths
Improved Efficiency: As shown in Table 1, the hollowed net reduced training time memory cost from 5.23GB to 3.88GB while maintaining competitive quantitative performance.
Well-motivated: The observation that middle layers of the UNet contribute less during fine-tuning in the personalization task directly inspires removing them to save memory, and the paper proposes a feasible way to do it.
Weaknesses
Experimental evaluation to be improved: the paper uses DINO and CLIP scores for quantitative evaluation, however these metrics are sometimes considered less reliable because they could be inconsistent with human preference (e.g., [1][2]). Similar works have been using user studies with detailed configurations to demonstrate quantitative performance (e.g., [1][2][3][4]), which is missing in this paper.
The method took a step forward towards better efficiency, but more analysis on model size and latency will be needed to claim “on-device personalization”, since the current experiment settings still seem to be large for real on-device use cases, e.g., 1000 steps with 3.88 GB. For example: How much memory is required for the data buffer? Would the observation/motivation hold for other smaller/faster models like quantized/flow models?
[1] Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, et al. “StyleDrop: Text-to-Image Generation in Any Style”. In: 37th Conference on Neural Information Processing Systems (NeurIPS). Neural Information Processing Systems Foundation. 2023.
[2] Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. "RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control." arXiv preprint arXiv:2405.17401 (2024).
[3] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500-22510. 2023.
[4] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. "Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6527-6536. 2024.
Questions
- How is Hollowed net compared with a LoRA model that does not finetune all the attention layers?
- In figure 1 (top), the output of 5th block is conditioned on "a dog", while in figure 1 (bottom), the condition is "a [v] dog". In this way, the input from data buffer and conditional input/skip connection to block 6 is not consistent. Will this be a problem in training? e.g., requires a carefully selected class token as mentioned in the conclusion
- How is the analysis in Section 4.1 conducted? Is it only an example on one subject in dreambooth dataset? If different personalization tasks shared similar observations, can you elaborate?
- In figure 3, how is the red arrow from block2 to block3 used?
Limitations
The authors have added justifications for limitations and societal impact of their work, both of which are reasonable. It would be better to include what potential social impact would be for improving memory efficiency for on-device personalization.
Thank you for your thoughtful feedback. Please note our top-level comment with additional experimental results. Below we address specific questions.
- Experimental evaluation with user studies
Based on your suggestion, we conducted user studies with 40 participants, each completing a set of 25 comparative tasks. For each task, participants were presented with one reference image, one prompt, and two generated images (A and B). They answered two questions, one on subject fidelity and one on text fidelity. Each pair of generated images, A and B, was created using Hollowed Net and LoRA FT, and the labels (A or B) were randomly assigned for each task. The table below displays the results of the user studies. These results confirm that users generally find the images generated by Hollowed Net and LoRA FT to be similar in terms of both subject fidelity and text fidelity, consistent with the main results presented in Table 1 of the paper.
| | Hollowed Net | Tie | LoRA FT |
|---|---|---|---|
| Subject Fidelity | 31.2% | 49.3% | 19.5% |
| Text Fidelity | 18.1% | 69.4% | 12.5% |
- More analysis on model size and latency
We evaluated Hollowed Net on Stable Diffusion v2.1, which has 866M parameters. We think that fine-tuning for 1000 steps with 3.88GB on this backbone is feasible on-device for the following reasons:
2.1) [Memory] Recent mobile phones can support 8GB or 12GB memory (iPhone15 or GalaxyS23 Ultra), and thus the training memory 3.88GB can fit into this space.
2.2) [Latency] Although we could not measure the actual latency on a mobile device, we can estimate it from a previous report. For example, Table 1 in [1] shows around 7-15 s per inference on a Galaxy S23 device. We observed that one fine-tuning step needs about 2.2x the computation of an inference (6.013 TFLOPs vs. 2.717 TFLOPs). Based on this, a single training step would take roughly 22 s (2.2 × 10 s per inference), so 1000 steps would take about 6 hours, which can be done while the phone charges overnight. Furthermore, by combining our method with a hypernetwork that provides initial LoRA parameters, we may also expect a 40x reduction in training steps (as reported in the HyperDreamBooth paper), which would bring the total to less than 10 minutes.
[1] Choi, Jiwoong, et al. "Squeezing large-scale diffusion models for mobile." arXiv preprint arXiv:2307.01193 (2023).
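The latency estimate above is a back-of-the-envelope calculation, and it can be reproduced as follows. The sketch simply scales an assumed 10 s per inference (a midpoint of the 7-15 s range cited from [1], chosen here for illustration) by the FLOPs ratio quoted in the reply. It assumes latency scales linearly with FLOPs, which is a simplification, and the variable names are illustrative.

```python
train_step_tflops = 6.013   # per fine-tuning step, as quoted in the reply
inference_tflops = 2.717    # per inference, as quoted in the reply
inference_seconds = 10.0    # assumed midpoint of the cited 7-15 s range
steps = 1000                # fine-tuning steps used in the paper's setting
hypernet_step_reduction = 40  # step reduction attributed to HyperDreamBooth

# Assumption: wall-clock time scales linearly with FLOPs.
flops_ratio = train_step_tflops / inference_tflops
step_seconds = flops_ratio * inference_seconds

total_hours = steps * step_seconds / 3600
reduced_minutes = (steps / hypernet_step_reduction) * step_seconds / 60

print(f"FLOPs ratio: {flops_ratio:.2f}x")
print(f"per step: {step_seconds:.1f} s")
print(f"{steps} steps: {total_hours:.1f} h")
print(f"with {hypernet_step_reduction}x fewer steps: {reduced_minutes:.1f} min")
```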
- Memory for data buffer
For collection, 200 buffered samples are sufficient for achieving high-fidelity results. Each set of pre-computed outputs requires approximately 1.35 MB. Therefore, about 270 MB (200 × 1.35 MB) of additional storage is needed to keep all the data required later during fine-tuning. Whether all this data is held in RAM simultaneously rather than on SSD will depend on the specific implementation.
- Working with quantized/flow models
We believe that there is no clear reason why Hollowed Net cannot be applied to quantized/flow models, but some modifications may be necessary depending on the technical differences in the model architecture.
- LoRA model that does not finetune all the attention layers
The table below shows the results of selectively applying LoRAs to the layers, excluding those removed in the Hollowed Net. We find that selectively applying LoRAs leads to a slight decrease in performance. This may be due to the cross-attention layers in the middle receiving a personalization prompt (e.g., "a [V] dog") but not learning how to process the unique token "[V]". More importantly, in terms of memory efficiency, using fewer LoRAs results in only a minimal decrease in training memory. As shown in the table, the main advantage of the Hollowed Net is its ability to avoid holding the hollowed layers in GPU memory during fine-tuning, so that LoRA requires about 35% more GPU memory than Hollowed Net (5.226GB vs. 3.875GB).
| Method | # of Parameters | Model Size | Training Memory | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|---|---|
| LoRA r=128 | 892.5M | 3.404GB | 5.226GB | 0.658 ± 0.001 | 0.806 ± 0.005 | 0.299 ± 0.002 |
| LoRA r=128 (No middle) | 889.9M | 3.395GB | 5.199GB | 0.654 ± 0.011 | 0.801 ± 0.004 | 0.301 ± 0.002 |
| Hollowed Net | 550.9M | 2.102GB | 3.875GB | 0.660 ± 0.011 | 0.805 ± 0.006 | 0.300 ± 0.001 |
- Discussion on conditional input/skip connection
How to handle this inconsistency and leverage it to produce better personalization results is what the model learns during fine-tuning. The deep layers in the middle are more involved with high-level abstract features (e.g., the concept of a dog) rather than the pixel-level personalization details of a specific instance (e.g., a [V] dog). Thus, at this level, as long as the categories are well-defined, the deep features of a specific instance can be replaced with the deep features of its corresponding class (e.g., a dog) through proper fine-tuning. In fact, both the results in Table 1 of the paper and the user study presented above show that Hollowed Net can perform slightly better than LoRA fine-tuning in some cases. We assume this is because using class-level inputs for the middle layers can act as a positive regularizer for personalization to a certain extent. This suggests that the granularity at which class tokens are defined can lead to better or worse results compared to LoRA.
- Expanded explanation of analysis on LoRA weight changes
Please refer to the answer for the first question of the last reviewer: Jp6t.
- In figure 3, how is the red arrow from block2 to block3 used?
Thank you for your keen observation and a thorough review. The red arrow from block2 to block3 should be removed.
Thanks for the clear explanations from the authors. My questions are mostly resolved, though more analysis and on-device numbers would be preferred in a paper focusing on an on-device contribution.
In A6 you mentioned:
"In fact, both the results in Table 1 of the paper and the user study presented above show that Hollowed Net can perform slightly better than LoRA fine-tuning in some cases. We assume this is because using class-level inputs for the middle layers can result in positive regularization for effective personalization to a certain extent."
In my understanding, Hollowed Net doesn't use class-level inputs to tune the middle layers (compared with LoRA) while still having slightly better performance. Why is that? I don't follow why Hollowed Net has better performance than LoRA FT, since it tunes a subset of the LoRA weights.
We sincerely appreciate the time and effort you have invested in providing such a detailed review. Your feedback has been instrumental in refining our work, and we are pleased to hear that most of your concerns have been addressed.
Regarding the on-device numbers, we regret that time constraints prevented us from conducting additional experiments with actual on-device chipsets. However, we are committed to including a comprehensive analysis regarding on-device numbers in the final version of the paper. In the meantime, we have provided a table below that details the computational loads and memory consumption for each stage of training and inference for both Hollowed Net and LoRA. We hope this offers more clarity, and we will expand on this analysis in the final manuscript (Please refer to the rebuttal comment for Reviewer fGHD for more detailed analysis).
| Stage | LoRA FLOPs | LoRA GPU Memory | Hollowed Net FLOPs | Hollowed Net GPU Memory |
|---|---|---|---|---|
| Collection | - | - | 0.238T | 2.138GB |
| Fine-tuning | 2.148T | 5.226GB | 2.004T | 3.875GB |
| Inference | 0.716T | 3.597GB | 0.920T | 3.588GB |
Regarding your additional question, we apologize for any confusion caused by the phrase “using class-level inputs for the middle layers”, which we acknowledge is misleading. Allow us to clarify our assumptions about the potential regularization effect in Hollowed Net.
As discussed in [1], fine-tuning a model to replicate a given input image using an instance-level prompt (“a [V] dog”) can lead to language drift and reduced output diversity. To address this issue, [1] introduces class prior preservation loss, a regularizer that trains the model to reproduce a dog image generated with a frozen original U-Net using a class-level prompt (“a dog”). We believe that the architecture of Hollowed Net shares some similar aspects. Specifically, Hollowed Net uses a frozen original U-Net to generate intermediate features with the prompt (“a dog”). These intermediate features are then used as intermediate inputs during the fine-tuning of Hollowed Net. As the model is trained to leverage the class-level features of “a dog” from the original U-Net to generate a specific image for “a [V] dog”, it can retain knowledge about the class prior and integrate it with the subject instance's specific features. This characteristic may explain why Hollowed Net can produce slightly better results than LoRA in some cases.
We hope these clarifications and additional details address your concerns and contribute to a more favorable assessment of our work. If you have any further questions or need additional clarification, please let us know.
[1] Ruiz, Nataniel, et al. "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
My questions are clear now, I will raise the score.
We're pleased our responses addressed your concerns! Thank you for raising the score and for your valuable contributions to this discussion.
This work proposes Hollowed Net, a parameter-pruning method for training subject-driven synthesis models under limited computing resources.
Strengths
According to the experimental results, applying Hollowed Net will reduce the memory costs during training stages, while keeping similar performance.
Weaknesses
- In on-device cases, computing time is always another key issue. However, it seems that Hollowed Net needs extra forward and buffering stages (the first stage in the paper) during both training and inference. I wonder how much computational overhead and extra time these stages cause.
- Since this work is mainly focusing on space consumption, it is also necessary to evaluate the memory usage of the buffering stages. Even if these stages may take RAM instead of VRAM, it is also important for on-device computation since RAM is usually limited as well. Besides, does the Peak Usage reported in Tab. 1 include the buffering stage?
- It is necessary to provide the results with LoRA rank between 1 and 128, for both plain LoRA and Hollowed Net. To the best of my knowledge, using rank 1 may be inadequate, while using rank 128 may be excessive for subject-driven synthesis, since the official examples of the diffusers library suggest rank 4 for DreamBooth [A].
- The hollowed part in Hollowed Net cuts the gradients from backpropagating from the up blocks to the down blocks of the UNet. Providing more discussion on the effects of these gradients will make the design of Hollowed Net more reasonable. Besides, another possible design is to keep the hollowed part (without LoRA) in the forward pass while stopping gradients from flowing through it. I suppose such a design may get rid of the extra buffering stages in both training and inference periods.
- From a perspective of higher level, the idea of Hollowed Net may be applicable to many other tasks of finetuning relatively large models. The authors are encouraged to do exploration on such direction in the future.
[A] https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py
Questions
In general, the idea of pruning less important parameters is a promising topic on finetuning large models. However, I suggest further polishing w.r.t. the issues above to make this work better.
Limitations
Limitations have been discussed in the paper. The authors have claimed in the checklist that there are no societal impacts of this work.
Thank you for your valuable feedback. Please note our top-level comment with additional experimental results. Below we address specific questions.
| Stage | LoRA FLOPs | LoRA GPU Memory | Hollowed Net FLOPs | Hollowed Net GPU Memory |
|---|---|---|---|---|
| Collection | - | - | 0.238T | 2.138GB |
| Fine-tuning | 2.148T | 5.226GB | 2.004T | 3.875GB |
| Inference | 0.716T | 3.597GB | 0.920T | 3.588GB |
- Further discussion on computational overhead
In the table above, we provide a detailed analysis of computational load (FLOPs) and memory usage (peak GPU memory) for each stage of Hollowed Net and LoRA. Each number corresponds to one step of each stage: one forward pass for Collection and Inference, and one forward+backward pass for Fine-tuning.
For fine-tuning, 1000 steps are required, totaling 1000 × 2.004 = 2004 TFLOPs of computation. For Collection, 200 buffered samples are sufficient for achieving high-fidelity results, which requires 200 × 0.238 = 47.6 TFLOPs of additional computation. Thus, the total computation required for training with Hollowed Net is about 2052 TFLOPs, which is lower than the 1000 × 2.148 = 2148 TFLOPs needed for fine-tuning LoRA.
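The totals in this comparison follow directly from the per-step figures in the table above. The short sketch below recomputes them, with illustrative variable names and the step and sample counts stated in the reply.

```python
finetune_steps = 1000
collection_samples = 200

# Per-step costs in TFLOPs, taken from the table in this reply.
lora_finetune = 2.148
hollowed_finetune = 2.004
hollowed_collection = 0.238

hollowed_total = (finetune_steps * hollowed_finetune
                  + collection_samples * hollowed_collection)
lora_total = finetune_steps * lora_finetune

print(f"Hollowed Net training total: {hollowed_total:.1f} TFLOPs")
print(f"LoRA training total:         {lora_total:.1f} TFLOPs")
print(f"relative saving:             {1 - hollowed_total / lora_total:.1%}")
```

Even with the extra collection pass included, the Hollowed Net total stays below the LoRA total, although the margin in compute is much smaller than the margin in memory.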
For inference, Hollowed Net requires around 28% more computation than LoRA, as it needs to repeat part of the early down blocks to reproduce the path used in training. Given that recent works show on-device inference of Stable Diffusion models can be done within 7-15 seconds, this additional inference cost may not be a significant issue when weighed against the advantage of lowering the memory cost of fine-tuning to a feasible level.
However, if inference latency is a concern, a variant of the inference method for Hollowed Net can be considered. This variant involves a single inference path with LoRA, where the cross-attention layers hollowed during training receive a non-personalized prompt (e.g., "A dog") as input. As shown in the table below, this variant (Single Path) requires the same amount of computational FLOPs as LoRA but achieves a level of personalization equivalent to that of the default inference method (Dual Path). One potential explanation for the effectiveness of this efficient method is that the high-level features in the middle layers are focused on processing abstract class-level characteristics of an image rather than capturing personalization details at the pixel level. Therefore, even if the middle layers are fed features of a personalized dog (in Single Path) instead of a non-personalized dog (in Dual Path) from the early layers, the extraction of high-level features for "A dog" is not affected.
| Inference Method | FLOPs | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|
| Hollowed Inference (Dual Path) | 0.920T | 0.660 ± 0.011 | 0.805 ± 0.006 | 0.300 ± 0.001 |
| Hollowed Inference (Single Path) | 0.716T | 0.660 ± 0.011 | 0.802 ± 0.006 | 0.301 ± 0.001 |
- Further discussion on space consumption
The Peak Usage reported in Table 1 of our paper represents the maximum GPU memory (VRAM) consumption during the fine-tuning stage. As shown in the table above, the VRAM usage during the Collection stage is significantly lower because only part of the upper blocks needs to be retained. Each set of pre-computed outputs requires approximately 1.35 MB. Therefore, with 200 buffered samples, about 200 × 1.35 = 270 MB of additional storage is needed to keep all the data required later during fine-tuning. Whether this data is held in RAM all at once or written to SSD depends on the specific implementation.
- Results with LoRA rank between 1 and 128
In the table below, we present the results using LoRA with different ranks (4 and 16). While the default rank of 4 in the diffusers library is often used, we have found that it may oversimplify personalization details or fail to effectively handle a range of challenging subjects and prompts. Increasing the rank from 4 to 16 improves subject fidelity. However, to achieve personalization quality comparable to full fine-tuning across all subjects and prompts in the DreamBooth dataset, we find that a rank of 128 is necessary.
| Method | # of Parameters | Model Size | Training Memory | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|---|---|
| LoRA r=4 | 866.7M | 3.306GB | 4.847GB | 0.564 ± 0.014 | 0.766 ± 0.006 | 0.311 ± 0.001 |
| LoRA r=16 | 869.2M | 3.316GB | 4.883GB | 0.618 ± 0.008 | 0.788 ± 0.005 | 0.305 ± 0.001 |
| Hollow r=4 | 527.7M | 2.013GB | 3.526GB | 0.566 ± 0.009 | 0.763 ± 0.003 | 0.311 ± 0.001 |
| Hollow r=16 | 529.9M | 2.021GB | 3.558GB | 0.626 ± 0.009 | 0.789 ± 0.005 | 0.305 ± 0.001 |
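As a rough consistency check on the parameter counts above, the sketch below fits a straight line through the rank-4 and rank-16 rows and extrapolates to rank 128. It assumes the adapter parameter count grows linearly with rank (standard LoRA adds r × (d_in + d_out) parameters per adapted weight matrix) and that the same layers are adapted at every rank; the function names are illustrative only.

```python
def fit_linear(rank_a, params_a, rank_b, params_b):
    """Return (base, per_rank) for params = base + per_rank * rank, in millions."""
    per_rank = (params_b - params_a) / (rank_b - rank_a)
    base = params_a - per_rank * rank_a
    return base, per_rank


def predict(base, per_rank, rank):
    """Extrapolated total parameter count (millions) at the given LoRA rank."""
    return base + per_rank * rank


# Rank-4 and rank-16 rows from the table above, in millions of parameters.
lora_base, lora_per_rank = fit_linear(4, 866.7, 16, 869.2)
hollow_base, hollow_per_rank = fit_linear(4, 527.7, 16, 529.9)

lora_r128 = predict(lora_base, lora_per_rank, 128)
hollow_r128 = predict(hollow_base, hollow_per_rank, 128)

print(f"LoRA:   base {lora_base:.1f}M, rank 128 -> {lora_r128:.1f}M")
print(f"Hollow: base {hollow_base:.1f}M, rank 128 -> {hollow_r128:.1f}M")
```

Under these assumptions the extrapolated rank-128 totals land close to the 892.5M (LoRA FT) and 550.9M (Hollowed Net) figures reported elsewhere in this thread, if those rows use rank 128; the small gap for Hollowed Net is within what rounding of the tabulated values to 0.1M would explain.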
- Discussion on the design of Hollowed Net
This is indeed a very interesting suggestion. Modifying the gradient flow by stopping gradients from passing through the hollowed part could replicate backpropagation in Hollowed Net while maintaining the hollowed layers. However, the primary advantage of Hollowed Net is that it avoids holding these hollowed layers in GPU memory during fine-tuning, which saves approximately 1.29GB of additional memory. This reduction in memory usage is a significant benefit, making it challenging to eliminate the extra buffering stages.
Thanks for the author response, as it has settled most of my concerns.
In the reply to my first two questions, the authors mentioned that 200 buffered samples are sufficient for achieving high-fidelity results. However, I am not fully convinced, since each sample consists of a specific input image, a specific noise to be added, and a specific timestep. If the authors argue that 200 samples of (image, noise, timestep) tuples are enough, an ablation study on this quantity would support the statement. I am not requesting the results of these experiments during this discussion period, yet the authors are strongly encouraged to include them in a future version.
Also, since other reviewers also raised concerns w.r.t. computational load and space consumption, the authors may consider delivering more discussion on this part.
I will consider raising my rating accordingly in the final recommendation.
Thank you for considering raising our score. We greatly appreciate your thoughtful feedback and are pleased to hear that most of your concerns have been addressed. Your insights are invaluable, and we are taking them very seriously as we continue to refine our work.
To further clarify and resolve your remaining concerns, we would like to provide additional results and explanations regarding the high-fidelity outcomes achieved with a smaller number of samples. When evaluating our method using 200 buffered samples, we achieved the following metrics: DINO: 0.659 (±0.011), CLIP-I: 0.803 (±0.006), and CLIP-T: 0.300 (±0.001). These results are highly comparable to those of Hollowed Net and LoRA FT, as presented in Table 1 of the paper.
To explain the minimal effect of using fewer samples, consider collecting 200 samples with a subset of timesteps (e.g., 1, 6, 11, ..., 991, 996). Updating with 200 randomly shuffled samples at a batch size of one would constitute one epoch of 200 steps. Conducting 1000 fine-tuning steps would amount to five epochs. However, it is important to note that in many instances and prompts, 1000 fine-tuning steps are not strictly necessary, as the model often begins to show high-fidelity results after 200-500 steps. Moreover, using a subset of timesteps does not inherently have a negative impact on overall performance. In fact, recent work [1] has shown that sparse updating, utilizing only 1-10% of timesteps, does not degrade performance. In some cases, it even enhances visual quality, presumably due to the regularization effect of sparsity. We believe these considerations contribute to why Hollowed Net's performance remains comparable to LoRA FT even with fewer samples.
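The epoch arithmetic in this explanation can be sketched as follows. The sketch assumes a 1000-step diffusion training schedule and tracks only the timestep of each buffered sample; in the actual setup each sample would also carry an image and a noise tensor. All names here are hypothetical.

```python
import random

NUM_TRAIN_TIMESTEPS = 1000  # assumed length of the diffusion schedule
STRIDE = 5                  # keep every fifth timestep: 1, 6, 11, ..., 996
FINE_TUNING_STEPS = 1000    # batch size one, so one sample per step

# Subset of timesteps used when collecting the buffered samples.
timesteps = list(range(1, NUM_TRAIN_TIMESTEPS, STRIDE))

# With one sample per step, one epoch is one pass over the buffer.
epochs = FINE_TUNING_STEPS // len(timesteps)


def shuffled_schedule(samples, num_epochs, seed=0):
    """Visit every buffered sample once per epoch, in a fresh random order."""
    rng = random.Random(seed)
    order = []
    for _ in range(num_epochs):
        epoch = list(samples)
        rng.shuffle(epoch)
        order.extend(epoch)
    return order


schedule = shuffled_schedule(timesteps, epochs)
coverage = len(timesteps) / NUM_TRAIN_TIMESTEPS

print(len(timesteps), "buffered samples,", epochs, "epochs,", len(schedule), "steps")
print(f"fraction of timesteps covered: {coverage:.0%}")
```

Stopping after 200-500 steps, as mentioned above, would correspond to truncating `schedule` after one to two and a half passes over the buffer.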
Besides, please note that Hollowed Net still requires less computational effort for the entire training process than LoRA FT when using up to 600 samples. In our final version, we plan to include extensive ablation studies on the impact of varying the number of samples on both quantitative and qualitative results, similar to the ablation study on fractions of hollowed layers in Section 5.3 of our paper.
Furthermore, we will provide a detailed analysis of computational loads and space consumption in the final version, including a breakdown of FLOPs, VRAM, and RAM analysis for different numbers of collected samples and different fractions of hollowed layers, which will enable users to choose the optimal configurations of Hollowed Net according to their specific resource constraints.
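To give a sense of how such a breakdown might look, the sketch below sweeps the number of collected samples and reports total training FLOPs and buffer storage for each. It uses only the per-step FLOPs and the approximately 1.35 MB per buffered sample quoted earlier in this thread; the function name and the chosen sample counts are illustrative assumptions, and VRAM is not modeled.

```python
COLLECTION_TFLOPS = 0.238   # one forward pass in the Collection stage
HOLLOWED_FT_TFLOPS = 2.004  # one Hollowed Net fine-tuning step
LORA_FT_TFLOPS = 2.148      # one LoRA fine-tuning step
FINE_TUNING_STEPS = 1000
MB_PER_SAMPLE = 1.35        # approximate size of one set of buffered outputs


def hollowed_budget(samples):
    """Return (total training TFLOPs, buffer storage in MB) for a sample count."""
    total = HOLLOWED_FT_TFLOPS * FINE_TUNING_STEPS + COLLECTION_TFLOPS * samples
    return total, samples * MB_PER_SAMPLE


lora_total = LORA_FT_TFLOPS * FINE_TUNING_STEPS

# Largest sample count for which Hollowed Net training still costs less.
flops_saved_in_fine_tuning = lora_total - HOLLOWED_FT_TFLOPS * FINE_TUNING_STEPS
break_even = int(flops_saved_in_fine_tuning / COLLECTION_TFLOPS)

for n in (100, 200, 400, 600):
    total, storage = hollowed_budget(n)
    print(f"{n:4d} samples: {total:7.1f} TFLOPs, {storage:6.1f} MB buffered")
print(f"LoRA baseline: {lora_total:.1f} TFLOPs, break-even near {break_even} samples")
```

The break-even point sits just above 600 samples, which matches the statement that Hollowed Net remains cheaper than LoRA FT when using up to 600 samples.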
Once again, we sincerely appreciate your valuable feedback and consideration. We hope this additional information addresses your concerns and aids in your final recommendation.
[1] Lu, Haoming, et al. "Specialist diffusion: Plug-and-play sample-efficient fine-tuning of text-to-image diffusion models to learn any unseen style." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
Thanks for the additional information. I will take it into account as well.
Thank you for your kind response. We are sincerely grateful for the time and effort you dedicated to thoroughly reviewing our paper and participating in the discussion. We hope this additional information addresses your concerns and contributes to a more favorable assessment of our work. We will carefully consider all the feedback you provided during the rebuttal and discussion periods as we update the paper to enhance its completeness. Once again, thank you very much for your thoughtful and thorough review.
The paper introduces Hollowed Net, a novel approach for on-device personalization with T2I LDMs. The model gives better memory efficiency by modifying the U-Net to remove less-essential deep layers. This helps with the on-device memory constraints. It reduces memory usage during fine-tuning while maintaining personalization performance. The paper provides quantitative and qualitative analyses to demonstrate its effectiveness and discusses potential applications and scalability.
Strengths
The paper focuses on reducing memory usage by modifying the U-Net architecture to remove non-essential deep layers. The paper addresses memory efficiency and demonstrates reduced GPU memory while maintaining personalization performance. The paper provides quantitative and qualitative results to show its effectiveness. The paper discusses potential applications and scalability.
Weaknesses
- In Table 1, a memory saving of just over 1GB, while beneficial, is still marginal compared to LoRA, especially if the overall computational efficiency does not significantly surpass that of existing methods.
- As mentioned by the authors, methods like SVDiff optimize the singular values of the weight matrices, leading to an even smaller model checkpoint. Is there a comparison with other memory-efficient approaches like this?
- The idea of removing less important U-Net layers seems to be a one-time optimization, because it relies on identifying and justifying certain layers as less important than others. This might not scale or generalize across different models and tasks, as the importance of different layers could vary with the application or dataset.
Questions
Is there a comparison with other memory-efficient approaches besides LoRA?
Limitations
The authors have discussed the limitations of their work, along with potential negative societal impacts.
Thank you for your thoughtful feedback. Please note our top-level comment with additional experimental results. Below we address specific questions.
- A memory saving of just over 1GB, while beneficial, is still marginal
In this paper, we are assuming an extremely resource-constrained environment where even running an inference (costing approximately 3.49GB in our experiments) can exhaust most of the available memory. This assumption is reasonable given that the total memory capacity of recently released mobile phones (e.g., iPhone) ranges from 6 to 8GB, which must be shared with the operating system and other background apps. In such cases, a 1.35GB memory usage reduction can be substantial, directly affecting the feasibility of performing fine-tuning.
The strength of Hollowed Net lies in its ability to enable model fine-tuning with memory usage similar to or even less than that required for inference, depending on the proportion of hollowed layers. Furthermore, in proportional terms, LoRA consumes about 35% more training memory than Hollowed Net, which is a significant gap. In addition, further memory reduction can be achieved by hollowing more layers, depending on the use case and task. We believe this substantial reduction is crucial in environments with limited resources and makes our approach highly beneficial.
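The size of the memory gap can be expressed in several ways, and the sketch below computes each from the training-memory figures reported in this thread (5.226GB for LoRA, 3.875GB for Hollowed Net). The 6GB device budget is the lower end of the 6 to 8GB range mentioned above; treating it as fully available to training is a simplifying assumption, since the operating system and other apps share it.

```python
LORA_TRAINING_GB = 5.226      # peak GPU memory, LoRA fine-tuning
HOLLOWED_TRAINING_GB = 3.875  # peak GPU memory, Hollowed Net fine-tuning
DEVICE_BUDGET_GB = 6.0        # assumed total device memory (lower end of range)

# Absolute saving in GB.
saved_gb = LORA_TRAINING_GB - HOLLOWED_TRAINING_GB

# How much more memory LoRA needs, relative to Hollowed Net.
lora_extra = LORA_TRAINING_GB / HOLLOWED_TRAINING_GB - 1.0

# How much less memory Hollowed Net needs, relative to LoRA.
hollowed_reduction = saved_gb / LORA_TRAINING_GB

# Memory left on the device for everything else during fine-tuning.
headroom_lora = DEVICE_BUDGET_GB - LORA_TRAINING_GB
headroom_hollowed = DEVICE_BUDGET_GB - HOLLOWED_TRAINING_GB

print(f"saved: {saved_gb:.3f} GB")
print(f"LoRA uses {lora_extra:.1%} more than Hollowed Net")
print(f"Hollowed Net uses {hollowed_reduction:.1%} less than LoRA")
print(f"headroom on a {DEVICE_BUDGET_GB:.0f} GB device: "
      f"{headroom_lora:.3f} GB (LoRA) vs {headroom_hollowed:.3f} GB (Hollowed Net)")
```

The two percentages describe the same 1.351GB gap with different baselines, so which one is quoted depends on whether LoRA or Hollowed Net is taken as the reference.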
- Comparison with other memory-efficient approaches
Unfortunately, SVDiff does not have an official code release. We attempted to use a third-party reproduced code from GitHub, but it was not implemented in a memory-efficient manner, requiring over 14GB of peak GPU memory for processing a single image.
Theoretically speaking, the mechanism of SVDiff is not significantly different from LoRA in terms of memory usage. As described in the SVDiff paper, it uses 2200 times fewer parameters than vanilla DreamBooth, which suggests that SVDiff updates around 391K parameters. This is nearly double the number of parameters updated by rank-1 LoRA, presented in Table 1 of our paper. Consequently, we estimate that the minimal memory usage achievable by SVDiff would be at least 1GB more than Hollowed Net. In fact, all fine-tuning methods that work by reducing the number of updated parameters, like SVDiff, share a fundamental limitation: they can reduce the number of updated parameters but cannot reduce the actual model size, which accounts for a significant portion of memory usage when performing backpropagation over the entire backbone network. In this context, rank-1 LoRA demonstrates the minimal baseline of memory usage that these methods can achieve.
Another memory-efficient approach to consider is model compression, which reduces the backbone model's size through pre-training with a large dataset. We present results using BK-SDM, a recent method accepted at ECCV '24, in the table below. We fine-tuned the compressed Stable Diffusion models of BK-SDM using rank-128 LoRA on the DreamBooth dataset. Our experimental results show that while these models achieve memory usage comparable to Hollowed Net, they suffer from significant performance degradation. These models' capacity is limited to the data used during model compression, and for subjects and prompts outside this data, they can even generate seriously malformed images.
In contrast, Hollowed Net achieves both low memory usage, comparable to compressed models, and high-quality results, similar to full fine-tuning of diffusion models. This is achieved through our novel approach, which effectively removes layers during training while preserving the knowledge from those hollowed layers throughout both training and inference stages.
| Method | # of Parameters | Model Size | Training Memory | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|---|---|
| BK-SDM-Base | 595.7M | 2.272GB | 3.546GB | 0.629 ± 0.012 | 0.788 ± 0.007 | 0.300 ± 0.001 |
| BK-SDM-Small | 496.2M | 1.893GB | 3.133GB | 0.602 ± 0.013 | 0.774 ± 0.008 | 0.298 ± 0.001 |
- Discussion on scalability and generalizability
To demonstrate the scalability and generalizability of our method, we include additional results and analysis in top-level comments. Please also refer to the answer for the first question of the last reviewer: Jp6t.
Dear Reviewer cT56,
As the discussion period draws to a close, we are mindful of the limited time available to address any further questions or requests for clarification. To ensure we provide all necessary information for your final recommendation, we would like to further clarify our previous rebuttal, particularly regarding scalability and generalizability, for which we pointed to our other comments rather than answering directly.
To address your concerns on the scalability and generalizability of Hollowed Net, we have presented experimental results with SDXL, one of the largest publicly available diffusion models, and results with CustomConcept101, the largest dataset available for T2I personalization (included in the top-level comment and pdf), as well as user studies to validate the consistency of metrics with human evaluation (included in the comment for Reviewer SH8P). Across all these experiments, Hollowed Net consistently demonstrates its ability to enable effective personalization for over 100 subjects with various challenging prompts of different applications including background editing, artistic rendition, accessorization, and property modification. This performance is comparable to Full / LoRA fine-tuning of diffusion models, while significantly reducing the required training memory.
Our findings suggest that Hollowed Net’s approach of excluding middle layers during fine-tuning is feasible because these deep layers are more involved with high-level abstract features (e.g., the concept of a dog) rather than pixel-level personalization details of a specific instance (e.g., “a [V] dog”), which leads to relatively smaller change in LoRA parameters during personalization in those layers, a trend commonly observed across different models and datasets (as presented in the pdf of the top-level comment). This characteristic allows us to use intermediate activations from the frozen original network, obtained with a class-level prompt (e.g., “a dog”) during the collection stage, to update the layers of Hollowed Net during the fine-tuning stage.
As elaborated in our rebuttal above, Hollowed Net is a novel method that overcomes the fundamental limitation of fine-tuning methods that work by reducing the number of updated parameters, such as SVDiff, which cannot reduce the actual model size that requires a significant amount of GPU memory during fine-tuning. While a reduction of 1.35 GB in GPU memory may seem modest under standard workstation settings (e.g., A100; 80GB), it represents a significant advantage in the context of on-device training (e.g., iPhone 15; 6~8GB). For training large generative models with on-device chipsets, even a small reduction in GPU memory can make a substantial impact on feasibility. Even an increase in GPU memory of less than 1 GB can destabilize chipset functioning and make computation infeasible. In this context, our approach offers a crucial advantage in on-device fine-tuning of diffusion models by reducing training memory to a level as low as that required for inference, while standard LoRA consumes 35% more memory than Hollowed Net in proportional terms.
We sincerely appreciate the time and effort you have invested in reviewing our work. We hope these clarifications and additional details address your concerns and contribute to a more favorable assessment of our work.
Dear Reviewer,
We sincerely appreciate the reviewers for dedicating their time and effort to review our work. We have conducted multiple sets of experiments and studies to address all feedback and questions from the reviewers. Please refer to the attached document for a one-page analysis.
To show the scalability and generalizability of the Hollowed Net approach, we include experimental results on a larger dataset, CustomConcept101, which includes 101 different subjects across 15 broad categories, with 20 unique prompts for each category. We conduct the evaluation following the same protocols described in our paper for the DreamBooth dataset. As presented in the table below, the experimental results demonstrate that Hollowed Net consistently achieves results comparable to full/LoRA fine-tuning of the diffusion models. Qualitative results can be found in the pdf, and they show that our method works very well for different instances with various challenging prompts across different applications, including background editing, artistic rendition, accessorization, and property modification.
| Method | # of Parameters | Model Size | Training Memory | DINO | CLIP-I | CLIP-T |
|---|---|---|---|---|---|---|
| Full FT | 865.9M | 3.303GB | 16.623GB | 0.605 ± 0.005 | 0.773 ± 0.006 | 0.302 ± 0.002 |
| LoRA FT | 892.5M | 3.404GB | 5.226GB | 0.603 ± 0.008 | 0.773 ± 0.005 | 0.302 ± 0.002 |
| Hollowed Net | 550.9M | 2.102GB | 3.875GB | 0.603 ± 0.007 | 0.773 ± 0.005 | 0.302 ± 0.002 |
We also apply Hollowed Net to the SDXL model by removing all layers in the "mid_block", which contains 410M parameters. We include example qualitative results in the pdf. The results show that Hollowed Net achieves high-fidelity personalization results comparable to LoRA.
Additional experiments are included in the rebuttal comments for each corresponding reviewer.
The paper proposes a new approach for hardware-constrained personalization. Three out of four reviewers ranked the paper positively, recognizing the novelty of the proposed approach. They highlighted that the reduction in memory is significant and, importantly, does not come with a drop in performance. There was a fruitful discussion between reviewers and authors, in which, according to the AC's understanding, most of the comments were addressed. One reviewer provided a strongly negative rating. Compared to a typical review, the AC found their review somewhat short. The authors tried to address their concerns and answer their questions by providing further discussion and experiments. The reviewer, however, remained unconvinced and kept the negative score. The AC read their comments and the authors' responses, and believes that the authors addressed their concerns to a reasonable extent. Hence, the decision is to recommend acceptance. Congrats!