PaperHub

Overall rating: 5.5/10 (Poster, 4 reviewers)
Individual ratings: 4, 7, 6, 5 (min 4, max 7, std 1.1)
Confidence: 4.0
Correctness: 2.8
Contribution: 3.0
Presentation: 3.0

Abstract

Keywords
Low Rank Adapters, Fourier Transform, Generative Models

Reviews and Discussion

Review
Rating: 4

This paper presents a PEFT method (mainly for text-to-image tasks) called FouRA. FouRA learns the LoRA projection in the frequency domain. This idea helps solve the problems of data copying and distribution collapse and thus improves the generated image quality. The effectiveness of FouRA is verified on both CV and NLP tasks.

Strengths

  1. The paper is well-organized.

  2. The introduction of the FouRA method is clear and easy to follow.

  3. Among the problems this paper focuses on, adaptive rank selection is an important one in the PEFT field.

Weaknesses

First, the ordering of the points below does not reflect their importance.

  1. Efficiency is an important property that PEFT methods should have. Compared to LoRA, FouRA introduces additional computational operations, since multiple 1D-DCT transforms are involved in Eq. 1. FouRA must perform this transform for every token. Will these 1D-DCTs take too much time? Can the authors show the time required for each training epoch?

  2. Also, regarding GPU cost, can the authors please report the peak GPU memory required during fine-tuning? Comparisons of both time and GPU cost can follow two fair settings: (a) FouRA and LoRA achieve similar accuracy, or (b) FouRA and LoRA have (roughly) the same rank.

  3. In my opinion, except for the methods section, the paper is not very easy to follow. The main reason is that the authors attempt to claim too many arguments, but not all problems are fully analyzed and solved. Generally, the three core points of this article are (a) the Fourier low-rank adaptation, (b) the adapter rank selection strategy, and (c) enabling a flexible mixture of multiple adapters. Points 4, 5, and 6 below are my questions about these three points.

  4. About Fourier. First of all, I don't quite understand how to get $\Delta W_{foura}$ from Eq. 5; can the authors provide a derivation? For Lemma 4.1, my understanding is that $\Delta W_1$ and $\Delta W_2$ are actually two potential fitting targets, and one can judge the error of LoRA's rank-$r$ approximation by their eigenvalue distributions. In short, these two variables are approximation targets rather than the fine-tuning results of LoRA or FouRA. So what is the specific meaning of the eigenvalues calculated in Figure 4? There should be a simpler and less confusing way to verify the error of the FouRA approximation, such as fitting a random target matrix (see Figure 6 in the VeRA [1] paper) or designing some simple classification tasks (see Figure 7 in the FourierFT [2] paper) with your method.

  5. Adaptive Rank Selection. I reserve my opinion on flexibility. Increasing flexibility may not always lead to high generalization, and may even make convergence difficult (the authors could provide a comparison of the convergence speed of fine-tuning with and without the gating module). The claim about flexibility (i.e., input-dependent selection) is too strong without evidence or reasonable intuitions. On the contrary, the data-agnostic selection paradigm is probably more concise and elegant, because we do not need to learn selection strategies on new datasets. If the authors insist on making this claim, they should show sufficient experimental results, e.g., that data-dependent selection is better than selection that relies only on the model. In addition, I could not find any ablation study results on rank selection.

  6. Multiple Adapters. I am not sure what the purpose of Section 3.5 is. I understand that PEFT methods are used in many scenarios and therefore should satisfy many desirable properties. However, it seems that many important metrics are not evaluated, such as efficiency, the number of trainable parameters, and the storage memory occupied by the adapter.

  7. It is recommended that the authors focus their writing on the text-to-image task. Although there are experimental results on GLUE, these do not seem sufficient to verify that FouRA is a general PEFT method. If the main claim of the paper is a PEFT method for text-to-image generation, I personally believe it will be more readable and the contribution will be more prominent.

[1] VeRA: Vector-based Random Matrix Adaptation. ICLR 2024.

[2] Parameter-Efficient Fine-Tuning with Discrete Fourier Transform. ICML 2024.

Questions

In Line 248, is "RoBERTA-Base" a typo? The results in Table 3 look more like the performance of RoBERTa-Large. Moreover, can the authors provide code or a demo just for reproducing the result (70.6) on the CoLA dataset? It would be cool if one could reproduce this result with FouRA, regardless of whether the base or large RoBERTa model is used.

Limitations

None.

Author Response

We appreciate reviewer 4zX8 for their detailed feedback and in-depth review, which helped us improve our work.

Training Time: We provide a detailed analysis of training time per epoch in Table R.1. One training epoch takes 24.5s (for FouRA with inference-adaptive masking) compared with 22s (for baseline LoRA), keeping the rank fixed across the two methods. We will add this analysis to the paper.

GPU Memory: Thanks for the suggestion. We report peak memory usage in Table R.1. We further analyze performance with varying training complexity (training time, memory usage) in Figure R.1. To vary time, we report HPS scores of FouRA vs. LoRA at intermediate epochs. To vary memory, we vary the rank. We observe that FouRA consistently achieves better performance-vs-compute operating points compared to LoRA.
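For reference, a minimal sketch of how such per-epoch time and peak GPU memory numbers can be measured in PyTorch; `run_training_epoch` is a placeholder for the actual fine-tuning loop, not our code:

```python
import torch

def profile_epoch(run_training_epoch) -> tuple[float, float]:
    """Measure wall-clock seconds and peak GPU memory (MB) for one epoch."""
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    run_training_epoch()          # placeholder for the fine-tuning loop
    end.record()
    torch.cuda.synchronize()      # ensure all kernels have finished

    seconds = start.elapsed_time(end) / 1000.0          # elapsed_time returns ms
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return seconds, peak_mb
```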

About Fourier: Similar to $\Delta W_{lora}$, $\Delta W_{foura}$ is defined as the weight projection of the second term in Eq. 5. The term $\mathbf{G}$ is the output of $\mathcal{G}$. We will clarify this in the text. We also want to clarify that in Sec. 4.1, $\Delta W_2 = \mathcal{F}^{-1}BA\mathcal{F}$ and $\Delta W_1 = BA$ are the trained FouRA (no-mask) and LoRA weights, not potential fitting targets. The singular value spread in Figure 4 is that of a low-rank approximation of both of these trained matrices, following prior works [1, 2]. It can be inferred from [1, 2, 3] that the compactness of the eigen-spread demonstrates the capability of FouRA adapters over LoRA to produce lower errors when the rank is reduced. It also shows that the frequency domain can learn richer information under a sparsity constraint. We also show in Appendix B.3 how FouRA learns representations that are more de-correlated from the base weights.
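To make the notation concrete, below is a minimal sketch of a low-rank adapter applied in an orthonormal transform domain, i.e. the $\mathcal{F}^{-1}BA\mathcal{F}$ term with the gating $\mathbf{G}$ omitted. The class and helper names are illustrative rather than our exact implementation, and the DCT is built explicitly as a matrix to keep the sketch self-contained:

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis as an (n, n) matrix (inverse = transpose)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)
    m = math.sqrt(2.0 / n) * torch.cos(math.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= math.sqrt(2.0)
    return m

class FrequencyLowRankAdapter(nn.Module):
    """Low-rank update applied in the transform domain (gating omitted)."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.register_buffer("f_in", dct_matrix(base.in_features))
        self.register_buffer("f_out", dct_matrix(base.out_features))
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)   # adapter starts as a zero update

    def forward(self, x):                   # x: (..., in_features), per token
        x_freq = x @ self.f_in.T            # forward 1D transform of the input
        delta = self.B(self.A(x_freq))      # low-rank projection in frequency space
        delta = delta @ self.f_out          # inverse transform (orthonormal: F^{-1} = F^T)
        return self.base(x) + delta
```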

Our analysis above was on trained weights of the diffusion model and did not require a toy task. Per your suggestion, we have conducted an analysis fitting the MNIST task, comparing the training loss of FouRA (without gating) and LoRA layers. Figure R.3 shows results for two ranks. The number of trainable parameters is equal for both adapters. As observed, Fourier-domain training leads to lower errors compared to LoRA. We also find that the gap between LoRA and FouRA widens as the rank is reduced.

Adaptive Rank: We provide intuitions for our proposed adaptive gated rank selection algorithm in Sec. 4.2 of the paper. Adding to this, we argue that input-dependent rank selection is advantageous as it not only selects the rank, but also specific vectors in the low-rank subspace. The ideal vector directions in the low-rank subspace vary with inputs of different characteristics; e.g., certain vectors will be sensitive to specific frequencies, and we argue that our proposed input-based selection algorithm finds optimal vectors as compared to a frozen dynamic gating function [4]. In a diffusion model, for instance, at varying diffusion timesteps (corresponding to different levels of input noise), the optimal vectors vary based on their sensitivity to the noise. We also analyze this intuition in Fig. 5 by plotting the effective rank across the denoising UNet and across timesteps. Observe that the learnt effective rank reduces as the diffusion process concludes, meaning less noisy inputs are sensitive to fewer vectors. Similarly, higher input resolution ideally requires a higher number of vectors (hence the higher effective rank at the up.3 and down.0 blocks of the diffusion UNet). We provide ablation studies in Table R.2 (and Fig. 9 of the text) to empirically validate this motivation, showing that FouRA with adaptive masking outperforms FouRA with frozen masking. Finally, following your suggestion, we also plot the training curves for Fourier vs. Fourier+gating in Fig. R.4, showing that the speed of convergence is not affected by gating.
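For illustration only, a minimal sketch of input-dependent gating over the rank dimension as described above; the module, temperature, and thresholding details are simplified placeholders, not our exact implementation:

```python
import torch
import torch.nn as nn

class AdaptiveRankGate(nn.Module):
    """Soft gate over the r low-rank directions, conditioned on the input."""
    def __init__(self, rank: int, temperature: float = 8.0, threshold: float = 0.5):
        super().__init__()
        self.scores = nn.Linear(rank, rank)   # learned, input-dependent gate scores
        self.temperature = temperature
        self.threshold = threshold

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., rank) features in the low-rank subspace
        g = torch.sigmoid(self.temperature * self.scores(z))
        if not self.training:
            # hard selection at inference: keep only the gated-on directions
            g = (g > self.threshold).to(z.dtype)
        return g * z
```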

Merging: Sec. 3.5 motivates the use of FouRA adapters when merging two adapters, as compared to LoRA. Please see Appendix B.3.1 and B.4, which contain important analyses demonstrating that FouRA learns representations that are more likely to be disentangled between two FouRA adapters, compared to LoRA. This property proves critical in adapter merging, as FouRA can generate images which successfully retain the capabilities of both adapters when they are merged. We also observe higher amplification of subspaces not emphasized by the frozen base model in Table B.2. This is important because FouRA is a training-free approach to improving the merging capabilities of low-rank adapters, providing great flexibility over contemporary works which propose joint training methods for this orthogonalization of subspaces. Please see the adapter-merging results in Figure 7 and Sections 5.2 and 5.3; these all use a training-free merge, which is a simple arithmetic add. Additional analysis we conducted shows that scaling up the number of trainable parameters in LoRA to match FouRA does not affect performance, and FouRA continues to outperform LoRA by a similar delta. Thanks for proposing it; we will include this study.
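Concretely, the training-free merge referred to above amounts to summing the adapter updates on top of the frozen base output; a minimal sketch (the scaling weights `w_a`, `w_b` are illustrative, not values from the paper):

```python
def merged_forward(x, base_layer, adapter_a, adapter_b, w_a=1.0, w_b=1.0):
    # Training-free merge: frozen base output plus a weighted sum of the
    # two independently trained adapter updates.
    return base_layer(x) + w_a * adapter_a(x) + w_b * adapter_b(x)
```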

Generalizability: We agree that the results on GLUE tasks might not be sufficient. Hence, we have performed further analysis on eight commonsense reasoning benchmarks, using Llama3-8B as our backbone. These results, in Table R.3, show that FouRA with r=16 and r=32 outperforms LoRA at r=32, suggesting FouRA is a generalizable PEFT method. Additionally, while we agree that FouRA was originally motivated by text-to-image models, we believe its unique aspects, such as compactness in the frequency domain and adaptive rank, are generalizable across domains. We prefer reporting GLUE results to show generalizability across multiple tasks. Having said this, we are open to moving the GLUE results to the Appendix and reorganizing the paper to further explain the benefits of FouRA in text-to-image tasks.

Q1: We thank you for pointing this out. Indeed, line 248 contains a typo. We use DeBERTaV3-base as the backbone, following [4]. Appendix C has implementation details and Appendix I has a code snippet.

Review
Rating: 7

This paper addresses a fundamental diversity limitation of any LoRA fine-tuned diffusion model. More specifically, distribution collapse can be observed with these fine-tuned models in the limited-data setting. The authors propose to address this problem by applying LoRA in the frequency domain. The Fourier transform provides a disentangled orthogonal basis, which is a more suitable space for low-rank adaptation, especially in the diffusion setting.

Strengths

The method is very well motivated and directly addresses a main limitation of LoRA in the generative setting. There are many qualitative and quantitative ablations showing the superiority of FouRA vs. vanilla LoRA. The disentangled low-rank space is clearly very effective for concept sliders.

It is very interesting to see FouRA does not degrade in performance when generalised to language tasks too. The authors also average over 3 seeds for these runs.

Weaknesses

My only concern is with the additional memory overhead induced by having to perform the forward and inverse Fourier transforms. The primary practical interest of LoRA is to make fine-tuning large models possible on lower-grade GPUs with lower memory. To me, parameter efficiency alone is more of a theoretical interest. I can see the authors have shown that the training time is not much higher than vanilla LoRA; however, I would like to see the memory overhead and how it scales with the batch size.

small points/spelling:

The main contribution and focus of this paper is on diffusion/generative models. Although the authors do show generality to discriminative tasks, I think it may make more sense to have "diffusion models" or some variant in the title.

L77 "denosing" L113 "gradined" L699 "Computaional"

Questions

L888: Where are these numbers coming from (1.15, 8.0, 2.3, 4.15)? Are the results very sensitive to these parameters, and are they used for all experiments presented here?

Limitations

It would have been nice to see a commitment to open-sourcing an official implementation, but other than this, yes, the authors have adequately addressed all the impacts and limitations of their work.

Author Response

We appreciate reviewer yqCK for their meticulous review and insightful feedback, helping us improve our work.

Memory Overhead/Scaling with batch size: Thank you for raising the point on memory. We provide details including memory overhead in Table R.1 of the rebuttal pdf. The reported numbers in Table R.1 are for a batch size of 8. Further, we report the scaling based on batch size in the following table:

| Batch Size | 8 | 6 | 4 | 2 |
| --- | --- | --- | --- | --- |
| LoRA | 53687 MB | 40872 MB | 28151 MB | 15499 MB |
| FouRA | 53894 MB | 41020 MB | 28255 MB | 15448 MB |

We can observe that the FouRA GPU memory overhead during training time is negligible and only 0.3-0.4% over LoRA. We will include our analysis in the paper.

Title: Thanks for the suggestion. We agree with your observation and will reflect it in the title (if the platform allows it). Having said this, we also show that FouRA is a generic approach which works on non-diffusion models; e.g., it shows benefits over LoRA on commonsense reasoning in Table R.3 of the pdf as well as on the GLUE benchmarks in Table 3 of the paper. On commonsense reasoning in Table R.3, for instance, the FouRA-LLama3(8B) model achieves an average accuracy of 85.3%, compared to 82.9% for the LoRA-LLama3(8B) model.

Minor: We highly appreciate your meticulous review of our work and have corrected these mistakes.

Question on L888: As discussed in Appendix C, we adopt an entropy-based gating approach to train the soft gating module as in prior work [7] (see global response for references). The numbers in question are derived from their code [7] and are consistent across all datasets/models. We use them for all experiments with adaptive gating. They act as temperature terms to scale the sigmoid function. Our analysis shows that the model isn't sensitive to these terms as we threshold the sigmoid output.

Comment

The authors have addressed my only main concern with this paper. I have looked through the other reviewers' comments and I will maintain my original score.

Comment

Thank you so much for your feedback, timely response and final recommendation. It has helped us improve the quality of our work.

Review
Rating: 6

The authors propose FouRA, a novel low-rank adaptation for pretrained diffusion models that successfully handles the data copying and distribution collapse problems observed in previous works. FouRA performs low-rank adaptation in the frequency domain and incorporates input-dependent adaptive rank selection during inference with the help of a learnable gating function. The authors show FouRA learns decorrelated projections, which is effective when merging adapters for multiple concepts. The paper demonstrates the superiority of FouRA through extensive experiments and analysis.

Strengths

  1. The proposed FouRA, which applies low-rank adaptation in the frequency domain with input-dependent rank selection, is well-motivated and novel.

  2. Multiple FouRA-trained adapters can be combined without further training and produce better-quality images than LoRA adapters.

  3. The authors support their claim thoroughly with extensive experiments and analysis throughout the paper, which makes their work solid. The experimental results are convincing.

Weaknesses

  1. One favorable property of LoRA is that it can be merged into the pretrained weights, due to its linearity. If my understanding is correct, the proposed FouRA cannot be merged with the base model's weights due to the intermediate gating function, which will consequently increase the latency of the model. The authors only provide training time in the computational analysis in the appendix, and I am curious how FouRA would affect the overall inference time.

  2. It seems an ablation study of each component of FouRA is missing. Such a study would help readers understand how each component affects the performance of FouRA. Also, a direct comparison between FouRA and FouRA with a fixed dynamic rank would further highlight the efficacy of the proposed adaptive rank gating method.

Questions

Please see the weaknesses.

Limitations

I do not see any serious societal impact in this submission.

Author Response

We thank reviewer AmEw for their constructive feedback and acknowledgement of our motivation/novelty.

Inference time: Thanks for suggesting the inference time analysis. As requested, we show the inference latency along with other compute analysis in Table R.1 of the provided pdf file. We observe that FouRA with dynamic frozen masking has the same inference time (14.9 steps/sec) as baseline LoRA (after merging the adapter into the weights), while achieving better visual generations, i.e., an HPS score of 30.3 (FouRA with dynamic frozen masking) vs. 27.7 (LoRA). While FouRA with inference-adaptive rank selection incurs more inference latency (11.1 steps/sec), it does achieve the best visual quality, i.e., an HPS score of 30.6. We will include this trade-off analysis in the paper.
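To illustrate why the frozen-masking variant matches merged-LoRA latency while the inference-adaptive variant does not, here is a hedged sketch (function names are purely illustrative):

```python
import torch

@torch.no_grad()
def fold_frozen_adapter(base_weight: torch.Tensor, delta_w: torch.Tensor) -> torch.Tensor:
    # With a frozen (input-independent) mask the adapter update is a fixed
    # matrix, so it can be folded into the base weight once and inference
    # then runs at base-model speed.
    return base_weight + delta_w

def adaptive_forward(x, base_layer, adapter_fn):
    # With inference-adaptive rank selection the gate depends on x, so the
    # adapter must remain a separate branch and adds some latency.
    return base_layer(x) + adapter_fn(x)
```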

Ablation on individual components of FouRA: Thank you for bringing this up. As suggested, we show the individual contributions of the FouRA modules in Table R.2 of the pdf. We fix rank=64 and $\alpha=0.8$, and provide results on the paintings validation set. As evident from the LPIPS-Diversity and HPS scores, the adaptive mask selection strategy performs better than the dynamic fixed mask selection strategy. For the case without the frequency transform, inference-adaptive masking improves the HPS score from 28.2 to 28.7. When combined with the frequency transform, the HPS score increases from 30.3 for frozen dynamic masking to 30.6 for inference-adaptive masking. These improvements are similar to those shown on the blue-fire validation set in Appendix E.1. We will add the ablation study in Table R.2 with the full breakdown to our main paper.

Review
Rating: 5

This paper proposes a new parameter-efficient fine-tuning method that operates in the frequency domain, termed FouRA. Specifically, the method learns low-rank adapter transforms on Fourier-transformed input features. It also incorporates an adaptive rank selection strategy that can vary during both training and inference. The authors provide theoretical analysis and extensive experimental results across multiple tasks, demonstrating FouRA's effectiveness in text-to-image generation, concept editing, and language understanding.

Strengths

  • Overall, this paper is well-written. All the content is organized properly. The proposed method is described clearly and in detail.
  • The idea of operating in the frequency domain is novel. It provides a reasonable way to interpret the learned LoRAs and to control the generated images.
  • The authors provide theoretical analysis and proofs for their claims, including lemmas on singular value decomposition and sparsity.
  • The paper includes pretty comprehensive experimental results on multiple tasks, including text-to-image generation, image editing, and language understanding. They compare FouRA to existing methods like LoRA and provide both quantitative and qualitative results.

Weaknesses

  • While the qualitative results are appealing, it would be great to include more quantitative evaluation and more baselines. The improvement on the GLUE tasks is not that significant.
  • The paper does not provide a detailed analysis of the computational overhead of FouRA compared to LoRA. While there is a brief mention in the appendix, a more thorough discussion would be beneficial.
  • Minor: Please consider enlarging the fonts in Figure 3.

Questions

  • The proposed method effectively addresses LoRA's "data-copying" phenomenon. I am wondering whether this "data-copying" effect is caused by the overfitting of LoRAs and can also be eliminated by early stopping.

Limitations

There is no negative societal impact.

Author Response

We appreciate reviewer R4Sn for their insightful feedback to help us improve our work.

Quantitative Results: Thank you for the suggestion. Based on your recommendation, we provide more quantitative analysis in the rebuttal pdf. We have trained FouRA adapters over a LLaMA3-8B model and tested on eight publicly available commonsense reasoning tasks, following the split from [5] and the implementation from [6] (see the global response for references). Our method outperforms LoRA across all benchmarks in both the rank=32 and rank=16 settings, as summarized in Table R.3 of the rebuttal pdf.

Compute Analysis: We have conducted a more in-depth analysis of both training and inference time, along with GPU memory, in Table R.1 of the provided pdf. In summary, each training epoch takes 24.5s for FouRA vs. 22.0s for LoRA, and the GPU memory consumption of FouRA and LoRA is comparable. We have also provided measurements of inference time in the table. Additionally, we provide training complexity vs. performance curves for FouRA and LoRA in Figure R.1 of the rebuttal pdf. From these results, it is clear that FouRA provides a better operating point in the performance-vs-compute tradeoff compared to LoRA. We will add this analysis to our paper.

Minor: Thank you for the suggestion, we will enlarge the fonts in Fig. 3 of the main paper.

What causes data-copying? Thank you for bringing this up. We have provided experimental analysis to answer this question; please refer to Figure R.2 of the pdf. We track LPIPS-diversity as a measure of data-copying and HPS-v2 scores as a measure of adapter quality. We do notice fewer data-copying artifacts in the initial phase of training. However, the adapter quality and strength are sub-par due to inadequate training (i.e., the style is not visible in the image), which is reflected in the HPS-v2 alignment scores. The images produced are similar to those from the base model, and hence fewer artifacts exist. As training progresses, images start to represent the adapter style (reflected by the HPS scores). Once we reach this point, the number of data-copying artifacts increases significantly for LoRA, as tracked by LPIPS-diversity. FouRA achieves the adapter style while being able to produce a diverse range of images, as seen in Fig. 1 of the main text. We also observe this trend when we visualize images from intermediate epochs. We will include these results in our appendix.
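For reference, LPIPS-diversity here refers to the mean pairwise LPIPS distance among images generated for the same prompt; a minimal sketch using the public `lpips` package (tensor layout and scaling are assumptions):

```python
import itertools
import lpips   # pip install lpips
import torch

@torch.no_grad()
def lpips_diversity(images: torch.Tensor) -> float:
    """Mean pairwise LPIPS over a batch of images (N, 3, H, W), N >= 2,
    with pixel values scaled to [-1, 1]. Higher values indicate more
    diverse generations, i.e. fewer data-copying artifacts."""
    metric = lpips.LPIPS(net="vgg")
    pairs = list(itertools.combinations(range(images.shape[0]), 2))
    dists = [metric(images[i:i + 1], images[j:j + 1]).item() for i, j in pairs]
    return sum(dists) / len(dists)
```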

Author Response

We appreciate all the reviewers for providing insightful reviews, which have truly helped us improve our work. We provide a single-page PDF including tables and figures to supplement our responses to reviewers' comments.

Reviewers largely acknowledged multiple aspects of the paper, such as “paper is well-written” and “comprehensive experimental results” (R4Sn); “well-motivated and novel”, “extensive experiments”, and “results are convincing” (AmEw); “method is very well motivated” and “many qualitative and quantitative ablations” (yqCK); and “well-organized” and “clear and easy to follow” (4zX8).

Multiple reviewers raised questions about the compute overhead (during training/inference) introduced by FouRA compared with baseline LoRA. To address these concerns, we have now provided an in-depth analysis of the computational and runtime complexity of our method, both at training and at inference, in Table R.1 of the rebuttal pdf. In summary, each training epoch takes 24.5s for FouRA vs. 22.0s for LoRA, and the peak memory consumption of FouRA and LoRA is comparable. Additionally, we provide training complexity vs. performance curves for FouRA and LoRA in Figure R.1. From these results, it is clear that FouRA provides a better operating point in the performance-vs-compute tradeoff compared to LoRA.

Another common question from the reviewers (R4Sn and 4zX8) concerns the insufficient quantitative backing of FouRA as a general PEFT method, due to the limited experiments on the GLUE benchmark in the main paper. To address this concern, we provide additional experiments on eight commonsense reasoning benchmarks in Table R.3 with a Llama-3 backbone. Our results show clear benefits of FouRA compared with LoRA in terms of both performance and complexity.

We have addressed clarification questions raised by reviewers under their respective individual responses. The list of references mentioned in all individual reviewers’ responses is provided below.

References (from all individual responses):

[1] Zeng, Yuchen, and Kangwook Lee. "The expressive power of low-rank adaptation." arXiv preprint arXiv:2310.17513 (2023).

[2] Eckart, Carl, and Gale Young. "The approximation of one matrix by another of lower rank." Psychometrika 1.3 (1936): 211-218.

[3] Zhang, Jun, Yixin Liao, Xinshan Zhu, Hongquan Wang, and Jie Ding. "A deep learning approach in the discrete cosine transform domain to median filtering forensics." IEEE Signal Processing Letters 27 (2020): 276-280.

[4] Ding, Ning, et al. "Sparse low-rank adaptation of pre-trained language models." arXiv preprint arXiv:2311.11696 (2023).

[5] Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. "LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models." In Proceedings of EMNLP 2023.

[6] Liu, Shih-Yang, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. "DoRA: Weight-decomposed low-rank adaptation." arXiv preprint arXiv:2402.09353 (2024).

[7] Garg, Prachi, et al. "Memorisation and Generalisation in Deep CNNs Using Soft Gating Mechanisms."

Final Decision

The paper introduces a novel approach that enhances the diversity and generalization of fine-tuned text-to-image diffusion models by applying Fourier domain projections and adaptive rank selection. This method addresses key issues in existing LoRA techniques, such as data copying and distribution collapse. The proposed approach is theoretically sound and backed by rigorous analysis, which adds credibility to the authors' claims. While a number of issues were raised by the reviewers, such as the lack of detailed analysis of the computational overhead and the potential latency due to the intermediate gating function, the authors have sufficiently addressed them during the rebuttal. Thus, I believe the overall contributions are significant, and I recommend accepting this paper.