PaperHub

Average rating: 5.5 / 10 · Poster · 4 reviewers
Ratings: 7, 5, 5, 5 (min 5, max 7, std 0.9)
Average confidence: 4.5
Soundness: 2.8 · Contribution: 3.0 · Presentation: 3.5

NeurIPS 2024

Expanding Sparse Tuning for Low Memory Usage

OpenReview · PDF
Submitted: 2024-05-08 · Updated: 2024-11-06
TL;DR

A PEFT method that leverages the high performance of sparse tuning with LoRA-level memory footprints.

Abstract

Keywords
sparse tuning, parameter-efficient fine-tuning, vision model fine-tuning

Reviews and Discussion

Review (Rating: 7)

This paper proposes a method called SNELL for vision model fine-tuning. It extends matrix decomposition from a kernel perspective and designs a novel sparsification mechanism for end-to-end parameter selection. Experimental results show that the proposed SNELL achieves low memory requirements and high performance on downstream tasks simultaneously.

Strengths

  1. The proposed method is novel and can achieve low memory requirements and high performance concurrently, which is valuable for practical applications. This is particularly noteworthy as many existing methods primarily focus on reducing tunable parameter volumes rather than memory usage.
  2. The proposed method can perform data-dependent parameter selection and tuning in low-resource scenarios. I think this approach has the potential to advance research in other areas, for example, model editing.
  3. This paper conducts extensive experiments, especially on large-scale vision models including ViT-L and ViT-H, demonstrating the high performance and memory efficiency of the proposed SNELL.
  4. This paper extends LoRA with a kernel perspective for sparse tuning. The kernel perspective contributes to new insights into LoRA and promotes its further improvements.
  5. This paper is well-written and easy to follow.

Weaknesses

  1. It would be better to introduce the utilized kernel function for the first time in the method section rather than in the experiment section.
  2. The authors should mention in the paper that the implementation details of Figure 4(a) are located in Appendix A.3, so that readers can easily access the specific details of this figure.
  3. For Figure 3(b), the authors should explain why the memory savings of SNELL are more significant for larger models.
  4. Although the paper includes comprehensive experiments, I would still like to see a comparison between the proposed SNELL and other visual PEFT methods, such as MoSA [1] and VQT [2].

[1] Zhang et al., "MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning", arXiv:2312.02923.

[2] Tu et al., "Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning", CVPR 2023.

Questions

Please check detailed comments in the Weaknesses part.

Limitations

NA

Author Response

We would like to thank you for the detailed comments! We will diligently follow your guidance to further improve our work and manuscript.

W1: The utilized kernel function.

Thank you for your suggestion! We will introduce the utilized kernel function in Section 3.2 in the revised paper.

W2: Implementation Details of Figure 4(a).

Thank you for your advice! We will indicate where the implementation details can be found in the caption of Figure 4(a) in the revised paper.

W3: Why are the memory savings of SNELL more significant for larger models?

  1. The memory-saving mechanism: As Fig. 3(c) shows, the memory savings of SNELL are attributed to the reduction of learnable parameters stored in the optimizer. For a matrix $\mathbf{W}\in \mathbb{R}^{n\times n}$, the memory usage of SNELL-$r$ is only a proportion $\frac{2r}{n}$ of that of full fine-tuning (see the sketch below).
  2. Memory significance for large models: As the model size increases, the value of $n$ and the number of weight matrices increase, making the parameter reduction increasingly significant.

In the revision, we will introduce the memory-saving mechanism of SNELL in the Appendix.
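To make the proportion above concrete, here is a minimal sketch (illustrative only, not from the paper) that counts learnable parameters for a single weight matrix; actual memory further scales with the optimizer's per-parameter state (e.g., two moments for Adam), so the ratio carries over:

```python
# Minimal sketch (illustrative, not the paper's code): learnable-parameter
# counts for one n x n weight matrix under full fine-tuning vs. two rank-r
# factors as in LoRA/SNELL. Optimizer memory scales with these counts.
def param_ratio(n: int, r: int) -> float:
    full = n * n          # full fine-tuning stores the whole matrix
    low_rank = 2 * n * r  # two factors of shape (n, r) and (r, n)
    return low_rank / full  # = 2r/n

# Assumed hidden sizes for ViT-B/L/H (768, 1024, 1280); the ratio shrinks as n grows.
for n in (768, 1024, 1280):
    print(n, param_ratio(n, r=8))
```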

W4: Comparison with MoSA and VQT.

Thank you for mentioning the two methods. We follow your suggestion and further conduct comparisons with MoSA and VQT on the VTAB-1K dataset.

| Method   | Natural | Specialized | Structured | Mean Acc. |
|----------|---------|-------------|------------|-----------|
| MoSA     | 79.9    | 84.0        | 50.3       | 71.4      |
| VQT      | 72.7    | 84.5        | 49.3       | 68.8      |
| SNELL-8  | 82.0    | 85.7        | 61.6       | 76.4      |
| SNELL-16 | 82.4    | 86.1        | 61.7       | 76.7      |
| SNELL-32 | 82.7    | 86.1        | 61.8       | 76.9      |

The results demonstrate the superiority of our SNELL over these methods, and we will add these comparisons in the revision.

Comment

Thanks to the authors for the rebuttal. After a careful reading of the authors' response and the other reviewers' comments, my concerns have been clearly addressed. I think that the paper does have merits, and I recommend the acceptance of this paper.

Comment

Dear Reviewer, thank you very much for your support and recognition. We will follow your guidance to refine our manuscript.

Review (Rating: 5)

This paper focuses on sparse tuning among PEFT methods, introducing kernel function modeling applied to the two low-rank matrices (adapters) of LoRA, which incurs low memory usage and leads to better performance. To achieve this, a kernelization of LoRA is proposed to map the dot product of the two low-rank matrices from a lower rank r to an implicit higher rank d, leading to stronger model capacity at a relatively marginal memory cost. Further, an adaptive soft-masking strategy is used to implement sparsification during fine-tuning so as to reduce memory usage. Extensive experiments are presented to validate both the memory efficiency and the performance improvements.

Strengths

  1. Mapping the dot product of the low-rank matrices into an implicit space using a conventional kernel function sounds like a promising direction, since classical machine learning techniques can improve the representation ability of the model at a marginally small computation cost.

  2. The experimental results are extensive, and the ablative studies showcase convincing results.

  3. The performance looks superior to most previous works, including both adapter methods and weight-decomposition methods.

  4. The presentation and writing of this paper are easy to follow.

Weaknesses

  1. The competition-based sparsification mechanism seems somewhat overclaimed, because the primary idea of this soft thresholding is widely used to re-activate or de-activate neurons or bits so as to induce sparsification. However, the "competition" aspect is not really demonstrated: the paper neither shows competition among the bits of the mask nor gives a clear explanation of the ∆W computation. Can you provide better clarity?

  2. In the memory usage comparison among LoRA, Adapter, and SNELL, SNELL seems to display little advantage over these works. I suppose the improvements come from the kernel function modeling, since this gives the model stronger capacity to adapt to downstream tasks, e.g., nonlinear modeling ability. Can the authors provide a more comprehensive explanation and experimental comparisons?

  3. There are some typos in the writing. In Line 6, 'accompanied by increases in' should be 'accompanied by increasing in'.

Questions

  1. Regarding Lines 47-48, I am confused about the fine-tuning implementation: storing only the tunable weights in the optimizer, e.g., in PyTorch, is still practical. Hence, we may not need to put all the pretrained and frozen weights into the optimizer during the training phase. Can the authors give a clear explanation of this part?

For the other questions, please refer to the weaknesses above.

Limitations

Overall, the presentation of this paper and its extensive experiments seem promising. However, I still have the above confusions regarding the motivation and the actual contribution of the sparsification part.

I will keep watching the responses from the authors to evaluate if my concerns are well addressed.

Author Response

Thank you very much for your valuable comments that help us provide better clarification and explanation of our work! We hope the following responses can address your concerns.

W1: Competition-based Sparsification Mechanism Clarification

Thank you for your feedback. We now provide a more detailed explanation of the competition in sparsification regarding intuition and computation.

  1. The primary idea of the "competition" is a dynamic threshold rather than a soft-threshold function. Specifically, we utilize a quantile of the values in $\Delta W$ as a dynamic threshold to control sparsity. This ensures that only a fixed proportion of weights remain non-zero. Therefore, the weights have to "compete" with each other to be selected, instead of merely exceeding a fixed threshold (a minimal sketch of this thresholding is given after this response).

  2. Competition in the computation. In the forward computation of $\Delta W$, weights are zeroed out based on their ranks in the competition. However, the competition actually happens in the backward computation, where weights are updated by gradient descent. Weights that contribute more to the task gain larger values and thus win the competition.

In the revision, we will incorporate the above clarifications into Section 3.2.
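A minimal sketch of the dynamic-threshold idea described above (illustrative only; SNELL's actual mechanism, e.g., how the mask interacts with the backward pass, may differ): a quantile of |ΔW| serves as the threshold, so a fixed proportion of entries survives and entries compete by rank rather than against a fixed cutoff.

```python
import torch

# Illustrative quantile-based dynamic threshold (not the authors' code):
# keep only the largest-magnitude (1 - sparsity) fraction of entries.
def sparsify_by_quantile(delta_w: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    threshold = torch.quantile(delta_w.abs().flatten(), sparsity)
    mask = (delta_w.abs() >= threshold).to(delta_w.dtype)
    return delta_w * mask

delta_w = torch.randn(768, 768)
sparse_dw = sparsify_by_quantile(delta_w, sparsity=0.9)
print((sparse_dw != 0).float().mean())  # ~0.1 of entries remain non-zero
```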

W2: Comparison between SNELL and LoRA

Thank you for your valuable suggestions that help us better express our novelty.

Compared to LoRA, our improvement is a significant increase in performance. We provide comparisons between SNELL and LoRA in terms of performance and memory usage.

  1. Our Performance Improvement over LoRA primarily arises from the combination of kernelized LoRA and the sparsification mechanism.

    • LoRA with kernel functions achieves performance improvement (KLoRA in Table A8).
    • LoRA with sparse tuning shows performance degradation (Table 4(a)).
    • SNELL with both kernel functions and sparse tuning achieves better performance than LoRA (Tables 1, 2, and 3) and KLoRA (Table 4(b)). This is because the higher ranks of the weight matrix in KLoRA better support the sparsification process (L166-174).
  2. Comparable Memory Usage. For memory reduction, SNELL shares the same mechanism as LoRA, which only stores the small low-rank matrices as learnable parameters (Figure 3(c) and L275-278). The slight memory increase of SNELL compared to LoRA is due to the nonlinear kernel functions (a toy sketch of a kernelized low-rank update is given after this response). We provide a memory usage comparison of SNELL and LoRA below. The impact of the kernel function on memory usage becomes increasingly negligible as the model size grows.

    | Pre-trained Model | Mem. LoRA | Mem. SNELL | Mem. Delta / Mem. LoRA |
    |-------------------|-----------|------------|------------------------|
    | ViT-B/16          | 1546      | 1673       | 0.082                  |
    | ViT-L/16          | 4325      | 4519       | 0.045                  |
    | ViT-H/14          | 9325      | 9692       | 0.039                  |

In the revision, we will incorporate the above explanation and comparison in the experiment section.
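To illustrate how a kernel function can lift a low-rank product to a higher effective rank at little extra cost, here is a toy sketch. The specific kernel SNELL uses is not given in this discussion, so a polynomial kernel is used purely as an assumed placeholder:

```python
import torch

d, r = 768, 8
B = torch.randn(d, r)  # low-rank factors, as in LoRA
A = torch.randn(r, d)

# Plain LoRA update: rank(delta_w_lora) <= r.
delta_w_lora = B @ A

# Kernelized variant (toy): apply a kernel k(B[i, :], A[:, j]) to every pair
# of factor vectors. With a polynomial kernel k(u, v) = (u @ v + c) ** p this
# is just an elementwise map of B @ A, so the learnable parameters are still
# only A and B, while the resulting matrix is no longer limited to rank r.
c, p = 1.0, 3
delta_w_kernel = (B @ A + c) ** p

print(torch.linalg.matrix_rank(delta_w_lora))    # <= 8
print(torch.linalg.matrix_rank(delta_w_kernel))  # typically far above 8
```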

W3: Typos

Thank you for your advice! We will change Line 6 from "by increases" to "by increasing" in the revision.

Q1: High Memory Usage of Previous Sparse Tuning Methods

Thank you for your careful reading of our paper; we will further clarify L47-48.

  1. Indeed, PyTorch supports storing only learnable parameters in the optimizer, but these learnable parameters must be stored in the form of structured matrices.
  2. Unfortunately, sparse tuning involves selecting a variable number of parameters at different locations in the matrix, which is unstructured and makes it impractical to store only the tunable parameters in the optimizer.
  3. Thus, current methods have to treat the entire weight matrix as learnable and incorporate a mask into the gradient during backpropagation. This strategy only implements sparse updates of the matrix and is no better than dense updates in terms of memory usage (see the sketch below).

In the revision, we will emphasize the relation between the unstructured nature of sparse tuning and its high memory usage.
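The contrast can be made concrete with a small PyTorch sketch (illustrative only, with made-up shapes): low-rank factors can be handed to the optimizer as compact structured tensors, whereas unstructured sparse tuning must register the full matrix and mask its gradient, so the optimizer state stays full-size.

```python
import torch

d, r = 768, 8
W = torch.randn(d, d)  # pretrained weight, kept frozen

# (a) LoRA/SNELL-style: only the two small factors are learnable, so the
# optimizer allocates state (e.g., Adam moments) for just 2 * d * r values.
A = torch.nn.Parameter(torch.zeros(r, d))
B = torch.nn.Parameter(torch.randn(d, r) * 0.01)
opt_lowrank = torch.optim.Adam([A, B], lr=1e-3)

# (b) Unstructured sparse tuning: the tunable entries are scattered across W,
# so the whole d x d matrix is registered as learnable and sparsity is
# enforced by masking its gradient; optimizer state covers all d * d values.
W_full = torch.nn.Parameter(W.clone())
grad_mask = (torch.rand(d, d) < 0.1).float()  # e.g., 10% of entries tunable
opt_sparse = torch.optim.Adam([W_full], lr=1e-3)

x = torch.randn(32, d)

(x @ (W + B @ A)).pow(2).mean().backward()  # gradients only for A and B
opt_lowrank.step()

(x @ W_full).pow(2).mean().backward()
W_full.grad *= grad_mask  # sparse update, but full-size gradient and state
opt_sparse.step()
```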

Comment

Somehow the authors have addressed my concerns. Therefore, I maintain my initial score.

Review (Rating: 5)

This paper proposes a method called SNELL for achieving sparse tuning of pre-trained models with low memory usage. The authors employ LoRA to reduce the number of learnable parameters in the optimizer and utilize kernel tricks to ensure the merged matrix maintains a high rank. Additionally, during the sparse training phase, the authors introduce a competition-based sparsification mechanism to avoid the storage overhead of weight indices. Experiments on two benchmarks demonstrate that this method achieves leading performance while reducing memory usage.

Strengths

  1. This work introduces SNELL, a sparse tuning method based on LoRA, which combines the memory efficiency of LoRA with the performance of sparse tuning.
  2. Experiments conducted on two benchmarks show that SNELL consistently outperforms LoRA, and its memory usage is compared against the baselines. The authors also conduct comprehensive experiments on different pre-training methods, model architectures, and component ablations.
  3. The paper is well-organized and well-written, making it easy to follow.

Weaknesses

  1. The idea of this paper isn't that novel. The overall approach of this work is a combination of existing fine-tuning techniques, LoRA and sparse tuning, with improvements to LoRA relying on existing kernel tricks. The competition-based sparsification mechanism merely sets a soft threshold.
  2. As a work on sparse tuning, it’s really a pity that this paper does not compare with the latest pure sparse tuning work, GPS[1], nor is it mentioned in the related work section. This severely limits the contribution of this work. In fact, comparing with the state-of-the-art sparse tuning work is essential. If the sole aim is to reduce memory usage at the expense of performance, it may not be very meaningful.
  3. As a PEFT method, the paper does not report the number of trainable parameters, citing the difficulty of calculation due to the sparse tuning method. This is unreasonable to me, as other sparse tuning methods (e.g., GPS) have reported this metric. This metric is also an important evaluation standard for PEFT.
  4. Generally, reparameterization-based methods refer to those where the additional modules can be reparameterized into the original model after training, such as LoRA and SSF[2] (which is also not compared with SNELL). In contrast, methods like Bitfit, Partial, and sparse tuning are not in this category and are referred to as specification-based methods in reviews like Delta-Tuning[3].

[1] Zhang Z, Zhang Q, Gao Z, et al. Gradient-based Parameter Selection for Efficient Fine-Tuning[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 28566-28577.

[2] Lian D, Zhou D, Feng J, et al. Scaling & shifting your features: A new baseline for efficient model tuning[J]. Advances in Neural Information Processing Systems, 2022, 35: 109-123.

[3] Ding N, Qin Y, Yang G, et al. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models[J]. arXiv preprint arXiv:2203.06904, 2022.

Questions

  1. During the experiments, was there a performance comparison with sparse tuning applied to the entire pre-trained model?
  2. The paper mentions that the competition-based sparsification mechanism can save memory usage for weight indices. What proportion of the total learnable parameters does this memory usage represent?
  3. In the sparse tuning part, have you tried fixing the sparse matrix during training to evaluate its effect?

Limitations

The paper mentions that to achieve sparsification, the re-computation of the LoRA merged adaptation matrix ΔW can lead to reduced computational efficiency. Although the authors explain that designing appropriate GPU operators can solve this problem, this may not be feasible in practice or may incur higher costs. If this approach can solve the problem, then the high memory usage issue of sparse tuning itself could also be addressed similarly.

Author Response

Thank you very much for your feedback; it has been very insightful for our work. Before addressing your questions, we would like to clarify some misunderstandings regarding Weaknesses 1 and 3 and the Limitation. All tables are presented in the PDF response due to the character limit.

W1: Novelty of SNELL

We want to clarify that our contributions, inspired by the insights on neurodynamic systems and neuroscience, are essentially different from previous methods.

  1. New perspective on LoRA. Inspired by the neurodynamic system (L75, L173-175), we interpret matrix vectors as coordinates in a dynamical space and endow the space with complex distances to improve expressivity. The complex distance can be measured by kernel functions. Therefore, our contribution is providing a new perspective for modeling the weight matrix during fine-tuning, far more than just using kernel tricks. In addition, we introduce nonlinearity to LoRA, representing a fundamental departure from other linear extensions [R1].
  2. The novel sparsification is inspired by the neuron competition phenomenon (L64-66). Instead of a soft-threshold function, we propose a dynamic threshold, which selects weights based on their ranking and induces weight competition during end-to-end fine-tuning. This distinguishes our approach from others, which select weights based on their values and employ a soft threshold as a parameter [R2] or increase it using prior rules [R3].

In the revision, we will detail the theory behind SNELL and clarify our differences from other methods.

W3: Tunable Parameters Volume

We acknowledge the importance of this metric and will try our best to estimate it. Before reporting estimations, we want to stress that the difficulty in computing this metric does not stem from the sparse-tuning but from the combination of LoRA and sparse-tuning (L644-656).

  1. Difficulty of computing this metric: In (kernelized) LoRA, this metric counts elements in the two low-rank matrices that are used to recover weight matrices. However, the sparsification in SNELL requires tracing from the recovered matrix to the low-rank matrices and locating the sparsified elements, which is difficult.
  2. Volume Estimation: Please see Table R1 in the response pdf for details.
  3. Memory usage is more demanding when fine-tuning large models. A PEFT method can require fewer tunable parameters but still use more memory, such as pure sparse tuning with larger sparsity compared to SNELL.

Limitation: Applicability of SNELL and Problems in Pure Sparse-tuning.

  1. SNELL is applicable in practice: SNELL exhibits only a very marginal increase in training time compared to LoRA (Table A9), which does not impede its practical application. This is demonstrated by our implementation of SNELL on large models (ViT-L and ViT-H) in Table A8. Appropriate GPU operators can further improve SNELL's training speed.
  2. Pure sparse-tuning CANNOT achieve LoRA-level memory usage by recomputing during back-propagation: The low-rank matrices in the optimizer allow SNELL to recompute the large merged matrix during the back-propagation without saving them in the optimizer. In contrast, pure sparse-tuning methods without low-rank matrices have to consistently store large weight matrices in the optimizer.

W2: Comparing with GPS

Thank you for bringing this important work to our attention. Regrettably, we did not mention GPS in the submitted draft, partially due to the unavailability of its source code at the submission deadline.

With the published code, we now provide fair comparisons with GPS in performance (Table R2 in the response pdf) and memory usage (Table R3 in the response pdf). SNELL performs on par with GPS while requiring lower memory usage.

In the revision, we will discuss GPS as the latest SOTA sparse-tuning method and conduct more comparisons with GPS.

W4: Method Taxonomy

This question is interesting and prompts us to discuss the taxonomy of PEFT methods. We acknowledge the lack of a standard taxonomy in the community. We follow the taxonomy of SPT [R4], classifying methods based on whether additional modules are used during inference. This differs from Delta-tuning [3], which classifies according to the training mechanism. Additionally, we compare SNELL with SSF [2] (Table R4 in the response pdf). SNELL demonstrates better performance.

In the revision, we will provide a detailed discussion of this taxonomy and add the comparison results.

Q1: Sparse-Tuning without KLoRA

Your question is insightful and has inspired us to explore the relationship between low-rank properties and performance. Please see Table R5 in the response pdf for comparisons between SNELL and pure sparse-tuning. SNELL achieves better performance than pure sparse-tuning.

Q2: Parameter and Memory Savings of Our Sparsification

We want to clarify that the memory we save on weight indices does not reduce the tunable parameter volume. These indices are saved as constant float tensors (not parameters) that are multiplied with gradients during fine-tuning. Compared to pure sparse-tuning methods, which store an index for each parameter, our sparsification method saves memory proportional to the model's parameter volume. Therefore, the more parameters a model has, the more memory our method saves.

In the revision, we will discuss the memory-saving effect of our sparsification mechanism.
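As a rough illustration of the scale involved (assumed numbers, not figures from the paper): storing one float32 index/mask value per weight of a ViT-B-sized model with roughly 86M parameters costs a few hundred megabytes, and this cost grows linearly with model size.

```python
# Back-of-the-envelope (assumed parameter count, not from the paper):
# one float32 mask/index value per weight of an ~86M-parameter model.
params = 86_000_000
mask_bytes = params * 4          # 4 bytes per float32 entry
print(mask_bytes / 2**20)        # ~328 MiB of extra memory, linear in model size
```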

Q3: Fine-tuning with fixed sparse matrix

Your question is very enlightening for us to discover more advantages of our sparsification mechanism. Please see Tables R6 and R7 in the response pdf for details. The performance of our sparsification mechanism surpasses that of fixed sparse matrices.

[R1] KronA, arXiv'22.

[R2] STR, ICML'20.

[R3] ST-3, WACV'23.

[R4] SPT, ICCV'23.

[R5] ETLSRR, AAAI'23.

Comment

I appreciate the detailed explanations provided by the authors. Your responses have largely addressed my concerns. However, I believe it is still important to conduct a thorough literature review and compare your work with significant existing works. In the attached PDF, I noticed that the performance of SNELL is not particularly impressive, with minimal differences compared to other methods and its own ablations (such as removing kernelized LoRA and fixing the sparsification method). Additionally, the authors should emphasize the motivation and novelty of the paper more prominently. Considering the above points, I have increased my score to 5.

Comment

Dear Reviewer, thank you very much for your kind support and detailed feedback on our work. We greatly appreciate your suggestions, and we will further refine our revision by emphasizing the motivation and novelty of the paper more prominently. Additionally, we will elaborate more on the effectiveness of SNELL compared to other significant methods in performance and memory usage. Thank you again for your valuable feedback on our work, and we will strictly follow your guidance to refine our manuscript.

Review (Rating: 5)

The paper introduces SNELL (Sparse tuning with kerNELized LoRA), a method aimed at reducing memory usage during PEFT of large pre-trained models. SNELL achieves this by decomposing the tunable weight matrices into low-rank matrices and utilizing a competition-based sparsification mechanism, thereby avoiding the storage of large weight matrices and tunable weight indexes. This approach is demonstrated to achieve SOTA performance on multiple visual recognition tasks while maintaining low memory usage, making it viable for large-scale models.

Strengths

  1. The introduction of kernelized LoRA for sparse tuning is a novel approach that effectively combines the advantages of LoRA's low memory usage with the performance benefits of sparse tuning.
  2. SNELL significantly reduces memory usage compared to traditional fine-tuning and other PEFT methods, which is critical for scaling to larger models.
  3. Extensive experiments demonstrate that SNELL achieves state-of-the-art performance on various benchmarks, including FGVC and VTAB-1k, while maintaining low memory usage.

Weaknesses

  1. While the focus is on visual recognition tasks, it would be beneficial to explore the applicability of SNELL to other domains such as natural language processing. I expect the authors to further show some results on large language models.
  2. While LoRA offers benefits such as reduced memory usage, quick task switching, and the ability to serve multiple adapters, SNELL's use of dynamic masking during fine-tuning negates these advantages. I'm curious whether the authors could provide results using pre-defined fixed masks instead, which might help retain some of LoRA's beneficial properties.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Thank you for the insightful comments that help us extend the applicability and strengthen the novelty! We will diligently follow your guidance to further improve our work.

W1: Additional Results on Large Language Models.

Following your suggestion, we apply SNELL on LLaMA2-7B to adapt to the commonsense reasoning benchmark.

Experiments: We compare SNELL with LoRA and find that SNELL achieves better performance. This shows the applicability of SNELL to NLP tasks. Many other vision PEFT approaches lack this capability, as they require full-fine-tuning-level memory usage, as Figure 3(a) shows.

| Model    | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
|----------|-------|------|------|-----------|------------|-------|-------|------|---------|
| LoRA-32  | 69.8  | 79.9 | 79.5 | 83.6      | 82.6       | 79.8  | 64.7  | 81.0 | 77.6    |
| SNELL-32 | 71.4  | 82.9 | 80.7 | 82.1      | 80.9       | 82.6  | 68.0  | 80.8 | 78.7    |

In the revision, we will further explore the applicability of SNELL on different LLMs and benchmarks and update the experiment results.

W2: SNELL based on Pre-defined Fixed Masks.

Thank you for your suggestion. To address your concern, we first respectfully clarify that dynamic masking preserves LoRA's low memory usage and its ability to serve multiple adapters.

  • Memory usage: SNELL achieves comparable memory usage to LoRA (Figure 3(b)).
  • Ability to serve multiple adapters: Our masking is determined by the learnable low-rank matrices, so it is dynamic during fine-tuning and deterministic after fine-tuning. Therefore, compared with LoRA's reparameterization process, SNELL only adds a fixed mask. Both LoRA and SNELL can facilitate multiple adapters.

Second, by incorporating pre-defined masks, it is indeed possible to retain LoRA's quick training speed.

  1. Experiments: We first provide the training time of kernelized LoRA with pre-defined fixed masks (KLoRA-8-Fixed) and SNELL.

    | Method                | KLoRA-8-Fixed | SNELL-8 |
    |-----------------------|---------------|---------|
    | Training time (s/img) | 0.629         | 0.657   |

    Then we compare the performance on FGVC datasets between kernelized LoRA with pre-defined fixed masks (KLoRA-8-Fixed) and SNELL. The fixed masks are generated by SPT [R1].

    | Method        | CUB-200 | NABirds | Oxford Flowers | Stanford Dogs | Stanford Cars | Mean |
    |---------------|---------|---------|----------------|---------------|---------------|------|
    | KLoRA-8-Fixed | 88.0    | 82.1    | 99.0           | 89.4          | 88.4          | 89.4 |
    | SNELL-8       | 89.0    | 83.9    | 99.3           | 90.6          | 88.6          | 90.3 |
  2. Analysis: We find that fine-tuning with fixed masks can improve training speed (0.629 vs. 0.657).
    However, compared to our dynamic masking, fixed masking can hardly identify and adjust the most task-relevant weights in an end-to-end fashion, which leads to performance degradation (89.4 vs. 90.3).

Nevertheless, we believe that quick task switching is worth studying, and we will keep exploring it in the future.

We will incorporate the above experiments and analysis in the revision.

[R1] Sensitivity-aware visual parameter-efficient fine-tuning, ICCV'23.

Comment

Thank you for your response. Regarding the additional results on LLMs, I noticed that the accuracy is lower than that of DoRA [1]. As a result, the performance of SNELL is not sufficiently impressive. Concerning W2, the issue is that loading and computing adapters stored in an unstructured sparse format can be slow, which, in my view, limits SNELL's applicability compared to LoRA.

Given these considerations, I will maintain my current score.

Reference:

[1]: Liu, Shih-Yang, et al. "DoRA: Weight-Decomposed Low-Rank Adaptation." arXiv preprint arXiv:2402.09353 (2024).

Comment

Dear reviewer,

Thank you very much for your detailed feedback on our work. We hope the following response can address your concerns.

Performance Comparison with DoRA. Despite the mentioned performance on LLMs, we respectfully note that SNELL was submitted to the machine vision track. In this sense, it is more important to see how SNELL adapts visual foundation models to downstream vision tasks with both high performance and low memory usage.

To further compare SNELL with DoRA for vision tasks, we conduct experiments on FGVC and find that SNELL outperforms DoRA.

| Method   | CUB-200 | NABirds | Oxford Flowers | Stanford Dogs | Stanford Cars | Mean |
|----------|---------|---------|----------------|---------------|---------------|------|
| DoRA-32  | 88.9    | 83.7    | 99.3           | 90.4          | 87.5          | 90.0 |
| SNELL-32 | 89.1    | 84.2    | 99.3           | 90.7          | 89.8          | 90.6 |

All these results indicate the effectiveness of SNELL on vision tasks and its potentially wide applicability. We will further refine our revision by providing the comparison between DoRA and SNELL on both vision and NLP PEFT benchmarks.

Applicability. The effectiveness of SNELL consistently scales up to various large models, such as ViT-L, ViT-H, and LLaMA-2, indicating that the applicability of SNELL is comparable to that of LoRA.

Author Response

We would like to thank all the reviewers for their diligent efforts and valuable suggestions, which have greatly contributed to improving the quality of our manuscript.

Summary of strengths:

We sincerely appreciate that you find our method:

  • novel and promising (reviewers CYCL, hyWE, and 7e64);

  • provides extensive evaluations and convincing results (reviewers QyQ5, hyWE, and 7e64);

  • achieves state-of-the-art performance (reviewers CYCL, hyWE, and 7e64);

  • exhibits low memory usage (reviewers CYCL, QyQ5, and 7e64);

  • scalable to large models (reviewers CYCL and 7e64);

  • presents comprehensive ablations (reviewers QyQ5, hyWE);

  • well-organized and well-written (reviewers QyQ5, hyWE, and 7e64).

Final Decision

The paper presents SNELL, a novel approach that addresses the memory usage issues associated with sparse tuning in parameter-efficient fine-tuning (PEFT). By introducing kernelized LoRA and a competition-based sparsification mechanism, SNELL effectively combines the advantages of low memory usage and high performance. This innovation is significant for scaling PEFT to larger models. While there are some concerns about the clarity of the competition-based sparsification mechanism and the limited memory advantages over existing methods, the authors have sufficiently addressed the raised concerns, and the overall contributions of SNELL are strong. The approach shows state-of-the-art performance across multiple tasks with reduced memory usage, making it a valuable addition to the field. Therefore, I recommend accepting this paper, with suggestions to improve clarity and provide more detailed comparisons in future work.