PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 3.0
Novelty: 3.0 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

AdaMSS: Adaptive Multi-Subspace Approach for Parameter-Efficient Fine-Tuning

Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Parameter-Efficient Fine-Tuning, Low-Rank Adaptation, Low-Rank Representation

Reviews and Discussion

Review
Rating: 4
  • AdaMSS proposes a new parametrization for low-rank adaptation based on subspace segmentation.
  • The weights in transformer models can be written as $\hat{W}_0 = \hat{W}_0 Z_0$, with $\hat{W}_0$ being the rank-truncated SVD of the pretrained weights $W_0$, and $Z_0$ having an approximate block-diagonal structure along with an analytical solution.
  • This formulation is an application of the Low-Rank Representation (LRR) model.
  • The row space is segmented into $K$ smaller subspaces (blocks in the block-diagonal matrix $Z_0$), where each block contains the columns of $W_0$ assigned to it.
  • Based on this formulation, the weights can be written as $W_0^{(k)} = \hat{W}_0^{(k)} (Z_0^*)^{(k)} + \bar{W}_{\mathrm{res}}^{(k)}$.
  • The paper proposes a residual update $\Delta Z^{(k)}$, so that the final low-rank update can be written as a decomposition $W^{(k)} = W_0^{(k)} + A^{(k)} B^{(k)} C^{(k)}$, with $A$ and $B$ initialized from the SVD of $W_0$ and $C$ initialized to zero (a minimal sketch of this parametrization follows the list).
  • The paper further proposes to freeze ranks progressively based on the gradients of $B$ and $C$, leading to adaptive budget allocation.
  • AdaMSS achieves a lower theoretical bound on generalization.
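A minimal sketch of this parametrization as I understand it (the column grouping, shapes, and choice of which factors are trainable are illustrative guesses, not the authors' implementation):

```python
import torch

def build_multi_subspace_update(W0: torch.Tensor, K: int = 4, rk: int = 1):
    """Illustrative only: split the columns of a frozen weight into K groups and give
    each group k a low-rank residual A_k B_k C_k, with A_k, B_k taken from the
    truncated SVD of that group and C_k initialized to zero, so training starts
    exactly at the pretrained weights."""
    blocks = []
    for idx in torch.arange(W0.shape[1]).chunk(K):        # stand-in for subspace segmentation
        U, S, _ = torch.linalg.svd(W0[:, idx], full_matrices=False)
        A = U[:, :rk].clone()                              # from SVD (kept frozen here)
        B = torch.nn.Parameter(torch.diag(S[:rk]))         # from SVD, trainable
        C = torch.nn.Parameter(torch.zeros(rk, len(idx)))  # zero-init, trainable
        blocks.append((idx, A, B, C))
    return blocks

def effective_weight(W0: torch.Tensor, blocks) -> torch.Tensor:
    W = W0.clone()
    for idx, A, B, C in blocks:
        W[:, idx] = W0[:, idx] + A @ B @ C                 # W^(k) = W0^(k) + A^(k) B^(k) C^(k)
    return W
```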

Strengths and Weaknesses

Strengths

  • The AdaMSS formulation enables fine-tuning with very few additional parameters
  • The update is theoretically motivated from subspace segmentation, and the verification experiments confirm the theory
  • The bound on generalization is lower than prior methods, potentially offering better generalization capabilities

Weaknesses

  1. For the amount of complexity added, the performance improvement over prior methods like PiSSA is small (especially since PiSSA is much simpler)

    • The initialization of A and B requires computing the SVD of the weight matrices, which can be quite costly for large models
    • The authors mention that the matrices can be randomly initialized, but no empirical experiments have been performed to support this
    • While this method does reduce the number of parameters, in practice rank 16 is already very economical, and anything below that does not really reduce the training time/memory requirements much, especially for large models.
    • The question then is, what other benefits/insights does the method offer? Can the authors perform some experiments that demonstrate generalization?
  2. For the experiments on large language models, the comparison with equivalent trainable parameter counts for LoRA and PiSSA is missing

    • A comparison with equivalent parameter count would be useful to see how quickly any method can adapt with minimal parameters, showing more benefit of the structured adaptation
  3. Missing citation and comparison

    • The residual update formulation in equation 8 ($W^{(k)} = W_0^{(k)} + \hat{W}_0^{(k)} \Delta Z^{(k)}$) is very similar to [1]. Can the authors explain the difference and add a comparison?

Overall, the idea is interesting and the evaluation is good, and I would consider increasing my score if the authors can address my concerns.


References

[1] Chinmay Savadikar, Xi Song, and Tianfu Wu. "Generative Parameter-Efficient Fine-Tuning." arXiv preprint arXiv:2312.00700 (2023).

Questions

For all the experiments, which modules are fine-tuned (e.g., Query, Value, etc.)?

Limitations

Please refer to point (1) in the weaknesses for a limitation that I have observed.

Final Justification

My main concern was regarding the overhead of the method compared to LoRA and PiSSA, and whether it is justified. Because of the following two points, I have increased my score and recommend acceptance:

  • The authors have demonstrated that AdaMSS can adapt models to downstream tasks better than prior methods with a comparable number of parameters. This shows the presence of meaningful subspaces in pretrained model weights, which can potentially open more avenues for future research.
  • AdaMSS shows theoretical guarantees for lower test error.

Formatting Issues

N/A

Author Response

We sincerely appreciate the reviewer’s recognition of our work and the constructive feedback and insightful questions provided. Below, we address the concerns and questions raised.


R1. Performance improvement over prior methods like PiSSA

Thank you so much for your comments. Our method consistently outperforms other approaches—including PiSSA—in terms of accuracy, often with comparable or significantly fewer trainable parameters.

  • For Image Classification (IC), our method achieves over 6% higher accuracy than PiSSA on ViT-Base with a similar parameter budget.
  • For Natural Language Understanding (NLU), our method achieves over 7% higher accuracy than PiSSA on RoBERTa-Large, while using far fewer parameters. In our newly added experiments (as suggested by the reviewer), we have included results for PiSSA with $r=1$ (see Tables 1–2). These results show that, under similar parameter budgets, our method performs consistently across tasks on both RoBERTa-Large and RoBERTa-Base, and achieves substantially higher average performance than PiSSA.
  • For Natural Language Generation (NLG) (see additional results in the supplementary material), our method outperforms PiSSA on LLaMA 2-7B by: 7% higher accuracy on GSM8K, and 1.7% higher accuracy on MATH, while using fewer trainable parameters.

Below we summarize tasks where our method achieves at least 3% higher accuracy on average compared to other methods. Bold entries indicate cases where AdaMSS reduces the number of trainable parameters by more than 75% yet still achieves superior results.

| Tasks | PiSSA | LoRA | LoRETTA |
|---|---|---|---|
| IC | ViT-Base | ViT-Base, ViT-Large | ViT-Large |
| NLU | RoBERTa-Base, RoBERTa-Large | RoBERTa-Base, RoBERTa-Large | RoBERTa-Large |
| NLG | LLaMA 2-7B | LLaMA 2-7B | LLaMA 2-7B |

R2. The cost of initialization of AdaMSS

In our initialization phase, the most time-consuming step is the estimation of the number of subspaces. As discussed in our ablation study (see Appendix), when the number of subspaces is fixed to $K = 10$, the performance of our method with a fixed $K$ is comparable to that of AdaMSS with the subspace-number estimation strategy.

In the currently submitted supplementary material, we further compare AdaMSS without subspace estimation to other PEFT methods (including PiSSA, LoRA, LoRA-PRO, and LoRETTA) on LLaMA 2-7B, Mistral-7B, and Gemma-7B. In most cases, our method still achieves the best performance, even without estimating the number of subspaces. To make AdaMSS even more practical, we also apply low-rank SVD techniques to accelerate the subspace initialization process.

As shown in the currently submitted supplementary material, our method initializes faster than LoRETTA and completes in just a few seconds on LLaMA 2-7B, highlighting its computational efficiency.
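For reference, this kind of acceleration can be obtained with a randomized truncated SVD, which computes only the top-$q$ factors instead of a full decomposition; the snippet below is a generic illustration (the exact routine and rank $q$ used in our code may differ).

```python
import torch

@torch.no_grad()
def truncated_svd(W: torch.Tensor, q: int = 32, n_iter: int = 2):
    # Randomized low-rank SVD: only the top-q singular triplets are computed,
    # which avoids the cost of a full SVD on large weight matrices.
    U, S, V = torch.svd_lowrank(W, q=q, niter=n_iter)
    return U, S, V  # W ≈ U @ torch.diag(S) @ V.T
```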


R3. Comparison with random initialization

We have included a comparison between random initialization and orthogonal initialization in the currently submitted supplementary material (Figure 1). As shown by our experimental results, orthogonal initialization significantly outperforms random initialization.


R4. Training time/memory comparisons

Owing to space constraints, please see our response given in R1 for Reviewer J6fY.


R5. Clarifying generalization and experiments that demonstrate generalization

The generalization ability of a method is reflected in its expected test error. As established in the section "Analysis of AdaMSS", AdaMSS exhibits a lower upper bound on the expected test error compared to other methods, given the same number of training samples. This theoretical advantage is further supported by our experimental results, which consistently show lower average test errors for AdaMSS across a range of tasks.


R6. Advantages of AdaMSS

As discussed in R1 and R4–R5, our method offers the following key advantages.

  • (1) Superior Accuracy Across Tasks.
  • (2) Reduced Training Time and Memory Usage.
  • (3) Stronger Theoretical Guarantee: Theorem 1 establishes an upper bound on the expected loss of AdaMSS, providing a formal generalization guarantee that supports its theoretical soundness.

R7. Comparison with equivalent trainable parameter counts for LoRA and PiSSA on language models

Thank you so much for the valuable suggestion. Our original comparisons on ViT models already included PiSSA $(r=1)$ as a baseline.

Following your recommendation, we have extended our experiments on language models (RoBERTa) to include LoRA $(r=1)$ and PiSSA $(r=1)$ under minimal parameter settings, in addition to our method AdaMSS$_{base}$. Please refer to Tables 1 and 2 below.

As the experimental results demonstrate, our method consistently outperforms both LoRA and PiSSA under a low-parameter budget. Furthermore, AdaMSS exhibits more stable performance across tasks, highlighting the advantage of structured subspace updates in resource-constrained settings.

Table 1. Results of LoRA $(r=1)$, PiSSA $(r=1)$, and AdaMSS$_{base}$ on RoBERTa-Base

| Model | Trainable Param | CoLA | MRPC | QNLI | RTE | SST-2 | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 55K | 62.3±3.57 | 89.22±0.31 | 90.59±0.30 | 79.49±0.42 | 93.74±0.45 | 80.81±20.64 | 82.7 |
| PiSSA | 55K | 62.6±1.44 | 89.26±0.61 | 90.61±0.41 | 74.87±1.22 | 93.28±0.20 | 89.97±0.28 | 83.4 |
| AdaMSS | 32K | 64.5±1.1 | 88.8±1.4 | 92.4±0.1 | 77.3±0.7 | 94.6±0.2 | 90.4±0.1 | 84.7 |

Table 2. Results of LoRA $(r=1)$, PiSSA $(r=1)$, and AdaMSS$_{base}$ on RoBERTa-Large

| Model | Trainable Param | CoLA | MRPC | QNLI | RTE | SST-2 | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 147K | 62.16±2.38 | 88.33±0.65 | 93.86±0.16 | 82.24±2.47 | 95.67±0.41 | 78.15±29.72 | 83.4 |
| PiSSA | 147K | 56.58±6.19 | 84.90±3.35 | 93.40±0.31 | 65.92±11.26 | 95.18±0.23 | 91.26±0.24 | 81.2 |
| AdaMSS | 45K | 67.2±1.2 | 90.3±0.5 | 94.5±0.1 | 87.1±2.1 | 96.1±0.0 | 91.9±0.0 | 87.9 |

R8. Add citation [1] and comparison with [1]

We sincerely thank the reviewer for bringing this excellent work to our attention. WeGeFT is a recently accepted method at ICML 2025. In the revised version of our manuscript, we have included a discussion of WeGeFT [1] and added experimental comparisons against it.

Below, we provide a summary of the differences between AdaMSS and WeGeFT, as well as empirical comparisons on Natural Language Understanding (NLU) and Image Classification (IC) tasks.

  • (1) Motivation: AdaMSS is motivated by the multi-subspace structure of pretrained weight matrices. It seeks compact representations via lowest-rank representation and subspace segmentation, followed by a multi-subspace-based adaptive budget allocation strategy to reduce the number of trainable parameters while preserving model capacity.

In contrast, WeGeFT focuses on cross-layer parameter sharing, under the assumption that pretrained weights already contain useful "transferable knowledge" for downstream tasks.

  • (2) Methodology: AdaMSS leverages low-rank representation and subspace segmentation to obtain a compact parameterization of the weights: $\mathbf{W}^{(k)} = \mathbf{W}_0^{(k)} + \hat{\mathbf{W}}_0^{(k)} \Delta \mathbf{Z}^{(k)}$, where $\mathbf{W}^{(k)}$ consists of the weights belonging to the $k$-th subspace, $\Delta \mathbf{Z}^{(k)}$ is a low-rank matrix, and $\hat{\mathbf{W}}_0^{(k)}$ represents the principal components of the corresponding weight subspace.

    WeGeFT adopts the formulation $\mathbf{W}^l = \mathbf{W}_0^{l} + \mathbf{W}_0^{l} \Delta \mathbf{Z}$, where $\Delta \mathbf{Z}$ is a low-rank matrix shared across layers, and $\mathbf{W}^l$ denotes the weight matrix of the $l$-th layer.

Comparison with WeGeFT

Table 3 compares the performance of AdaMSS and WeGeFT on NLU tasks.
Table 4 shows their results on IC tasks.
Overall, AdaMSS outperforms WeGeFT on average for NLU (RoBERTa) and achieves comparable performance for IC (ViT).

Table 3. Comparison with WeGeFT on NLU (RoBERTa)

| Model | Trainable Param | CoLA | MRPC | QNLI | RTE | SST-2 | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| RoBERTa-Base (WeGeFT) | 49K | 63.5±1.3 | 89.46±0.49 | 91.23±0.42 | 78.6±1.60 | 94.13±0.46 | 90.5±0.06 | 84.57 |
| RoBERTa-Base (AdaMSS) | 32K | 64.5±1.1 | 88.8±1.4 | 92.4±0.1 | 77.3±0.7 | 94.6±0.2 | 90.4±0.1 | 84.7 |
| RoBERTa-Large (WeGeFT) | 65K | 64.04±1.99 | 75.74±7.67 | 93.7±0.3 | 53.55±1.19 | 94.95±0.34 | 91.37±0.26 | 78.89 |
| RoBERTa-Large (AdaMSS) | 45K | 67.2±1.2 | 90.3±0.5 | 94.5±0.1 | 87.1±2.1 | 96.1±0.0 | 91.9±0.0 | 87.9 |

Table 4. Comparison with WeGeFT on IC (ViT)

| Model | Trainable Param | CIFAR10 | CIFAR100 | StanfordCars | OxfordPets | FGVC | EuroSAT | RESISC45 | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| ViT-Base (WeGeFT) | 49K | 98.46±0.06 | 91.46±0.19 | 76.18±0.20 | 92.71±0.15 | 51.82±0.86 | 94.92±6.89 | 93.03±0.19 | 85.51 |
| ViT-Base (AdaMSS$_{base}$) | 42K | 98.71±0.07 | 91.90±0.1 | 78.98±0.2 | 93.91±0.2 | 53.2±0.4 | 98.64±0.07 | 93.62±0.08 | 86.99 |
| ViT-Large (WeGeFT) | 204K | 99.06±0.07 | 92.69±0.19 | 83.96±0.14 | 94.49±0.17 | 60.82±0.46 | 98.34±0.10 | 94.49±0.26 | 89.12 |
| ViT-Large (AdaMSS) | 241K | 99.12±0.00 | 93.22±0.1 | 85.24±0.3 | 94.87±0.1 | 64.31±0.4 | 98.93±0.1 | 94.87±0.1 | 90.13 |

R9. Which modules are fine-tuned?

For both IC and NLU, we fine-tune only the Query and Value projection modules across all methods for a fair comparison. For Natural Language Generation, we evaluate two settings: fine-tuning only the Query and Value projections, and fine-tuning all major projection modules.
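For the LoRA and PiSSA baselines, restricting adaptation to the Query/Value projections corresponds to a configuration along the following lines (shown only to make the module selection explicit; module names follow the Hugging Face RoBERTa/ViT implementations, and AdaMSS uses its own implementation rather than the PEFT library):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
# Adapt only the attention Query and Value projections; all other weights stay frozen.
lora_cfg = LoraConfig(r=1, lora_alpha=16, target_modules=["query", "value"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```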


We thank the reviewer again for the thoughtful questions. Please let us know if any aspects remain unclear—we would be truly grateful for the opportunity to further clarify.

Comment

Thank you for the detailed response. My concerns have been addressed, and I will raise my score from 3 to 4.

Comment

We sincerely thank the reviewer for the kind follow-up and for updating the score. We are truly grateful for the constructive feedback.

Review
Rating: 4

This paper introduces AdaMSS, a parameter-efficient fine-tuning method that divides neural network weights into multiple subspaces and adapts them with low-rank updates. By dynamically allocating parameters to the most important subspaces, AdaMSS achieves strong performance on vision and language tasks with far fewer trainable parameters than previous methods.

Strengths and Weaknesses

Strengths

  • The approach of further reducing LoRA’s trainable parameter count through multiple subspaces while largely maintaining performance is interesting. The introduction of Low-Rank Representation is relatively novel, and the final weight block update strategy is quite clever.
  • The paper analyzes the method’s complexity.

Weaknesses

  • The writing of the paper could be improved, for example by including a method diagram to help readers quickly grasp the main idea. The presentation of experimental results is not intuitive enough, making it hard for readers to immediately see the advantages of the method. The explanation of the importance score in the paper also lacks intuitiveness.
  • The paper could include more discussion comparing with other LoRA-based methods that leverage subspaces.

Questions

  • The paper claims that lower model complexity leads to better generalization, but this statement is not rigorous enough. Please provide a more detailed explanation.
  • Please discuss in more detail how the proposed method compares to other LoRA variants that utilize subspaces.
  • How does the proposed method perform in LLM scenarios? Please cover as many base models and parameter scales as possible.

Limitations

NA

Final Justification

After reading the authors' response, some of my concerns have been resolved. I keep my previous judgment of borderline acceptance.

Formatting Issues

No

Author Response

We sincerely appreciate the reviewer’s recognition of our work and the constructive feedback and insightful questions provided. Below, we address the concerns and questions raised.


R1. Clarity of presentation

We appreciate the suggestions. Following the recommendations, in the revised version, we will include a method diagram to summarize the proposed adaptation process and improve readability by simplifying the notation and derivations.

For the experimental section, we will reorganize the existing ablation studies and introduce a new discussion section to better highlight and explain the advantages of our method.

We believe that these revisions will significantly enhance the clarity of our work.


R2. Intuitive explanation of importance scores

The importance score used in this work is defined as the product of $\bar{I}^{(t)}$ and $\bar{U}^{(t)}$, where:

  • $\bar{I}^{(t)}$ measures the sensitivity of the weights; it is defined based on the magnitude of the gradient-weight product, which approximates the change in the loss function caused by that weight.
  • $\bar{U}^{(t)}$ captures the uncertainty in the sensitivity estimation.

Therefore, this importance score helps identify weights that either (1) contribute significantly to loss reduction, or (2) exhibit high uncertainty, making them important candidates for training.
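A minimal sketch of how such a score can be computed per training step is given below; the smoothing constants and the aggregation over each subspace are illustrative (this mirrors the common sensitivity/uncertainty smoothing used, e.g., in AdaLoRA), not necessarily the exact values used in AdaMSS.

```python
import torch

def update_importance(w, grad, I_bar, U_bar, beta1=0.85, beta2=0.85):
    """One step of the sensitivity/uncertainty bookkeeping sketched above."""
    s = (w * grad).abs()                                     # sensitivity: |weight * gradient|
    I_bar = beta1 * I_bar + (1 - beta1) * s                  # smoothed sensitivity  \bar{I}^{(t)}
    U_bar = beta2 * U_bar + (1 - beta2) * (s - I_bar).abs()  # uncertainty  \bar{U}^{(t)}
    score = I_bar * U_bar                                    # importance = \bar{I}^{(t)} * \bar{U}^{(t)}
    return I_bar, U_bar, score
```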


R3. Clarifying generalization

The superior generalization ability of a method refers to the following:
Given the same number of training samples and comparable training loss, the method achieves a lower expected test error than competing approaches.

To support this, Theorem 1 provides a generalization bound for AdaMSS, showing that the expected test error decreases with the Gaussian complexity of the learned function class. Since AdaMSS induces a function class with lower Gaussian complexity, it is theoretically expected to generalize better than methods with higher complexity.
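Schematically, bounds of this type take the standard form (this is the generic template behind such results, not the exact statement of Theorem 1): with probability at least $1-\delta$ over the draw of $n$ training samples,

$$\mathbb{E}\big[\ell(f(x),y)\big] \;\le\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big) \;+\; c_1\,\mathcal{G}_n(\mathcal{F}) \;+\; c_2\sqrt{\frac{\log(1/\delta)}{n}},$$

where $\mathcal{G}_n(\mathcal{F})$ is the Gaussian complexity of the hypothesis class and $c_1, c_2$ depend on the Lipschitz constant and range of the loss; a smaller trainable function class (here, the multi-subspace parametrization with few active directions) yields a smaller $\mathcal{G}_n(\mathcal{F})$ and hence a tighter bound.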

This theoretical advantage is further validated by our empirical results, which consistently demonstrate that AdaMSS achieves lower average test errors across a range of tasks.


R4. Detailed discussion of how AdaMSS compares with other LoRA-based subspace methods

We thank the reviewer for the suggestion to clarify how AdaMSS compares to other LoRA variants that utilize subspaces. Following your suggestion, we will add a detailed discussion comparing our method with other LoRA-based approaches in the revised version. In addition, we will include a new Discussion Section and introduce additional experiments to better illustrate the advantages of AdaMSS.

Below is a summary of the discussion for your reference:

1. Key Differences Between AdaMSS and Other Subspace-Based LoRA Variants

While many existing PEFT methods—such as PiSSA, AdaLoRA and LoRA-GA—rely on the assumption of a single low-rank subspace, AdaMSS departs from this by modeling the pretrained weight space as composed of multiple distinct subspaces.

AdaMSS clusters the column vectors of the weight matrix into separate subgroups, each forming an independent subspace with its own low-rank representation.
Unlike AdaLoRA, which dynamically adjusts the rank of a single subspace, AdaMSS performs explicit subspace segmentation and dynamically estimates the importance of each subspace, enabling more compact expressiveness and finer adaptation.

2. Advantages of AdaMSS over other Methods

  • (1) Reduced Training Time and Memory Usage:
    Thanks to its lower number of trainable parameters and the proposed adaptive budget allocation, AdaMSS achieves better memory efficiency and faster training.
    Moreover, our budget allocation scheme offers greater flexibility in balancing accuracy and training efficiency.

  • (2) Better Accuracy Across Multiple Tasks:
    AdaMSS consistently outperforms other methods—including PiSSA—on several benchmarks, while using fewer or comparable trainable parameters.
    In the following table, we summarize tasks where AdaMSS achieves at least 3% accuracy improvement over competing methods. Bold entries indicate cases where AdaMSS reduces the number of trainable parameters by more than 75% yet still achieves superior results.

| Tasks | PiSSA | LoRA | LoRETTA |
|---|---|---|---|
| IC | ViT-Base | ViT-Base, ViT-Large | ViT-Large |
| NLU | RoBERTa-Base, RoBERTa-Large | RoBERTa-Base, RoBERTa-Large | RoBERTa-Large |
| NLG | LLaMA 2-7B | LLaMA 2-7B | LLaMA 2-7B |

In the revised version, we have extended our experiments on language models (RoBERTa) to include LoRA $(r=1)$ and PiSSA $(r=1)$ under minimal parameter settings, to better highlight and explain the advantages of our method. Please refer to Tables 1 and 2.

  • (3) Theoretical Guarantee:
    Our method provides a stronger theoretical foundation. Theorem 1 formally derives an upper bound on the expected test loss of AdaMSS, offering a generalization guarantee that is not available in most other LoRA-based approaches.

Table 1. Results of LoRA $(r=1)$, PiSSA $(r=1)$, and AdaMSS$_{base}$ on RoBERTa-Base

| Model | Trainable Param | CoLA | MRPC | QNLI | RTE | SST-2 | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 55K | 62.3±3.57 | 89.22±0.31 | 90.59±0.30 | 79.49±0.42 | 93.74±0.45 | 80.81±20.64 | 82.7 |
| PiSSA | 55K | 62.6±1.44 | 89.26±0.61 | 90.61±0.41 | 74.87±1.22 | 93.28±0.20 | 89.97±0.28 | 83.4 |
| AdaMSS | 32K | 64.5±1.1 | 88.8±1.4 | 92.4±0.1 | 77.3±0.7 | 94.6±0.2 | 90.4±0.1 | 84.7 |

Table 2. Results of LoRA $(r=1)$, PiSSA $(r=1)$, and AdaMSS$_{base}$ on RoBERTa-Large

| Model | Trainable Param | CoLA | MRPC | QNLI | RTE | SST-2 | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|
| LoRA | 147K | 62.16±2.38 | 88.33±0.65 | 93.86±0.16 | 82.24±2.47 | 95.67±0.41 | 78.15±29.72 | 83.4 |
| PiSSA | 147K | 56.58±6.19 | 84.90±3.35 | 93.40±0.31 | 65.92±11.26 | 95.18±0.23 | 91.26±0.24 | 81.2 |
| AdaMSS | 45K | 67.2±1.2 | 90.3±0.5 | 94.5±0.1 | 87.1±2.1 | 96.1±0.0 | 91.9±0.0 | 87.9 |

R5. Cover as many base models and parameter scales as possible

Thank you so much for the comment. Our current experiments cover a diverse set of models, including ViT (Base/Large), RoBERTa (Base/Large), LLaMA 2-7B, Mistral-7B, and Gemma-7B, with parameter sizes ranging from 85.8M to 8.5B.
In most cases, our method achieves the best performance across these models.

Due to the limited rebuttal timeline, we were unable to include results on larger-scale models in this version. However, we are actively conducting these experiments and will incorporate a more comprehensive comparison with larger-scale models in the revised submission.


We thank the reviewer again for the thoughtful questions. Please let us know if any aspects remain unclear—we would be truly grateful for the opportunity to further clarify.

Comment

Thank you for the authors' response; I keep my positive score.

Comment

Thank you so much for your follow-up and for maintaining a positive score. We are truly grateful for the constructive feedback. We hope our responses have addressed your concerns—please kindly let us know if there’s anything we may have missed or could clarify further.

Review
Rating: 5

This paper proposes AdaMSS, a parameter-efficient fine-tuning (PEFT) method that adaptively updates a small subset of subspaces within a neural network's weight matrix. Unlike existing methods like LoRA or PiSSA that rely on a single low-rank subspace, AdaMSS identifies multiple subspaces via low-rank representation and performs fine-tuning in a block-diagonal, multi-subspace manner. It also introduces an adaptive budget allocation mechanism that selectively updates only the most relevant subspaces during training. The authors provide theoretical guarantees showing improved generalization bounds over prior PEFT methods. Extensive experiments across vision and language tasks demonstrate that AdaMSS achieves competitive or superior performance with significantly fewer trainable parameters.

Strengths and Weaknesses

Strengths

  1. Insightful empirical finding on multi-subspace structure

The paper presents an interesting and novel empirical observation: pretrained weight matrices across various models (e.g., ViT, RoBERTa, LLaMA) exhibit a multi-subspace structure. This is revealed through Low-Rank Representation (LRR) and spectral clustering, where the coefficient matrix shows an approximate block-diagonal form. This observation serves as a strong motivation for moving beyond single-subspace LoRA-style methods and is well supported by visual and quantitative evidence.

  2. Well-motivated Multi-Subspace-Based Incremental Update

The proposed AdaMSS framework is a conceptually clean and practically relevant solution. By projecting pretrained weights into multiple orthogonal subspaces and assigning fine-tuning capacity selectively based on subspace importance, the method provides a structured, task-aware approach to parameter-efficient fine-tuning. The use of residual connections further ensures expressiveness while maintaining modularity.

  3. Strong theoretical analysis

The paper is theoretically grounded, providing clear generalization bounds via Gaussian complexity analysis. The bounds support the core claim that multi-subspace adaptation leads to improved generalization compared to global low-rank methods. This theoretical support enhances the credibility and rigor of the proposed method.

  4. Solid empirical performance across domains

AdaMSS demonstrates strong and consistent performance improvements over existing PEFT methods such as LoRA, PiSSA, and RandLoRA across a wide range of benchmarks. Notably, it achieves these gains with significantly fewer trainable parameters, which makes it especially suitable for large-scale or resource-constrained applications.

Weaknesses

  1. Sufficiency of empirical evidence for subspace structure

Figure 2 is a good illustrative starting point, but on its own, it's not sufficient evidence to claim that pretrained models broadly exhibit a multi-subspace structure. More layer-wise or task-relevant analysis would be needed to make the case compelling.

  2. Lack of comparative baselines for adaptive subspace selection

The proposed adaptive budget allocation strategy, which freezes or updates subspaces based on gradient magnitude and variance, is intuitive. However, it would be useful to see direct comparisons with simpler or alternative selection strategies, such as random subspace selection or allocating budget based on principal singular directions within each cluster. This would help isolate the specific benefit of the adaptive mechanism.

  3. Sensitivity to preprocessing and hyperparameters

The method introduces several preprocessing steps and hyperparameters, such as the number of subspaces $K$, clustering thresholds, and gradient-based pruning schedules. It remains unclear how sensitive the method is to these choices or how automated the process is in practice. More guidance or ablation studies would be helpful, especially for practitioners looking to apply the method to new domains.

Writing Clarity

The paper can be somewhat difficult to follow in parts. The writing is dense, particularly in Sections 3.1 and 3.2, where the mathematical derivation of subspace segmentation and reparameterization could be broken down more clearly. The interplay between different components is not always intuitive on first read. More visual aids or examples could significantly improve clarity for readers unfamiliar with subspace learning techniques.

Questions

Please check the weaknesses section. I will increase the score if the concerns are resolved.

Limitations

Please check the weaknesses section.

Final Justification

I have read the thorough rebuttal by the authors. The paper has strong theoretical analysis with solid performance across the board.

Formatting Issues

no

Author Response

We sincerely appreciate the reviewer’s recognition of our work and the constructive feedback and insightful questions provided. Below, we address the concerns and questions raised.


R1. Sufficiency of empirical evidence for subspace structure

Thank you so much for your constructive suggestions. In addition to Figure 2, we will include more visualizations of empirical evidence for multi-subspace structures across layers and tasks in the revised manuscript. Owing to space constraints, we provide partial numerical evidence for multi-subspace structures across different layers here.

In addition to the approximate block-diagonal patterns shown in Figure 2, another way to observe the distribution of weight column vectors is through the singular values of the Laplacian matrix constructed from the principal components. In Table 1, we show the singular value distribution of the Laplacian matrix computed from the principal components of the query weight matrices at each layer of a pretrained ViT-Large model. From Table 1, we make the following observation: for most layers, the first 900 singular values remain significantly large, while the trailing values are very close to zero (e.g., below 0.01).

The number of near-zero singular values of the Laplacian matrix can be interpreted as an estimate of the number of disjoint subspaces within the weight space. This provides an additional numerical evidence of the multi-subspace structures within the principal components of pretrained weights.
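For concreteness, the estimate can be computed along the following lines; this is a simplified sketch in which the affinity between columns (cosine similarity of principal-component coordinates) is a placeholder rather than the exact construction used in the paper.

```python
import torch

@torch.no_grad()
def estimate_num_subspaces(W: torch.Tensor, r: int = 64, tau: float = 0.01) -> int:
    """Count near-zero singular values of a normalized Laplacian built from the
    principal components of W; that count is the subspace-number estimate."""
    _, _, Vh = torch.linalg.svd(W, full_matrices=False)
    P = torch.nn.functional.normalize(Vh[:r, :].T, dim=1)   # column coordinates in top-r directions
    A = (P @ P.T).abs()                                     # symmetric affinity between columns
    d = A.sum(dim=1)
    L = torch.eye(A.shape[0]) - A / torch.sqrt(d[:, None] * d[None, :])  # normalized Laplacian
    sigma = torch.linalg.svdvals(L)
    return int((sigma < tau).sum())                         # near-zero values ≈ number of subspaces
```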

Table 1: Singular value distribution of the Laplacian matrices constructed from the principal components of the query weight matrices at each layer of the pretrained ViT-Large model.

| Layer index | $\sigma_1$ | $\sigma_{101}$ | $\sigma_{201}$ | $\sigma_{301}$ | $\sigma_{401}$ | $\sigma_{501}$ | $\sigma_{601}$ | $\sigma_{701}$ | $\sigma_{801}$ | $\sigma_{901}$ | $\sigma_{1001}$ | $\sigma_{1024}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.00 | 0.87 | 0.78 | 0.59 | 0.38 | 0.21 | 0.09 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3 | 1.00 | 0.83 | 0.78 | 0.72 | 0.65 | 0.57 | 0.47 | 0.37 | 0.24 | 0.10 | 0.00 | 0.00 |
| 5 | 1.00 | 0.83 | 0.81 | 0.79 | 0.77 | 0.74 | 0.70 | 0.64 | 0.54 | 0.38 | 0.05 | 0.00 |
| 7 | 1.00 | 0.82 | 0.81 | 0.79 | 0.77 | 0.74 | 0.71 | 0.67 | 0.60 | 0.47 | 0.01 | 0.00 |
| 9 | 1.00 | 0.82 | 0.81 | 0.79 | 0.78 | 0.76 | 0.74 | 0.71 | 0.67 | 0.56 | 0.00 | 0.00 |
| 11 | 1.00 | 0.83 | 0.81 | 0.80 | 0.79 | 0.77 | 0.75 | 0.73 | 0.68 | 0.55 | 0.00 | 0.00 |
| 13 | 1.00 | 0.83 | 0.81 | 0.80 | 0.78 | 0.76 | 0.74 | 0.71 | 0.66 | 0.52 | 0.00 | 0.00 |
| 15 | 0.99 | 0.81 | 0.79 | 0.77 | 0.76 | 0.73 | 0.71 | 0.67 | 0.63 | 0.53 | 0.00 | 0.00 |
| 17 | 1.00 | 0.82 | 0.80 | 0.79 | 0.77 | 0.76 | 0.74 | 0.71 | 0.68 | 0.60 | 0.00 | 0.00 |
| 19 | 1.00 | 0.82 | 0.81 | 0.80 | 0.78 | 0.77 | 0.76 | 0.74 | 0.71 | 0.66 | 0.00 | 0.00 |
| 21 | 1.00 | 0.82 | 0.81 | 0.80 | 0.80 | 0.79 | 0.78 | 0.77 | 0.75 | 0.72 | 0.00 | 0.00 |
| 23 | 1.00 | 0.83 | 0.82 | 0.81 | 0.80 | 0.80 | 0.79 | 0.78 | 0.77 | 0.75 | 0.12 | 0.00 |

R2. Lack of comparative baselines for adaptive subspace selection

Table 2: Comparing the performance of different adaptive allocation scheme on ViT-Large model.

| Methods \ Dataset | StanfordCars | CIFAR100 | FGVC |
|---|---|---|---|
| Importance-score-based adaptive budget allocation | 85.24±0.3 | 93.22±0.01 | 64.31±0.4 |
| Random adaptive selection | 83.63±0.30 | 93.11±0.10 | 59.36±0.72 |
| Adaptive selection based on the $\ell_1$ norm of each subspace's parameters | 79.89±0.41 | 92.00±0.21 | 40.77±1.13 |

Thank you so much for your constructive suggestions. We agree that including comparisons against alternative strategies will help clarify the advantages of our adaptive allocation scheme.

In the revised manuscript, we include comparisons with the following two adaptive allocation strategies:

  • (1) Random adaptive selection, as suggested by the reviewer.
  • (2) $\ell_1$-norm-based adaptive budget allocation, where the budget is assigned to the subspaces with the highest $\ell_1$-norm, although we currently do not directly implement allocation based on principal singular directions within each cluster as suggested.

As shown in Table 2, $\ell_1$-norm-based adaptive budget allocation and random adaptive selection lead to significant performance degradation, while importance-score-based adaptive budget allocation achieves the best performance among all compared methods.


R3. Sensitivity to hyperparameters and preprocessing

Thank you so much for your constructive suggestions. In the current version of the manuscript, we have already compared strategies with and without subspace number estimation, as well as the effect of using different fixed values of $K$.

Following your suggestion, we have reorganized our existing ablation studies and added new ablations from the following two aspects:

  1. Hyperparameters for subspace segmentation, including the thresholds $\tau$ and $K_0$ used for estimating the number of subspaces $K$;
  2. Hyperparameters for adaptive budget allocation, including the target number of subspaces $K_{\text{target}}$ and the decay exponent $\rho$ used in the proposed multi-subspace adaptive budget allocation scheme.

Below we summarize the findings of these new ablation studies:

  • (1) Thresholds $\tau$ and $K_0$ for subspace estimation: The threshold $\tau$ is used to detect near-zero singular values in the normalized Laplacian matrix and is set to a small value; in our experiments, we use $\tau = 0.01$. The parameter $K_0$ serves as a lower bound on the estimated number of subspaces. In Tables 3–5, we evaluate $\tau \in \{0.001, 0.01, 0.05, 0.10, 0.15, 0.20\}$ and $K_0 \in \{1, 5, 10, 15, 20\}$ using ViT-Large with AdaMSS$_{base}$ (without adaptive budget allocation).

    The results show that when $K_0 \geq 10$, AdaMSS$_{base}$ exhibits robust performance across all three datasets for different $\tau$ and $K_0$.

  • (2) Target number of subspaces $K_{\text{target}}$ and decay exponent $\rho$: In the adaptive budget allocation scheme, the number of trainable subspaces is gradually reduced to $K_{\text{target}}$ according to a smooth cubic decay schedule (default $\rho = 3$); a larger $\rho$ indicates a faster decay in the number of active subspaces (see the sketch of this schedule after the list). In Tables 6–8, we evaluate the sensitivity to $\rho \in \{1, 2, 3, 4, 5\}$ and $K_{\text{target}} \in \{100, 200, 300, 400, 500\}$ using ViT-Large with AdaMSS. The results indicate that AdaMSS is robust across this range of $\rho$ and $K_{\text{target}}$.
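A sketch of the schedule form referred to in point (2) above (the standard polynomial-decay budget schedule, where $\rho = 3$ gives the cubic case; the exact schedule is defined in the paper):

```python
def num_active_subspaces(step: int, total_steps: int,
                         k_init: int, k_target: int, rho: float = 3.0) -> int:
    """Polynomial decay of the trainable-subspace budget from k_init down to k_target.
    Illustrative form consistent with the description above; see the paper for the
    exact schedule."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(k_target + (k_init - k_target) * (1.0 - frac) ** rho))
```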

Table 3: Results of AdaMSS$_{base}$ under different hyperparameter settings for StanfordCars.

| $\tau$ \ $K_0$ | 1 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|
| 0.001 | 84.25±0.23 | 85.40±0.19 | 85.40±0.22 | 85.38±0.25 | 85.67±0.42 |
| 0.01 | 84.12±0.24 | 85.20±0.50 | 85.31±0.27 | 85.41±0.13 | 85.50±0.21 |
| 0.05 | 84.77±0.16 | 85.02±0.26 | 85.23±0.28 | 85.59±0.27 | 85.70±0.24 |
| 0.10 | 84.58±0.29 | 84.93±0.25 | 85.60±0.26 | 85.36±0.29 | 85.42±0.23 |
| 0.15 | 84.72±0.39 | 85.12±0.23 | 85.38±0.33 | 85.61±0.20 | 85.63±0.45 |
| 0.20 | 84.57±0.09 | 84.86±0.17 | 85.53±0.21 | 85.39±0.34 | 85.65±0.19 |

Table 4: Results of AdaMSS$_{base}$ under different hyperparameter settings for CIFAR100.

| $\tau$ \ $K_0$ | 1 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|
| 0.001 | 93.27±0.09 | 93.44±0.16 | 93.47±0.06 | 93.57±0.12 | 93.48±0.15 |
| 0.01 | 93.38±0.08 | 93.33±0.10 | 93.50±0.10 | 93.53±0.15 | 93.57±0.05 |
| 0.05 | 93.42±0.12 | 93.43±0.12 | 93.51±0.09 | 93.45±0.10 | 93.56±0.09 |
| 0.10 | 93.40±0.12 | 93.47±0.09 | 93.58±0.13 | 93.45±0.12 | 93.64±0.10 |
| 0.15 | 93.43±0.12 | 93.43±0.14 | 93.52±0.06 | 93.55±0.06 | 93.60±0.12 |
| 0.20 | 93.43±0.10 | 93.54±0.05 | 93.48±0.08 | 93.53±0.04 | 93.51±0.11 |

Table 5: Results of AdaMSS$_{base}$ under different hyperparameter settings for FGVC.

| $\tau$ \ $K_0$ | 1 | 5 | 10 | 15 | 20 |
|---|---|---|---|---|---|
| 0.001 | 61.36±1.12 | 65.33±0.55 | 66.11±0.53 | 66.62±0.78 | 66.98±0.73 |
| 0.01 | 61.90±0.55 | 64.58±0.69 | 65.27±0.64 | 66.10±0.69 | 66.60±0.68 |
| 0.05 | 61.64±0.22 | 64.63±0.98 | 65.51±1.05 | 66.63±0.48 | 67.44±0.68 |
| 0.10 | 62.42±0.33 | 64.49±0.87 | 65.88±0.68 | 66.29±0.21 | 67.16±0.82 |
| 0.15 | 62.56±0.73 | 64.90±1.02 | 65.72±1.04 | 66.86±0.30 | 66.44±0.92 |
| 0.20 | 62.00±0.51 | 64.76±0.78 | 65.68±0.93 | 66.65±0.41 | 66.86±0.60 |

Table 6: Results of AdaMSS under different hyperparameter settings for StanfordCars.

| $\rho$ \ $K_{target}$ | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| 1 | 84.78±0.32 | 85.06±0.16 | 85.34±0.13 | 85.35±0.11 | 85.16±0.12 |
| 2 | 84.88±0.36 | 85.11±0.19 | 85.21±0.33 | 85.26±0.20 | 85.21±0.41 |
| 3 | 85.03±0.29 | 84.91±0.24 | 85.21±0.39 | 85.06±0.20 | 85.02±0.18 |
| 4 | 84.69±0.20 | 84.95±0.13 | 85.03±0.29 | 85.02±0.21 | 85.19±0.13 |
| 5 | 84.89±0.27 | 84.67±0.25 | 85.04±0.25 | 85.16±0.26 | 85.37±0.36 |

Table 7: Results of AdaMSS under different hyperparameter settings for CIFAR100.

| $\rho$ \ $K_{target}$ | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| 1 | 93.47±0.10 | 93.49±0.11 | 93.41±0.14 | 93.56±0.10 | 93.34±0.12 |
| 2 | 93.41±0.14 | 93.44±0.07 | 93.54±0.08 | 93.42±0.13 | 93.47±0.07 |
| 3 | 93.42±0.14 | 93.38±0.05 | 93.56±0.09 | 93.46±0.14 | 93.55±0.08 |
| 4 | 93.25±0.11 | 93.53±0.09 | 93.40±0.10 | 93.48±0.15 | 93.50±0.07 |
| 5 | 93.35±0.15 | 93.45±0.11 | 93.37±0.09 | 93.41±0.12 | 93.41±0.12 |

Table 8: Results of AdaMSS under different hyperparameter settings for FGVC.

| $\rho$ \ $K_{target}$ | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| 1 | 64.76±0.55 | 65.56±0.60 | 64.98±0.81 | 64.90±0.98 | 65.73±0.48 |
| 2 | 64.55±0.73 | 64.38±0.83 | 65.14±0.89 | 65.11±0.57 | 64.99±0.70 |
| 3 | 63.73±0.31 | 64.16±1.22 | 64.77±0.34 | 65.33±0.64 | 65.77±0.21 |
| 4 | 63.02±0.89 | 64.35±0.62 | 64.64±0.81 | 64.54±0.47 | 65.13±0.61 |
| 5 | 62.74±1.53 | 63.43±0.63 | 65.12±0.83 | 64.28±0.99 | 65.32±0.73 |

R4. Writing clarity and visual explanation

Thank you for highlighting this issue. We will revise Sections 3.1 and 3.2 by simplifying the notation and derivations, and by adding visual illustrations for the proposed methods.
We hope these revisions will enhance the clarity and readability of our work.


We thank the reviewer again for the thoughtful questions. Please let us know if any aspects remain unclear—we would be truly grateful for the opportunity to further clarify.

Comment

Thanks for the rebuttal which solved my concerns. I will increase my score.

Comment

We sincerely thank the reviewer for the kind follow-up and for increasing the score. We are truly grateful for the constructive feedback.

Review
Rating: 5

The paper introduces AdaMSS, a new parameter-efficient fine-tuning (PEFT) method for large language models. Rather than using fixed low-rank adapters (like LoRA), AdaMSS selects a set of low-rank subspaces adaptively with a learned budget controller. The method is supported by solid theoretical analysis and evaluated across NLP and vision tasks. It shows promise in reducing trainable parameters while maintaining performance.

Strengths and Weaknesses

Strengths:

  • The paper introduces a novel idea and provides a strong theoretical foundation.
  • It provides an extensive experimental study to demonstrate the viability of the approach across multiple domains, including LLMs and vision transformers.
  • The approach achieves significant parameter efficiency while maintaining high performance comparable to SOTA methods.

Weaknesses:

  • The paper does not include performance metrics such as GPU memory use or training speed, which are key for PEFT.
  • Some of the newer models, such as LLaMA3, DeepSeek, or Gemini, are not included in the experimental study.

Questions

  • How well does this method scale to long-context or multi-step reasoning tasks?
  • Can it be extended with positional tuning approaches like RoCoFT?

Limitations

Yes

Final Justification

The paper presents a novel idea with a good theoretical foundation. The authors responded and addressed my concerns. They plan to provide a revised version with the additional results and discussion.

Formatting Issues

N/A

Author Response

We sincerely appreciate the reviewer’s recognition of our work and the constructive feedback and insightful questions provided. Below, we address the concerns and questions raised.


R1. Performance metrics such as GPU memory or training speed

Thank you very much for pointing this out. In the revised version, we have added a comparison of different PEFT methods in terms of both memory usage and training time. Since our method uses a smaller number of trainable parameters, it demonstrates superior memory efficiency, as shown in Tables 1 and 2.

Regarding training speed, as discussed in the paper, the computational complexity of our gradient updates is of the same order as that of LoRA and PiSSA when $r = \sum_{k=1}^{K} r_k$, where $r$ denotes the hyperparameter used in LoRA and PiSSA. However, in practice, the multi-subspace-based adaptive budget allocation allows for faster training by selectively updating only the most important subspaces (see Tables 3 and 4).

In addition, the proposed multi-subspace-based adaptive budget allocation provides greater flexibility in balancing training time and performance. In our paper, the number of trainable subspaces is gradually reduced to a target value $K_{target}$ following a smooth decay schedule with decay exponent $\rho = 3$; a larger $\rho$ leads to a faster reduction in the number of trainable subspaces. Table 5 presents an ablation study demonstrating how different choices of $\rho$ affect training speed and final performance.
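As a rough guide to how the numbers in Tables 1 and 2 scale with the trainable-parameter count, an Adam-style optimizer keeps a small number of fp32 state tensors per trainable parameter; the back-of-envelope estimate below is illustrative only (measured values also depend on additional buffers and allocator granularity).

```python
def optimizer_state_mb(n_trainable: int, states_per_param: int = 2,
                       bytes_per_value: int = 4) -> float:
    """Rough fp32 optimizer-state footprint, e.g. Adam's exp_avg and exp_avg_sq.
    Illustrative estimate only; measured numbers may include extra buffers."""
    return n_trainable * states_per_param * bytes_per_value / 1024**2

# e.g. optimizer_state_mb(45_000) ≈ 0.34 MB for roughly 45K trainable parameters
```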

Table 1: Optimizer memory consumption (MB) of different PEFT methods on ViT-Large using fp32 precision.

| Method | LoRA $(r=16)$ | PiSSA $(r=8)$ | LoRETTA $(r=5)$ | AdaMSS$_{base}$ $(r_k=1)$ |
|---|---|---|---|---|
| Memory (MB) | 18.84 | 9.6 | 1.584 | 2.136 |

Table 2: Optimizer memory consumption (MB) of different PEFT methods on Roberta-Large using fp32 precision.

| Method | LoRA $(r=8)$ | PiSSA $(r=8)$ | LoRETTA $(r=5)$ | AdaMSS $(r_k=1)$ |
|---|---|---|---|---|
| Memory (MB) | 9.6 | 9.6 | 1.58 | 0.54 |

Table 3: Average training time (in seconds) of different PEFT methods on ViT-Large.

| Datasets \ Methods | LoRA | PiSSA | LoRETTA | AdaMSS $(r_k=3, \sum_{k=1}^{K} r_k \geq 30)$ |
|---|---|---|---|---|
| FGVC | 1207.7669 | 1206.7017 | 1057.7328 | 1074.9871 |

Table 4: Average training time (in seconds) of different PEFT methods on Roberta-Large for natural language understanding task (STS-B).

| Method | LoRA $(r=8)$ | PiSSA $(r=8)$ | LoRETTA $(r=5)$ | AdaMSS $(r_k=1, \sum_{k=1}^{K} r_k \geq 10)$ |
|---|---|---|---|---|
| Avg. Time (s) | 5325.77 | 5174.04 | 5441.47 | 4972.03 |

Table 5: Training time (in seconds) and performance (PCC) of AdaMSS with varying ρ\rho on Roberta-Large for natural language understanding task (STS-B).

| $\rho$ | 5 | 10 | 15 |
|---|---|---|---|
| Avg. Time (s) | 4913.63 | 4780.97 | 4667.35 |
| PCC | 91.66±0.03 | 91.52±0.02 | 91.35±0.03 |

R2. Experimental study on newer models (LLaMA3, DeepSeek, Gemini)

Thank you so much for the valuable comments. Our current experiments cover a range of models, including ViT (Base/Large), RoBERTa (Base/Large), LLaMA 2-7B, Mistral-7B, and Gemma-7B. Among them, LLaMA 2-7B and Mistral-7B were released in 2023, while Gemma-7B was released in 2024.

Due to the time constraints of the rebuttal, we were unable to include results on LLaMA 3 and DeepSeek in this version. However, we are actively working on these experiments and will include a more comprehensive comparison with newer models in the revised version.


R3. Scalability to long-context or multi-step reasoning tasks

Thank you for the question. Our method is scalable to long-context and multi-step reasoning tasks. This is supported by our experimental results on GSM8K and MATH — two widely recognized benchmarks for multi-step symbolic and numerical reasoning. Compared to other methods, our approach achieves higher accuracy while requiring fewer trainable parameters, demonstrating both its efficiency and effectiveness in handling complex reasoning scenarios. The detailed results are provided below for reference.

Table 6: Performance of different PEFT methods on LLaMA 2-7B.

| Method | Trainable Parameters | GSM8K | MATH |
|---|---|---|---|
| Full FT | 6738M | 49.05 | 7.22 |
| LoRA | 320M | 42.30 | 5.50 |
| PiSSA $(r=8)$ | 19M | 44.11 | 5.84 |
| LoRA-PRO $(r=8)$ | 19M | 46.61 | 6.40 |
| AdaMSS$_{base}$ $(r_k=3)$ | 4M | 51.10 | 7.57 |
| AdaMSS $(r_k=3)$ | 4M | 50.80 | 7.22 |

Table 7: Performance of different PEFT methods on Mistral-7B.

| Method | Trainable Parameters | GSM8K | MATH |
|---|---|---|---|
| Full FT | 7242M | 67.02 | 18.60 |
| LoRA | 168M | 67.70 | 19.68 |
| PiSSA $(r=8)$ | 20M | 71.00 | 20.40 |
| LoRA-PRO $(r=8)$ | 20M | 69.59 | 19.17 |
| AdaMSS$_{base}$ $(r_k=3)$ | 4M | 70.71 | 20.44 |
| AdaMSS $(r_k=3)$ | 2M | 70.74 | 19.47 |

Table 8: Performance of different PEFT methods on Gemma-7B.

| Method | Trainable Parameters | GSM8K | MATH |
|---|---|---|---|
| Full FT | 8538M | 71.34 | 22.74 |
| LoRA | 200M | 74.90 | 31.28 |
| PiSSA $(r=8)$ | 25M | 75.48 | 29.59 |
| LoRA-PRO | 25M | 75.90 | 29.25 |
| AdaMSS$_{base}$ $(r_k=3)$ | 6M | 75.33 | 29.73 |
| AdaMSS $(r_k=3)$ | 4M | 76.41 | 28.64 |

R4. Compatibility with positional tuning (e.g., RoCoFT)

We thank the reviewer for this insightful question. Indeed, our framework could potentially be integrated with positional tuning methods such as RoCoFT.

One preliminary yet intuitive idea is to apply positional tuning directly to the low-rank representation $\mathbf{Z}$, especially given that $\mathbf{Z}$ often exhibits an approximately block-diagonal structure. Another promising direction, inspired by AdaMSS, is to leverage subspace segmentation to guide RoCoFT in deciding which columns should be updated and which can be frozen, potentially enabling adaptive budget allocation during training.

We will include this discussion in the Future Work section of the revised version.


We thank the reviewer again for the thoughtful questions. Please let us know if any aspects remain unclear—we would be truly grateful for the opportunity to further clarify.

Comment

Dear Reviewer J6fY,

Thank you again for your valuable feedback on our submission.

We have submitted a detailed rebuttal addressing your comments, including:

  • a comparison of training time and memory usage across different methods (Tables 1–5);
  • an update on newer models, including coverage of Mistral-7B (2023) and Gemma-7B (2024) (Tables 7–8), with additional comparisons (e.g., LLaMA 3 (2024) and DeepSeek) currently in progress for inclusion in the revised version;
  • a discussion on scalability to long-context and multi-step reasoning tasks (Tables 6–8);
  • a discussion on compatibility with positional tuning approaches.

Your recognition would mean a lot to us.

Best regards,
The Authors

Comment

Hi Reviewer J6fY,

The authors have provided a detailed response; do they sufficiently address your concerns on baseline comparison and computational efficiency?

Best, AC

Final Decision

AdaMSS is a novel PEFT method that segments pretrained weights into multiple low-rank subspaces, overcoming the limited expressiveness of single-subspace approaches. This multi-subspace design enables richer adaptation without significantly increasing trainable parameters, striking a better balance between expressiveness and efficiency. The method further introduces adaptive freezing to focus updates on the most important subspaces and provides a theoretical generalization bound with lower Gaussian complexity. Experiments across vision and NLP tasks confirm strong gains over existing PEFT methods while using fewer parameters.

The rebuttal addressed concerns with added efficiency metrics, expanded evaluation (Mistral-7B, Gemma-7B), stronger ablations, and clarified theory. While exposition could be clearer and results on the very latest models are pending, the contributions are timely, impactful, and well-supported, making the work valuable to the PEFT and efficient LLM communities and likely to inspire further research on structured subspace-based fine-tuning.