PaperHub
4.9/10 · Poster · 4 reviewers (min 2, max 3, std 0.4)
Individual ratings: 3, 3, 2, 3
ICML 2025

TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-08-15

Abstract

Keywords
Continual Learning, Supervised Fine-Tuning, Large Pre-trained Models, Low-Rank Adaptation, Hierarchical Task Structure

Reviews and Discussion

Review (Rating: 3)

This paper proposes a novel continual learning approach, TreeLoRA (K-D Tree of Low-Rank Adapters), which exploits hierarchical gradient similarity to build layer-wise adapters for efficient CL. To achieve even greater efficiency, the authors develop a confidence-lower-bound-based bandit technique to efficiently explore the task structure. In addition, the authors provide theoretical analyses to demonstrate the validity of the proposed approach.

Questions for Authors

Question:

  1. Regarding the use of LCB to calculate the similarity between tasks especially in transformer-based models, is it computed only at the last layer of the model, or is each layer calculated individually? I noticed that the figures in the paper seem to indicate that only the last layer is calculated, so why not compute each layer separately, given that the features learned at each layer of the model are different?

  2. The authors mention that the setting of the threshold $\delta$ does not need to be done manually and is done dynamically; why and how is this done?

Claims and Evidence

The claims made by the authors are well-supported by clear theoretical proofs and experimental results, which effectively validate their assertions.

Methods and Evaluation Criteria

The proposed methods are effective in addressing the problem outlined in the paper.

Theoretical Claims

I have checked the proofs provided by the authors and found no obvious problems.

Experimental Design and Analysis

See Weaknesses (2).

Supplementary Material

I have reviewed the supplementary material provided by the authors, including the code and additional proofs.

Relation to Broader Scientific Literature

The research is a study of base model capabilities, with potential implications for the broader scientific literature.

Essential References Not Discussed

N/A.

Other Strengths and Weaknesses

Weaknesses:

  1. The definition of $f_i(w_j)$ does not conform to common notation. It is recommended to swap $i$ and $j$ to write it as $f_j(\mathcal{T}_i)$, or consider using $\theta_j$ to represent the model parameters. This would enhance the clarity of the paper.

  2. In the methods compared in this paper, it seems that there is a lack of comparison with some recent advanced continual learning methods [1, 2], which might have better performance than the approach proposed in this paper.

[1] Zhao, W., et al. Sapt: A shared attention framework for parameter-efficient continual learning of large language models. In ACL, 2024.

[2] Feng, Y., et al. Tasl: Task skill localization and consolidation for language model continual learning. In ACL, 2024.

Other Comments or Suggestions

Typos: I'm not sure if this is a typo, but I noticed that the variable $j$ in Equation (1) seems to be unnecessary, as it does not appear to be defined.

Author Response

We sincerely appreciate the reviewer's constructive feedback. In the following, we respond to each question.


Q1. "Regarding the use of LCB to calculate the similarity between tasks especially in transformer-based models, is it computed only at the last layer of the model, or is each layer calculated individually? I noticed that the figures in the paper seem to indicate that only the last layer is calculated, so why not compute each layer separately, given that the features learned at each layer of the model are different?"

A1. Thank you for your question. We would like to clarify that the LCB calculation in TreeLoRA is indeed performed layer by layer across the entire model. As described in Equation (2) in our paper, the LCB is computed as follows:

$$\mathrm{LCB}_k = \begin{cases} \widehat{\mu}_k - 2\sqrt{\dfrac{\log t}{n_k}}, & \text{if } k \in \mathcal{L} \\[6pt] \max\left\{ \min_{j \in \mathcal{C}} \left\{ \widehat{\mu}_j - 2\sqrt{\dfrac{\log t}{n_j}} - \delta \right\} \right\}, & \text{if } k \notin \mathcal{L} \end{cases}$$

where $\widehat{\mu}_k = \frac{1}{|\mathrm{Select}_k|} \sum_{\tau \in \mathrm{Select}_k} \widehat{\xi}_{\tau}^{k}$ is the estimated task similarity between the current task and the $k$-th task group (i.e., the nodes in the branch of the selected leaf node at round $t$), $\mathcal{L}$ is the set of all leaf nodes, $\delta$ is the automatically determined threshold, and $\mathcal{C}$ is the set of child nodes of the $k$-th node. Therefore, the LCB is computed for each layer of the model. By calculating the LCB layer by layer, TreeLoRA captures the similarity between tasks at various levels throughout the model hierarchy, as illustrated in Figure 1, allowing us to better capture hierarchical task similarities, which is especially advantageous in transformer-based models. We will add these details in the revised version of the paper to provide further clarity. Thank you again for your valuable question!
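To make the per-node computation concrete, here is a rough Python sketch (an illustration only, not the authors' implementation; the dictionary fields are hypothetical, and the outer max in Eq. (2) is left implicit for the single candidate set shown):

```python
import math

def lcb(node, t):
    """Sketch of the Eq. (2) lower confidence bound for one tree node.

    node: dict with hypothetical keys
      mu_hat   - estimated similarity of the current task to this node,
      n        - number of times this node has been selected,
      children - list of child node dicts (empty for a leaf),
      delta    - the automatically determined split threshold.
    t: current bandit round.
    """
    if not node["children"]:  # k in L: leaf node, plain LCB
        return node["mu_hat"] - 2.0 * math.sqrt(math.log(t) / node["n"])
    # k not in L: internal node, bounded through its children,
    # relaxed by the split threshold delta
    return min(
        c["mu_hat"] - 2.0 * math.sqrt(math.log(t) / c["n"]) - node["delta"]
        for c in node["children"]
    )
```

Selecting the leaf with the smallest LCB (smallest estimated gradient distance) then drives the bandit-style exploration of the task tree described in the reply.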


Q2. "The authors mention that the setting of the threshold $\delta$ does not need to be done manually and is done dynamically, why and how is this done?"

A2. Thanks for your comment. Inspired by the K-D tree data structure [Bentley, 1990], the threshold $\delta$ does not require manual tuning. Specifically, during the construction of the K-D tree after each task, the gradient space is partitioned based on the distribution of task gradients. At each split, the threshold is computed by taking the median of the similarity (L1-norm) between each task gradient and the mean gradient within the corresponding task group. This approach ensures balanced tree growth and adaptive partitioning of the gradient space, without the need for manual threshold adjustments.
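A minimal sketch of this median-based split (our own illustration, not the released code; `task_grads` holds flattened per-task gradient vectors):

```python
import numpy as np

def split_threshold(task_grads):
    """Compute the split threshold delta for one task group (sketch).

    task_grads: (num_tasks, dim) array of per-task gradient vectors.
    Returns (delta, left, right): the median L1 distance to the group's
    mean gradient, and the indices of the resulting balanced partition.
    """
    mean_grad = task_grads.mean(axis=0)
    # L1-norm distance of each task gradient to the group mean
    dists = np.abs(task_grads - mean_grad).sum(axis=1)
    delta = np.median(dists)
    left = np.where(dists <= delta)[0]   # tasks closer to the mean
    right = np.where(dists > delta)[0]   # tasks farther from the mean
    return delta, left, right
```

Because the median is used, each child receives roughly half of the tasks, which is exactly the balanced-growth property the reply appeals to.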


Q3. "definition of $f_i(w_j)$ does not conform to common notation. It is recommended to swap $i$ and $j$ to write it as $f_j(\mathcal{T}_i)$, or consider using $\theta_j$ to represent the model parameters"

A3. Thanks for your comment. We will revise our paper accordingly and use clearer notations, which would enhance the clarity of the paper. Thanks again for your feedback.


Q4. "it seems that there is a lack of comparison with some recent advanced continual learning methods [1, 2]"

A4. Thank you for pointing out these two references. Following your suggestions, we add a comparison with these two recent advanced continual learning methods, SAPT [Zhao et al., ACL 2024] and TASL [Feng et al., ACL 2024]. For a fair comparison, we do not employ the generative replay in SAPT. The results, using meta / LLaMA-2-7B-Chat as the foundation model, are presented in the table below:

| Metric | FIX | SeqLoRA | OGD | GEM | EWC | L2P | DualPrompt | HiDeLoRA | O-LoRA | SAPT | TASL | TreeLoRA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Op (%) | 38.94 | 34.30 | 42.09 | 40.08 | 42.36 | 36.23 | 37.69 | 41.60 | 42.78 | 42.93 | 43.19 | 43.52 |
| BWT (%) | - | 18.50 | 8.06 | 6.77 | 5.97 | 8.25 | 8.03 | 7.12 | 7.16 | 5.49 | 4.58 | 3.46 |
| Time (s) | - | 1132 | 6416 | 7385 | 5028 | 3899 | 9121 | 1286 | 1293 | 1205 | 1185 | 485 |

We also add a comparison with another recent method, InfLoRA [Liang and Li, CVPR 2024]; please refer to A3 for Reviewer bzQv for more details. We will add these results to the revised version, and will also add discussions of the SAPT and TASL methods in the related work section.


We hope these clarifications address your concerns. Thanks again for your valuable comments.

Reviewer Comment

Thank you for providing the experiments and explanations regarding my questions and concerns. However, I still have some issues with the experimental part:

Q1: You mentioned that you did not use SAPT's generative replay for a fair comparison. Why is disabling generative replay more fair? As far as I remember, the generative replay in SAPT does not use the original data but instead uses fabricated data, which should not affect fairness.

Q2: In the training times you provided, O-LoRA is surprisingly close to SAPT's time. In my understanding, O-LoRA involves computing orthogonal structures for each layer and incorporating them into gradient calculations, which should be time-consuming. Or perhaps the authors considered that O-LoRA does not retain all of LoRA blocks.

Q3: Why is the training time reported rather than the inference time?

Author Comment

We are grateful to the reviewer for the follow-up feedback. We address each of the additional questions regarding experiments as follows.

Q1. "You mentioned that you did not use SAPT's generative replay for a fair comparison. Why is disabling generative replay more fair? As far as I remember, the generative replay in SAPT does not use the original data but instead uses fabricated data, which should not affect fairness."

A1. We thank the reviewer for the question. To clarify, the generative replay mechanism requires maintaining a pre-generated dataset of pseudo data (as observed in SAPT's codebase) or, alternatively, employing an additional generative model to produce pseudo data. In our opinion, this process introduces additional information beyond the original data stream. Therefore, we exclude this mechanism and instead adopt another strategy of storing a fixed number of data samples in a buffer. This ensures that all methods rely solely on the original data stream.

On the other hand, we also appreciate the idea of introducing generative replay in continual learning, which can be considered as a "plug-in" component. This component could be integrated into our method or O-LoRA, etc. We will conduct additional ablation studies for a more comprehensive evaluation.
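For concreteness, the fixed-size buffer strategy mentioned above could be implemented with reservoir sampling. This is our own sketch (the reply does not specify the exact sampling rule, and the class and method names are hypothetical):

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer over the original data stream (sketch)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []       # stored samples, at most `capacity`
        self.seen = 0        # number of stream elements observed so far
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: after n elements, each one remains in the
        # buffer with probability capacity / n, without storing the stream.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample
```

A buffer like this uses only samples that actually appeared in the stream, which is the fairness property the reply argues for.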


Q2. "In the training times you provided, O-LoRA is surprisingly close to SAPT's time. In my understanding, O-LoRA involves computing orthogonal structures for each layer and incorporating them into gradient calculations, which should be time-consuming. Or perhaps the authors considered that O-LoRA does not retain all of LoRA blocks."

A2. We thank the reviewer for the question. We would like to clarify that although O-LoRA requires computing orthogonal structures for each layer during training, the additional computational cost remains acceptable. This is because the orthogonal regularization across different layers can be computed in a batched and parallelized manner — treating the LoRA adapters at different layers as one concatenated matrix. This strategy is implemented in both the original O-LoRA's and our codebase.
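The batching trick described above can be sketched as follows (our illustration with made-up shapes, not O-LoRA's actual code): stacking the per-layer LoRA matrices into one 3-D array lets the cross-layer orthogonality penalty be computed with a single batched matmul instead of a Python loop over layers.

```python
import numpy as np

def ortho_reg_batched(old_A, new_A):
    """Orthogonality penalty between old and new LoRA adapters (sketch).

    old_A, new_A: (layers, r, dim) arrays holding the A-matrices of the
    frozen previous-task adapters and the trainable current-task
    adapters, one slice per layer.
    """
    # (layers, r, dim) @ (layers, dim, r) -> (layers, r, r) in one call
    overlap = np.matmul(old_A, np.transpose(new_A, (0, 2, 1)))
    # Penalize any overlap between the old and new low-rank subspaces
    return float(np.sum(np.abs(overlap)))
```

When the new adapters are orthogonal to the old ones, every entry of `overlap` is zero and the penalty vanishes; the batched form keeps the per-step overhead modest, consistent with the timing the authors report.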


Q3. "Why is the training time reported rather than the inference time?"

A3. Thank you for your comment. In this paper, one of the key contributions of our proposed TreeLoRA is to explore the task structure in order to facilitate adaptation to new tasks by leveraging task-shared knowledge, therefore decreasing the training time and enhancing the efficiency. Regarding inference, our method incurs the same time cost as other LoRA-based methods since we do not modify the inference process. While reducing the inference time is also an important problem in the LLM field, our current framework is primarily designed to address the challenges associated with task adaptation speed and training overhead, and we will consider it as an important future work.


Thanks again for your time and feedback, we hope this response addresses your concerns.

Review (Rating: 3)

TreeLoRA presents a continual learning method that enhances the efficiency of updating large pre-trained models. By integrating layer-wise LoRA with a hierarchical gradient-similarity tree, it improves knowledge retention while reducing computational costs. TreeLoRA mitigates catastrophic forgetting while maintaining efficiency in ViTs and LLMs.

Questions for Authors

N/A

Claims and Evidence

Yes, please refer to the Strengths and Weaknesses section for more details.

Methods and Evaluation Criteria

Yes, please refer to the Strengths and Weaknesses section for more details.

Theoretical Claims

Yes, please refer to the Strengths and Weaknesses section for more details.

Experimental Design and Analysis

Yes, please refer to the Strengths and Weaknesses section for more details.

Supplementary Material

Yes, please refer to the Strengths and Weaknesses section for more details.

Relation to Broader Scientific Literature

Please refer to the Strengths and Weaknesses section for more details.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths

  1. This paper presents a hierarchical gradient similarity tree for task organization, optimizing parameter updates with improved efficiency. A novel bandit-based similarity estimation reduces complexity, enhancing scalability. Sparse gradient updates further adapt TreeLoRA for ViTs and LLMs.

  2. A rigorous theoretical analysis derives tighter regret bounds than standard bandit approaches. The hierarchical structure minimizes computational overhead while preserving task knowledge, ensuring efficiency gains.

  3. Experiments on vision and language tasks demonstrate that TreeLoRA surpasses state-of-the-art methods, accelerates ViTs and LLMs, and mitigates catastrophic forgetting with reduced backward transfer.

Weaknesses

  1. The paper presents evidence for TreeLoRA’s effectiveness but lacks discussion on the stability and robustness of its tree structure over extended task sequences. A deeper analysis of its evolution in long training sequences, especially in non-stationary environments, would add valuable insight.

  2. Additionally, the memory and computational trade-offs for extreme-scale LLMs remain unexamined.

  3. While TreeLoRA is compared to other LoRA-based continual learning methods, benchmarking against non-LoRA-based strategies is limited, such as replay-based approaches. A broader comparison would better contextualize TreeLoRA’s advantages within the continual learning landscape.

Other Comments or Suggestions

N/A

Author Response

Thanks for your constructive and helpful comments. We provide our response to each question as below.


Q1. "The paper presents evidence for TreeLoRA's effectiveness but lacks discussion on the stability and robustness of its tree structure over extended task sequences. A deeper analysis of its evolution in long training sequences, especially in non-stationary environments, would add valuable insight."

A1. Thank you for your comment. Following your suggestion, we conduct additional experiments to validate the stability and robustness of TreeLoRA over long task sequences, which consist of a total of 15 tasks, including C-STANCE, FOMC, MeetingBank, Py150, ScienceQA, NumGLUE-cm, NumGLUE-ds, 20Minuten, dbpedia, amazon, yahoo, agnews, yelp, BoolQA, and QQP, using meta-llama / Llama-3.2-1B-Instruct as the foundation model. The results are summarized in the following table:

| Metric | FIX | SeqLoRA | OGD | GEM | EWC | L2P | DualPrompt | HideLoRA | O-LoRA | TreeLoRA |
|---|---|---|---|---|---|---|---|---|---|
| Op (%) | 41.32 | 40.71 | 32.52 | 35.48 | 31.46 | 41.05 | 41.29 | 42.38 | 44.02 | 45.68 |
| BWT (%) | 0.0 | 15.72 | 21.32 | 18.33 | 22.22 | 14.92 | 15.58 | 11.23 | 10.99 | 6.41 |
| Time (s) | - | 721 | 1921 | 2235 | 1305 | 840 | 341 | 1683 | 679 | 251 |

The results demonstrate that TreeLoRA maintains stable performance even with a long sequence of 15 diverse tasks, achieving higher average accuracy and lower forgetting compared to other contenders. Moreover, TreeLoRA shows even greater efficiency gains than in shorter task sequences, indicating its scalability for long-term continual learning scenarios.

Additionally, we include a figure that illustrates the evolution of the tree structure under dynamic task flow in our LLM experiment. This figure helps to better visualize how TreeLoRA adapts to the evolving task structure over time. The figure is available at the following link: https://anonymous.4open.science/r/TreeLoRA/scripts/rebuttal.jpg


Q2. "the memory and computational trade-offs for extreme-scale LLMs remain unexamined"

A2. Thanks for your comment. In our paper, we validate our method using small ViT models as well as large language models (1B, 2B, and 7B), covering models commonly used in the research community [Wang et al., 2023a, Wang et al., 2023b, Dou et al., 2024]. To further investigate the computational trade-offs for extreme-scale LLMs, we add an experiment using a 13B model (meta / Llama-2-13b-chat-hf), as shown in the following table:

| Metric | FIX | SeqLoRA | OGD | GEM | EWC | HideLoRA | O-LoRA | TreeLoRA |
|---|---|---|---|---|---|---|---|---|
| Op (%) | 40.15 | 39.16 | 42.32 | 43.77 | 41.23 | 43.27 | 44.32 | 47.13 |
| BWT (%) | 0.0 | 15.58 | 9.72 | 8.42 | 10.12 | 11.27 | 5.19 | 3.42 |
| Time (s) | - | 1525 | 87129 | 9316 | 7819 | 1835 | 1839 | 662 |

The results show that TreeLoRA achieves better accuracy while using lower training time compared to other contenders. Additionally, the memory (storage) overhead of TreeLoRA is minimal, requiring only 15 MB. These findings demonstrate the effectiveness of our tree-based adaptation strategy in both performance and efficiency aspects, and its scalability to large-scale models.


Q3. "While TreeLoRA is compared to other LoRA-based continual learning methods, benchmarking against non-LoRA-based strategies is limited, such as replay-based approaches. A broader comparison would better contextualize TreeLoRA's advantages within the continual learning landscape."

A3. Thank you for your comment. It appears there may be a misunderstanding due to our insufficient emphasis. Specifically, we have compared TreeLoRA with several non-LoRA-based continual learning strategies, including the replay-based (rehearsal-based) method GEM, the regularization-based method EWC, and the baseline OGD (a full-update, non-LoRA method). As shown in Table 3 and Table 4 of the submission PDF, TreeLoRA outperforms these methods both in terms of performance and efficiency. Additionally, TreeLoRA offers particular advantages for transformer-based models due to the large parameter size and inherent hierarchical structure of these models.


We hope these clarifications address your concerns. We will improve the paper writing to better emphasize these points. Thanks again for your constructive comments.

Reviewer Comment

Thanks for the author's rebuttal. After reading the comments from other reviewers, I will maintain my score.

Author Comment

Thank you for recognizing the novelty and theoretical soundness of our work. We also greatly appreciate your insightful feedback. We will incorporate the suggested discussions and additional experiments in the revised version. Thank you again!

Review (Rating: 2)

This paper proposes TreeLoRA, a novel and efficient approach for continual learning in large pre-trained models. TreeLoRA constructs a hierarchical tree structure of LoRAs based on gradient similarity, enabling efficient task adaptation and knowledge sharing. The method employs bandit algorithms to explore task-similarity structure and leverages sparse gradient updates to optimize parameters, demonstrating superior efficiency and performance compared to previous state-of-the-art continual learning methods.

update after rebuttal

Thank the authors for the detailed follow-up and additional experimental results. I appreciate the authors' efforts to extend the evaluation to the full 15-task benchmark and to clarify the role of LoRA depth in TreeLoRA's performance.

However, I still find some aspects unclear. Specifically, while the explanation about LoRA depth partially clarifies the observed performance drop in LLMs, it remains ambiguous how TreeLoRA itself solves the issue when the model size becomes larger. Furthermore, although the authors state that TreeLoRA's depth is independent of the number of tasks, the rationale for choosing specific depths for different models is still not clearly explained. For instance, in the LLaMA-2-7b-chat experiments, both TreeLoRA and O-LoRA perform poorly at depth 8 but significantly improve at depth 64, raising the question of whether TreeLoRA consistently outperforms O-LoRA or whether its benefits only appear under particular depths. As there is no empirical evidence indicating a linear relationship between performance and LoRA depth, the observed improvements remain difficult to interpret.

The newly added results are appreciated and add value to the submission. Moreover, the idea behind TreeLoRA represents a novel and promising research direction. However, I believe further clarification is needed regarding TreeLoRA’s consistent advantage on LoRA depths. Therefore, I maintain my original score.

Questions for Authors

  1. How does TreeLoRA handle situations where task order changes over time? Since the experiments compare with the recent baseline, O-LoRA, which explored the impact of different task orders, how does TreeLoRA perform on the different task orders in O-LoRA.
  2. How does TreeLoRA determine the depth of the K-D tree chosen for ViTs (5) and LLMs (64)? Is there any strategy for guidance? Is there any range for the depth? Does it connect with the number of tasks?
  3. In the experiment, the TreeLoRA uses image datasets CIFAR-10 as one benchmark, and I found one previous work, InfLoRA (CVPR 2024), which also utilizes this dataset and the same model ViT to conduct the experiments. How does TreeLoRA’s performance compare to this work?

[1] Liang, Yan-Shuo, and Wu-Jun Li. "Inflora: Interference-free low-rank adaptation for continual learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods make sense for the problem, but the paper does not use the same benchmark datasets as O-LoRA, even though O-LoRA is a direct and closely related baseline for this paper.

Theoretical Claims

I checked the correctness of Theorem 1.

Experimental Design and Analysis

I checked the experimental designs. Please see the questions.

Supplementary Material

I reviewed the supplementary material.

Relation to Broader Scientific Literature

As mentioned in the paper, the proposed method may help to decrease energy consumption and carbon emissions associated with training AI models, contributing to environmentally sustainable machine learning.

Essential References Not Discussed

This paper uses the image datasets CIFAR-10 and ViT but it does not compare with the similar work [InfLoRA] published in CVPR 2024, which also uses the same datasets and the same model.

[1] Liang, Yan-Shuo, and Wu-Jun Li. "Inflora: Interference-free low-rank adaptation for continual learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Other Strengths and Weaknesses

Strengths:

  1. TreeLoRA is a novel hierarchical structure that efficiently groups tasks based on gradient similarity, efficient task adaptation, and knowledge sharing.
  2. This paper provides a theoretical analysis to support the proposed method’s efficiency, demonstrating regret bounds compared to conventional methods.
  3. The proposed method achieves speed improvements while having similar or even better performance compared to current existing methods.

Weaknesses:

  1. There is limited analysis of how the performance of TreeLoRA is affected by key hyperparameters such as the tree depth and gradient similarity threshold. The authors mentioned that the tree depth is set to 5 for ViT and 64 for LLMs, but there is insufficient discussion on how the values are chosen and how sensitive the method is to the choices.
  2. The authors compare TreeLoRA with O-LoRA and other baselines, but there is insufficient analysis of how different task orderings affect performance. O-LoRA explored the impact of different task orders, it would strengthen the evaluation if using the same task orders to compare.
  3. The paper does not use the same datasets as O-LoRA, which makes the current comparison less rigorous and kind of unfair.

Other Comments or Suggestions

Include a discussion on the impact of task order on the performance of TreeLoRA.

Author Response

We sincerely appreciate the reviewer's feedback. In the following, we address each of your technical inquiries.

Q1. "impact of task order on the performance ...this paper does not use the same dataset used by O-LoRA."

A1. Thank you for your comment. First, we would like to clarify that the TRACE dataset used in our paper consists of 8 tasks, which is larger than the 4 tasks used in the O-LoRA paper. Moreover, the TRACE dataset includes a diverse set of tasks, such as text generation and code generation, whereas the datasets used in the O-LoRA paper primarily focus on classification tasks.

To further address your concern, we conduct additional experiments to validate TreeLoRA and other contenders using the same datasets (i.e., dbpedia, amazon, yahoo, and agnews) and same task orders as in O-LoRA. We use Llama-3.2-1B-Instruct as the foundation model, and convert classification tasks to text generation tasks. Results of overall performance (%)/BWT (%) indicate that TreeLoRA achieves superior performance across different orders, and also improves efficiency (about 1.5x speedup compared to O-LoRA):

| Task Order | FIX | OGD | GEM | SeqLoRA | HideLoRA | O-LoRA | TreeLoRA |
|---|---|---|---|---|---|---|---|
| Order 1 | 48.75/0.0 | 54.16/10.82 | 54.10/10.25 | 55.71/8.63 | 56.32/2.75 | 59.50/2.51 | 59.73/2.22 |
| Order 2 | 48.75/0.0 | 46.82/21.50 | 46.70/20.93 | 45.26/7.70 | 53.41/5.58 | 52.53/5.65 | 53.78/5.74 |
| Order 3 | 48.75/0.0 | 43.79/27.31 | 51.12/16.24 | 49.03/19.05 | 61.25/3.12 | 63.82/2.03 | 62.76/2.23 |
| Time (s) | - | 684 | 712 | 43 | 56 | 58 | 45 |

Further, we include an experiment involving long task sequences, which is similar to the experimental setup used in O-LoRA's paper. For more details, please refer to the A1 for Reviewer ygph. We will add these results in the revised version. Thanks again for your valuable comments.


Q2. There is limited analysis of how the performance of TreeLoRA is affected by key hyperparameters. How does TreeLoRA determine the depth of the K-D tree chosen for ViTs (5) and LLMs (64)? Is there any strategy for guidance? Is there any range for the depth? Does it connect with the number of tasks?

A2. Thanks for your comment. We detail the hyperparameter analysis of our method below:

  • Hyperparameter Sensitivity. In our paper, we provided an analysis of the sensitivity of TreeLoRA's performance to key hyperparameters, such as the regularization coefficient $\lambda$ and the learning rate $\alpha$. These results are detailed in Appendix A.6.

  • Impact of the Tree Depth. To further explore the impact of tree depth, we conducted additional experiments to validate how varying tree depth affects the model's performance, with overall performance (%) and training time shown in the table below:

| Tree Depth | CIFAR-100 (ViT) | Time (s) |
|---|---|---|
| 1 | 86.52 | 171.31 |
| 2 | 88.22 | 182.42 |
| 5 | 88.54 | 212.66 |
| 7 | 88.39 | 233.17 |

| Tree Depth | TRACE (LLM) | Time (s) |
|---|---|---|
| 8 | 21.49 | 455 |
| 16 | 22.62 | 468 |
| 32 | 38.62 | 476 |
| 64 | 43.52 | 485 |

These results show that TreeLoRA is relatively robust to the choice of tree depth. For ViT models, a tree depth of 5 provides a good balance between performance and efficiency, while for LLMs, a depth of 64 is recommended. These settings offer slightly better trade-offs between performance and efficiency.

We clarify that tree depth is not determined by the number of tasks, as a single node in the tree can contain multiple tasks, allowing the structure to scale to a large number of tasks. Additionally, the maximum tree depth should not exceed the number of transformer layers (as illustrated in Figure 1).

  • Impact of the Gradient Similarity Threshold. Regarding the threshold $\delta$, as mentioned in Section 3.3, we clarify that it is automatically determined and does not need manual adjustment. Specifically, inspired by the K-D tree data structure [Bentley, 1990], at each split the threshold is computed by taking the median of the similarity (L1-norm) between each task gradient and the mean gradient within the corresponding task group. This approach ensures balanced tree growth and adaptive partitioning of the gradient space, without the need for manual threshold adjustments.

Q3. One previous work, InfLoRA (CVPR 2024), also utilizes CIFAR-100 and the same model ViT to conduct the experiments. How does TreeLoRA's performance compare to this work?

A3. Thank you for your comment. We add experiments to directly compare the performance with InfLoRA on the CIFAR-100 dataset using ViT models:

| Metric | InfLoRA | TreeLoRA |
|---|---|---|
| Acc (%) | 85.44 | 88.54 |
| BWT (%) | 4.82 | 4.37 |
| Time (s) | 695 | 214 |

The results demonstrate that TreeLoRA achieves better accuracy and lower forgetting than InfLoRA, with lower training time. We will add these results and corresponding discussions in the revised version.


We hope these clarifications address your concerns. We sincerely wish that you can re-evaluate our paper and consider updating the score for our paper. Thank you for your time and feedback!

Reviewer Comment

Thank the authors for providing additional experiments to clarify in the rebuttal. Based on the authors' responses, I have some further concerns:

  1. The answer in Q1 "which is larger than the 4 tasks used in the O-LoRA paper" is a misleading expression since O-LoRA conducted experiments on both 4 tasks in the standard continual learning benchmark and 15 tasks in the large number of tasks.

  2. For "Impact of the Tree Depth", it seems like TreeLoRA has better robustness on ViT using CIFAR100 than LLM using TRACE. Since the tree depth has more influence on LLM accuracy and llama-3.2-1B has more parameters than ViT-B/16, it looks like TreeLoRA cannot be simply extended to LLM.

Author Comment

[New!] Thanks for your recognition of the novelty of our method and theoretical analysis. We sincerely wish the reviewer could kindly review our newly added experiments, which we believe adequately address all of your concerns.

Please do let us know if you have any additional comments (use the "edit" function). With the inclusion of additional experiments and expanded discussions on related work, we'd be deeply grateful if you could consider raising your score to further support our paper.


We thank the reviewer for the follow-up questions and we address each of your additional concerns in detail.

Q1. "The answer in Q1 'which is larger than the 4 tasks used in the O-LoRA paper' is a misleading expression since O-LoRA conducted experiments on both 4 tasks in the standard continual learning benchmark and 15 tasks in the large number of tasks."

A1. Many thanks for your further comments. We will revise the misleading expression in the next version. We'd like to clarify that our earlier choice of focusing on the standard CL benchmark was based on the following two main considerations:

  • In the O-LoRA paper, the 15-task benchmark was evaluated using the T5 model only, without including the LLaMA architecture. Since our study focuses on widely-used, decoder-only LLM structures such as LLaMA and Mistral, we prioritized the standard CL benchmark to enable a direct and fair comparison.
  • Moreover, although the 15-task setting includes a greater number of tasks, all of them are classification problems measured by accuracy. As such, the increased quantity may not necessarily suggest greater task diversity or increased difficulty compared to the standard CL benchmark.

Nonetheless, to more directly address the reviewer's concern, we have now extended our evaluation to include the full 15-task benchmark, with 3 orders (same as in the O-LoRA paper): MNLI, CB, WiC, COPA, QQP, BoolQA, RTE, IMDB, Yelp, Amazon, SST-2, DBpedia, Agnews, MultiRC, and Yahoo, using meta-llama / Llama-3.2-1B-Instruct as the foundation model. This required additional effort to adapt the new benchmark into our codebase and align it with our pipeline, which has just been completed. The results are as follows:

| Task Order | FIX | HideLoRA | O-LoRA | TreeLoRA |
|---|---|---|---|---|
| Order 4 | 52.13/0.0 | 59.44/4.33 | 59.89/4.67 | 58.45/4.98 |
| Order 5 | 52.13/0.0 | 54.49/7.52 | 57.05/4.42 | 58.12/3.31 |
| Order 6 | 52.13/0.0 | 57.26/6.98 | 58.02/4.73 | 59.00/4.12 |
| Time (s) | - | 124 | 121 | 83 |

We hope this clarification addresses your concerns. We will continue expanding experiments on these benchmarks using additional foundation models and will report comprehensive results in the revised version of the paper. Thanks!


Q2. "For 'Impact of the Tree Depth', it seems like TreeLoRA has better robustness on ViT using CIFAR100 than LLM using TRACE..."

A2. We appreciate your insightful observation. We would like to take this opportunity to clarify this phenomenon and provide additional empirical evidence to support the scalability of TreeLoRA to LLMs.

  • As illustrated in Figure 1 of the main paper, the tree depth in our method design is directly constrained by the LoRA depth, i.e., the number of layers where LoRA adapters are applied. LLMs such as LLaMA-3.2-1B or LLaMA-2-7B have significantly more layers and parameters than ViT-B/16, which naturally calls for more LoRA adapters. A shallow LoRA depth (aka tree depth) in such architectures can lead to a performance drop. Therefore, we clarify that the performance drop should not be attributed to the TreeLoRA architecture itself, but rather to the insufficient LoRA depth.

  • To support this interpretation, we conducted an additional experiment comparing with O-LoRA, a widely-acknowledged method in the field, to directly show the influence of constraining LoRA depth, both using meta-llama / LLaMA-2-7B-Chat as the foundation model:

| LoRA Depth | O-LoRA | TreeLoRA |
| --- | --- | --- |
| 8 | 21.43 | 21.49 |
| 64 | 42.78 | 43.52 |

As the table shows, both O-LoRA and TreeLoRA exhibit substantial performance drops when the LoRA depth is limited (e.g., depth = 8). When the LoRA depth increases (e.g., depth = 64), TreeLoRA performs comparably to, or even slightly better than, O-LoRA. This suggests that TreeLoRA is indeed extensible to LLMs. In practice, the default LoRA depth used in other methods, typically set to the number of layers of the LLM, suffices.

We will incorporate this discussion along with the experimental results into the revised version of the paper to provide a clearer picture of TreeLoRA's scalability.


In summary, we have

  • Extended our evaluation to include the full 15-task benchmark, with 3 orders (same as in the O-LoRA paper).
  • Provided a detailed analysis of the relationship between LoRA depth and performance, demonstrating that TreeLoRA can easily scale to large models such as LLMs.
Review
3

This paper proposes TreeLoRA, a continual learning method that builds hierarchical adapters based on gradient similarity, aiming to solve the computational-efficiency problem in continual learning of large pre-trained models (LPMs). By organizing tasks into a K-D tree structure and introducing sparse gradient updates, the method achieves better accuracy than baselines (such as HiDeLoRA) on ViT and LLM benchmarks while reducing training time by about 2.4 times. However, the core dynamic update mechanism (such as the node addition and reduction rules) is not fully explained, and the complexity comparison with mainstream parameter-efficient fine-tuning methods (such as LoRA/O-LoRA) is insufficient, which may affect the credibility of the method.

Questions To Authors

  • Is node splitting based on a fixed threshold? How is over-complication of the tree structure avoided?
  • When a new task has low similarity to the existing node gradients, how is the tree structure expanded (by adding new branches or increasing the depth)?
  • Is the theoretical relationship between the number of tree levels L and the number of tasks T O(log T)? How is L set in the actual experiments?
  • Compared with the O(d) parameter increment of O-LoRA (d is the adapter dimension), is the parameter growth rate of TreeLoRA strictly lower?
  • Is the sparsity rate of the sparse gradient updates related to the tree structure? How are sparsity and knowledge retention balanced?

Claims And Evidence

The central claim is that the tree structure can dynamically capture task similarity and reduce computational complexity. However, the paper does not explain how nodes are dynamically added or removed (e.g., the conditions that trigger splitting). The rationality of the structure is only demonstrated indirectly through visualization, and a mathematical description of the dynamic process is missing.

Methods And Evaluation Criteria

  • The construction of the tree structure depends on gradient similarity, but several points are unclear: (1) how the tree levels are updated when new tasks are inserted (incremental insertion vs. global reconstruction); (2) the threshold design for node splitting/merging; (3) the adaptive relationship between tree depth and the number/similarity of tasks.

  • Only CL methods such as HiDeLoRA are compared; no theoretical/experimental comparison is performed with standard LoRA (an independent adapter for each task) or O-LoRA (orthogonality-constrained adapters), so the advantages over these basic methods are not demonstrated.

Theoretical Claims

The regret bound analysis does not consider the adjustment cost of the dynamic tree structure. The theoretical model assumes a static task relationship, which is inconsistent with the actual dynamic scenario.

Experimental Design And Analyses

  • The evolution of the tree structure under a dynamic task flow (e.g., how the tree changes when 10 new classes are added in Split CIFAR-100).
  • Average adapter parameters per task vs. standard LoRA.
  • The impact of different tree depths/widths on performance.

Supplementary Material

Yes

Relation To Broader Scientific Literature

Essential References Not Discussed

  • Rusu et al. (2016), Progressive Neural Networks
  • Wang et al. (2023), DyLoRA: Dynamic Low-Rank Adaptation

Other Strengths And Weaknesses

Other Comments Or Suggestions

  • Add a parameter/FLOPs comparison experiment with standard LoRA/O-LoRA.
  • Discuss the additional overhead of tree-structure maintenance (e.g., the complexity of gradient-similarity computation).

Author Response

Thanks for your helpful comments! Below, we address your major technical questions and will revise the paper to improve clarity and resolve any potential misunderstandings.


Q1. Elaborate more on the construction of the tree structure, including update and expansion, threshold design, relationship between tree depth and number/similarity of tasks.

A1. Thanks for the question. The construction of the tree structure is explained in detail below:

  • Update of Tree Structure. After each task, we store the task-specific LoRA adapter (as in Section 3.3) and update the tree by inserting the adapter into the leaf node of the nearest branch via a depth-first search (DFS), thereby adding new nodes and expanding the tree. If the number of nodes exceeds the storage budget, we select the closest adapters and merge them into a single one (as in Appendix A.3).

  • Threshold Design. Inspired by the K-D tree structure [Bentley, 1990], the threshold $\delta$ does not require manual tuning. Specifically, at each split, the threshold is set to the median of the L1 distances between each task gradient and the mean gradient within the corresponding task group, ensuring balanced tree growth and adaptive partitioning.

  • Relationship Between Tree Depth and Number/Similarity of Tasks. We clarify that tree depth is not directly determined by the number of tasks, as a single node can contain multiple tasks, allowing it to scale to a large number of tasks. However, the depth should not exceed the number of transformer layers (as illustrated in Fig. 1), and it is treated as a tunable hyperparameter. Further empirical analysis of tree depth is provided in A2 in response to Reviewer bzQv.

We will add these details in the revision for more clarity.
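To make the median-split rule concrete, below is a minimal, self-contained Python sketch of such a K-D-tree-style split over task gradients. This is illustrative only; the function name, array shapes, and flattened-gradient representation are simplifications, not our actual implementation:

```python
import numpy as np

def split_task_group(grads):
    """Split a group of task gradients into two balanced subgroups.

    grads: (num_tasks, dim) array of flattened per-task gradients.
    The split threshold is the median L1 distance to the group's mean
    gradient, so no manual threshold tuning is needed and the two
    children stay (near-)balanced.
    """
    mean_grad = grads.mean(axis=0)
    dists = np.abs(grads - mean_grad).sum(axis=1)  # L1 distance per task
    threshold = np.median(dists)
    left = np.where(dists <= threshold)[0]   # tasks close to the group mean
    right = np.where(dists > threshold)[0]   # tasks far from the group mean
    return left, right, threshold

# Toy example: 6 tasks with 4-dimensional gradient summaries.
rng = np.random.default_rng(0)
grads = rng.normal(size=(6, 4))
left, right, threshold = split_task_group(grads)
```

Applying this split recursively to each subgroup yields the hierarchical partition; the median choice is what keeps the tree balanced without a hand-tuned $\delta$.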


Q2. No comparison with standard LoRA or O-LoRA.

A2. We believe there is a misunderstanding here: we have compared our approach with standard LoRA (aka SeqLoRA) and O-LoRA in our experiments, as presented in Table 3 of the submission PDF. The results demonstrate our improvements over these basic methods in both performance and efficiency. We have also added a FLOPs comparison with LoRA/O-LoRA; please refer to A5.


Q3. On Theoretical Claims "does not consider the adjustment cost of the dynamic tree structure... assumes a static task relationship"

A3. Thank you for the insightful comments. Our regret analysis focuses on a simplified scenario to provide foundational justification for the proposed algorithm. The elements you mention can certainly be incorporated in future work via modern online learning techniques. For instance, incorporating a switching cost would allow us to account for the cost of adjusting the tree structure, and adopting dynamic regret could help capture time-varying task relationships. Nonetheless, we believe these extensions are non-trivial to achieve: for instance, the minimax rate for MAB is $\Theta(\sqrt{T})$, whereas introducing a switching cost increases it to $\Theta(T^{2/3})$ [Dekel et al., STOC'14]. Our theoretical results serve as a first step toward tackling more complex scenarios, and we will include these points in the discussion of future work.
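For readers less familiar with the bandit machinery referenced here, a textbook lower-confidence-bound (LCB) arm selection for minimizing a distance-style cost can be sketched as follows. This is a generic illustration under simplified assumptions, not our exact estimator or update rule:

```python
import math

def lcb_select(counts, mean_dists, t, c=1.0):
    """Pick the arm (e.g., a previous task) with the smallest lower
    confidence bound on its observed distance: exploit arms with small
    mean distance while still exploring rarely sampled ones.

    counts[k]:     number of times arm k has been sampled
    mean_dists[k]: running mean of the distances observed for arm k
    t:             current round (t >= 1)
    c:             exploration strength
    """
    best_k, best_lcb = None, float("inf")
    for k in range(len(counts)):
        if counts[k] == 0:
            return k  # sample every arm at least once
        bonus = c * math.sqrt(math.log(t) / counts[k])
        lcb = mean_dists[k] - bonus
        if lcb < best_lcb:
            best_k, best_lcb = k, lcb
    return best_k
```

The switching-cost and dynamic-regret extensions discussed above would change how `mean_dists` is maintained and penalize changing the selected arm, which is precisely where the harder $\Theta(T^{2/3})$ rates arise.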


Q4. Theoretical relationship.

A4. In Theorem 1, the regret bound depends on both the number of tasks $N$ and the task complexity $J_n$ (which is controlled by the similarity between tasks). Consequently, as the number of tasks increases or the tasks become less similar, the performance of our method is expected to deteriorate, requiring more rounds to maintain and search the tree structure.


Q5. Other concerns about experiments, including:

  • [5-1] parameter/FLOPs comparison, additional overhead of tree structure, "is the parameter growth rate of TreeLoRA strictly lower (than O-LoRA)"
  • [5-2] evolution of tree structure under dynamic task flow
  • [5-3] impact of different tree depths on performance.

A5. Thanks for your suggestions.

  • [A5-1] We include a comparison using LLaMA-2-7B-Chat as an example, for training a single token on the 10th task, as shown in the table below:

| Method | FLOPs | Parameter Complexity |
| --- | --- | --- |
| OGD | 28×10⁹ | $\mathcal{O}(mn)$ |
| LoRA | 4.2×10⁶ | $\mathcal{O}((m+n)r)$ |
| O-LoRA | 4.2×10⁷ | $\mathcal{O}((m+n)rN)$ |
| TreeLoRA | 4.2×10⁶ | $\mathcal{O}((m+n)r + Nr)$ |

Here, $m$ and $n$ denote the dimensions of the transformer's parameter matrix, $r$ is the LoRA rank, and $N$ is the number of tasks.

  • [A5-2] We also add an analysis of the evolution of the tree structure under dynamic task flow, which provides a clearer illustration of how TreeLoRA captures task structures over time: https://anonymous.4open.science/r/TreeLoRA/scripts/rebuttal.jpg

  • [A5-3] Additionally, we conduct further experiments to evaluate the impact of tree depth on performance, please refer to A2 for Reviewer bzQv.
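The parameter-complexity column in [A5-1] can also be sanity-checked numerically. In the sketch below, the constants hidden in the O-notation are taken as 1, and the matrix shape and rank are illustrative choices, not the exact LLaMA-2-7B configuration:

```python
def param_counts(m, n, r, N):
    """Order-of-magnitude parameter counts implied by the complexity
    column above; treat these as illustrative, not exact model sizes."""
    return {
        "OGD": m * n,                      # full weight matrix
        "LoRA": (m + n) * r,               # one low-rank adapter pair
        "O-LoRA": (m + n) * r * N,         # all N task adapters kept active
        "TreeLoRA": (m + n) * r + N * r,   # one adapter plus O(Nr) tree state
    }

# Illustrative shapes: a 4096x4096 projection, rank 8, 10 tasks.
counts = param_counts(m=4096, n=4096, r=8, N=10)
```

Even with these toy numbers, the ordering matches the table: O-LoRA's cost grows multiplicatively in $N$, whereas TreeLoRA adds only an $Nr$ term on top of a single adapter.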


Thank you again for the helpful review. We will revise the paper accordingly and include the related references, such as Rusu et al. (2016) and Wang et al. (2023).

Final Decision

This paper introduces a hierarchical, gradient-similarity-based adapter structure (TreeLoRA) for continual learning with large pre-trained models. The main contributions include:

  • A novel hierarchical adapter structure that organizes task-specific LoRAs via a K-D tree guided by layer-wise gradient similarity, enabling scalable and efficient parameter reuse.
  • A bandit-based task similarity estimation mechanism and sparse gradient updates that improve learning efficiency in large vision and language models.
  • Extensive empirical validation across ViT and LLM benchmarks, showing improved performance and computational efficiency over strong baselines (e.g., O-LoRA, InfLoRA), alongside theoretical regret bounds supporting the algorithm’s design.

During the response period, the authors thoroughly addressed reviewer concerns, including comparisons with InfLoRA, SAPT, and TASL, ablations on tree depth, task order sensitivity, and LoRA depth. They also added experiments on long task sequences and 13B-scale LLMs. The authors also provided experiments in response to Reviewer bzQv's only remaining concern.

Given the paper’s clear motivation, and strong empirical results, I recommend acceptance. For the final version, the authors are encouraged to (1) further clarify how LoRA depth interacts with model architecture in scaling to LLMs, and (2) more explicitly highlight comparisons with non-LoRA-based baselines and plug-in strategies (e.g., replay, generative modules).