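PaperHub
Score: 6.4/10
Decision: Poster, 4 reviewers
Ratings: min 3, max 5, std 0.7
Individual ratings: 4, 4, 5, 3
Confidence: 4.0
Originality: 2.8 · Quality: 2.8 · Clarity: 2.0 · Significance: 2.5
NeurIPS 2025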

Hybrid Re-matching for Continual Learning with Parameter-Efficient Tuning

Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Continual Learning, Parameter-Efficient Tuning

Reviews and Discussion

Review
Rating: 4

This work proposes a hybrid re-matching method called HRM-PET for rehearsal-free continual learning with parameter-efficient tuning. Two types of re-matching are integrated into a whole framework. The direct re-matching is straightforward by replacing the incorrect initial task identity with the different predicted task identity. The confidence-based re-matching handles the more challenging case that direct re-matching fails to improve. The mismatched samples with wrong task identity are detected with a confidence threshold, and then the task identity of highest confidence is chosen from top-N identities. Moreover, cross-task instance relationship distillation is used to acquire task-invariant knowledge. Experiments on four datasets under five pre-trained settings demonstrate the competitive performance of HRM-PET.

Strengths and Weaknesses

Strengths:

  1. The experiment evaluation is extensive and solid. Results of tables and figures demonstrate the performance of the proposed method over SOTA algorithms.
  2. The re-matching idea between the current example (with its class label prediction) and the task ID appears to be novel. Two kinds of re-matching schemes seem to be complementary to each other. Moreover, cross-task instance relationship distillation enhances the overall performance.

Weaknesses:

  1. Clarity. I felt very confused at several places when I first read this paper. For example, if A and B are matched or re-matched, what are A and B in this hybrid re-matching method? It took me a long time to answer such questions. Also, some notations should be clearly defined before they are used.
  2. Quality. Personally, I think the techniques used in this work are straightforward and lack theoretical depth. For example, if $\hat{t}_f$ is not equal to $\hat{t}_s$, the class label is predicted simply with $\hat{t}_s$. In the confidence-based re-matching, a simple threshold method is adopted. In the instance relationship distillation, the standard measure of KL divergence is employed. From these aspects, these techniques do not appear to be novel.
  3. Originality. As stated above, the techniques used in this paper seem to be less original, except for the overall re-matching framework.

Minor points:

  1. Page 7, line 216. "between HRM-PET and three representive algorithms in Table 2". "three" should be corrected as "two".

Questions

  1. Fig. 5. I cannot understand the message that Fig. 5 is meant to convey. Which one is better, and why?
  2. Fig. 4. I don't know what this figure aims to showcase.
  3. Fig. 2 appears to be a little messy. Also, where is the instance relationship distillation module?
  4. Although re-matching improves the experimental performance for continual learning, is there any explanation or theory? Why does this method work?
  5. Can the re-matching procedure be cast in a more principled framework, such as probabilistic reasoning or a VAE (I don't know whether it would work; these are just examples)? Currently, this method reads like a set of "if-then-else" rules, lacking deeper thought.

Limitations

See above review for suggestions for improvement.

Final Justification

I thank the authors for answering my concerns and questions. More experimental results were provided in the author rebuttal, which make me satisfied with the quality of this work. It appears that the authors' responses resolve most of my concerns about this paper, and I would like to maintain my initial rating.

Formatting Issues

Not applicable.

Author Response

Thank you for your insightful comments and questions.

Q1: Clarity

A1: For PET-based CL methods, the model first predicts the task identity for each sample $x$. Then, the corresponding PET parameters $p_{\hat{t}_{f}}$ are retrieved from the parameter pool $P$ and used for the subsequent class prediction. When the predicted task identity equals the true task identity, we refer to this as matching. When they differ, we refer to it as mismatching. In our method, we correct the wrong task identity, which we term re-matching. We will add this clarification in the revision.
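
The matching and re-matching flow described in A1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `predict_task`, `predict_classes`, and `class_to_task` are hypothetical stand-ins for the task-identity predictor, the class prediction under a chosen set of PET parameters, and the class-to-task mapping, and the disagreement rule follows the direct re-matching summarized in the reviews.

```python
from typing import Callable, Dict, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def direct_rematch(
    x: object,
    predict_task: Callable[[object], int],
    predict_classes: Callable[[object, int], Sequence[float]],
    class_to_task: Dict[int, int],
) -> Dict[str, int]:
    """Toy direct re-matching: re-infer only when the two task identities disagree."""
    t_f = predict_task(x)                    # initial task identity
    y_hat = argmax(predict_classes(x, t_f))  # class prediction with the parameters of t_f
    t_s = class_to_task[y_hat]               # task identity implied by the predicted class
    extra_passes = 0
    if t_s != t_f:                           # disagreement is treated as a mismatch
        y_hat = argmax(predict_classes(x, t_s))  # one extra pass with the parameters of t_s
        extra_passes = 1
    return {"t_f": t_f, "t_s": t_s, "y_hat": y_hat, "extra_passes": extra_passes}


# Hypothetical toy setup: classes 0-1 belong to task 0, classes 2-3 to task 1.
CLASS_TO_TASK = {0: 0, 1: 0, 2: 1, 3: 1}
TOY_OUTPUTS = {0: [0.2, 0.2, 0.5, 0.1], 1: [0.05, 0.05, 0.1, 0.8]}

result = direct_rematch(
    x=None,
    predict_task=lambda x: 0,                     # the initial match picks task 0
    predict_classes=lambda x, t: TOY_OUTPUTS[t],  # canned distributions per task
    class_to_task=CLASS_TO_TASK,
)
```

In this toy run the initial match selects task 0, the predicted class belongs to task 1, and a single extra pass with the task-1 parameters produces the final prediction.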


Q2&Q3: Quality and Originality

A2&A3: Our HRM-PET is simple, yet effective and stable. The main contributions are notable:

Most importantly, we explore the core and long-existing problem in PET-based CL, the potential inaccuracy in the task-parameter matching process during the testing phase.

We propose a hybrid re-matching framework to improve the accuracy of task identity prediction. CTIRD alleviates the reliance on matching, facilitating shared knowledge without compromising task-specific knowledge.

Moreover, empirical results serve as evidence for the rationality and effectiveness of our method.


Q4: Explanation of Fig. 5

A4: Fig. 5 compares the attention regions of the model with (Ours) and without CTIRD (W/o). The first column displays attention maps generated using the correct task identity, while columns 2–10 correspond to attention maps produced using incorrect task identities. Two main observations can be drawn:

  1. For the first column, with CTIRD, the model still focuses on the key regions, indicating that CTIRD does not compromise the plasticity.

  2. For columns 2–10, even when incorrect task parameters are used, the model with CTIRD still attends to the critical features, e.g., the lizard’s head and limbs, the bird’s beak. Hence, CTIRD enhances knowledge sharing and reduces the model’s reliance on task matching, demonstrating the superiority of CTIRD.


Q5: Explanation of Fig. 4

A5: Fig. 4 illustrates the ratio of incorrectly and correctly matched samples that satisfy the threshold $\tau$, relative to the total number of actually mismatched and matched samples, respectively, for each task in the incremental sequence. Specifically, the blue curve shows that over 70% of all mismatched samples are detected by $\tau$, demonstrating that CRM can identify most mismatched samples during the incremental process. Additionally, the green curve indicates that 10–30% of correctly matched samples are misidentified as mismatches, highlighting the necessity of using Eq. 5 to determine whether task identity replacement is required.
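
To make the two curves in Fig. 4 concrete, the ratios could be computed along the following lines. This is a sketch under assumptions: a sample is taken to "satisfy" the threshold when its confidence falls below it, and the function name and the numbers are made up for illustration.

```python
from typing import Sequence, Tuple


def flag_ratios(
    confidences: Sequence[float], is_matched: Sequence[bool], tau: float
) -> Tuple[float, float]:
    """Return (share of mismatched samples flagged, share of matched samples flagged)."""
    flagged = [c < tau for c in confidences]  # assumption: low confidence => flagged
    n_mismatched = sum(1 for m in is_matched if not m)
    n_matched = sum(1 for m in is_matched if m)
    hits = sum(1 for f, m in zip(flagged, is_matched) if f and not m)
    false_alarms = sum(1 for f, m in zip(flagged, is_matched) if f and m)
    detect_ratio = hits / n_mismatched if n_mismatched else 0.0
    false_alarm_ratio = false_alarms / n_matched if n_matched else 0.0
    return detect_ratio, false_alarm_ratio


# Made-up confidences for six samples of one task; not data from the paper.
detect, false_alarm = flag_ratios(
    confidences=[0.2, 0.9, 0.4, 0.8, 0.3, 0.95],
    is_matched=[False, True, False, True, True, True],
    tau=0.5,
)
```

The first value corresponds to the blue curve (detected mismatches) and the second to the green curve (correct matches flagged by mistake).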


Q6: Instance relationship distillation in Fig. 2

A6: The module of instance relationship distillation is in the section with training at the bottom of Fig. 2. The loss of instance relationship distillation is between the old feature space and the new feature space. We will add the module name in the revision to improve clarity.
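
A6 places the instance relationship distillation loss between the old and the new feature space, and one reviewer notes that KL divergence is used. As a rough illustration only, the sketch below assumes that each instance's relation is a softmax over its dot-product similarities to the other instances in a batch, and that the loss is the KL divergence from the old-space relations to the new-space ones. The function names, the similarity choice, and the temperature are assumptions, not details confirmed by the paper.

```python
import math
from typing import List, Sequence


def _softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _relations(feats: Sequence[Sequence[float]], temperature: float) -> List[List[float]]:
    """For each instance, a softmax over its similarity to every other instance."""
    rows = []
    for i, f_i in enumerate(feats):
        sims = [
            sum(a * b for a, b in zip(f_i, f_j)) / temperature
            for j, f_j in enumerate(feats)
            if j != i
        ]
        rows.append(_softmax(sims))
    return rows


def relation_distillation_loss(
    old_feats: Sequence[Sequence[float]],
    new_feats: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> float:
    """Mean KL(old relations || new relations) over the instances in the batch."""
    old_rel = _relations(old_feats, temperature)
    new_rel = _relations(new_feats, temperature)
    total = 0.0
    for p_row, q_row in zip(old_rel, new_rel):
        total += sum(p * math.log(p / max(q, 1e-12)) for p, q in zip(p_row, q_row) if p > 0.0)
    return total / len(old_feats)


# Made-up 2-D features for three instances.
old_space = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_space = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
loss_same = relation_distillation_loss(old_space, old_space)
loss_shifted = relation_distillation_loss(old_space, new_space)
```

The loss vanishes when the relations between instances are unchanged and grows as the new feature space rearranges them.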


Q7: Theory of re-matching improving the experiment performance for continual learning

A7: The accuracy of task identity prediction has been shown to be critical for improving CL performance, as demonstrated in HiDe-Prompt [77]. Assume that, in the class-incremental learning (CIL) scenario, a total of $T$ tasks are defined as $D = \{D_1, D_2, \ldots, D_T\}$. For $D_t$ of task $t$, $\mathcal{X}_t$ and $\mathcal{Y}_t$ are the domain and label space of task $t$. Let $\mathcal{X}_t = \bigcup_{j} \mathcal{X}_{t,j}$ and $\mathcal{Y}_t = \{\mathcal{Y}_{t,j}\}$, where $j \in \{1, \ldots, |\mathcal{Y}_t|\}$ indicates the $j$-th class in task $t$. Given a pre-trained model $f_\theta$, CIL aims to learn $P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid D, \theta)$ for a sample $\boldsymbol{x} \in \bigcup_{k=1}^{t} \mathcal{X}_{k}$. Based on Bayes' theorem, the goal can be decomposed as:

$$P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid D, \theta) = P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid \boldsymbol{x} \in \mathcal{X}_{i}, D, \theta)\, P(\boldsymbol{x} \in \mathcal{X}_{i} \mid D, \theta)$$

where $P(\boldsymbol{x} \in \mathcal{X}_{i} \mid D, \theta)$ is the task identity inference probability and $P(\boldsymbol{x} \in \mathcal{X}_{i,j} \mid \boldsymbol{x} \in \mathcal{X}_{i}, D, \theta)$ is the within-task prediction probability. Hence, when the performance of task identity inference is improved, the performance of CIL will be enhanced. For more proof details, we recommend referring to HiDe-Prompt.
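
A small numeric example may help show why the decomposition ties overall accuracy to task identity inference. The numbers are purely illustrative and the function name is an assumption; the within-task probability is held fixed while the task-identity probability is varied.

```python
def joint_probability(p_task: float, p_class_given_task: float) -> float:
    """Product of the within-task probability and the task identity probability."""
    return p_class_given_task * p_task


# Illustrative numbers only: same within-task accuracy, different task identity accuracy.
before = joint_probability(p_task=0.70, p_class_given_task=0.90)
after = joint_probability(p_task=0.85, p_class_given_task=0.90)
improvement = after - before
```

With the within-task term unchanged, any gain in the task identity term carries over proportionally to the joint probability.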


Q8: More reasonable framework

A8: We sincerely thank the reviewer for the insightful suggestion. The idea of framing re-matching within a probabilistic framework is indeed inspiring. Assume the input is $\boldsymbol{x}$. The posterior probability that it belongs to class $y = c$ is $P(y = c \mid \boldsymbol{x}, D, \theta)$. We can further decompose it into task identity inference and within-task prediction:

$$P(y = (i,j) \mid \boldsymbol{x}, D, \theta) = P(y = (i,j) \mid \boldsymbol{x}, \text{task}=i, D, \theta) \cdot P(\text{task}=i \mid \boldsymbol{x}, D, \theta)$$

where $y = (i,j)$ is the $j$-th class in task $i$. After training with PET, we obtain a parameter pool $\boldsymbol{\mathcal{P}} = \{p_1, p_2, \dots, p_T\}$, where $p_i$ denotes the learned parameters specific to task $i$. Re-matching involves inference with multiple $p$. To aggregate these predictions in a principled manner, we draw upon Bayesian Model Averaging (BMA), which supports combining models under uncertainty about model selection. Applying BMA, the final posterior is expressed as:

$$P(y=(i,j) \mid \boldsymbol{x}, D, \theta, \boldsymbol{\mathcal{P}}) = \sum_{i=1}^{T} P(y=j \mid \boldsymbol{x}, \text{task}=i, \theta, p_i) \cdot P(\text{task}=i \mid \boldsymbol{x}, D, \theta)$$

However, not all task parameters should be weighted equally: we want the correct task identity to receive a higher weight. Therefore, we compute the weights $\phi_i$ from the conditions in our strategy, such as the confidence in CRM:

$$P(y=(i,j) \mid \boldsymbol{x}, D, \theta, \boldsymbol{\mathcal{P}}) = \sum_{i=1}^{T} P(y=j \mid \boldsymbol{x}, \text{task}=i, \theta, p_i) \cdot P(\text{task}=i \mid \boldsymbol{x}, D, \theta) \cdot \phi_i$$

In our paper, $\phi_i$ acts as a binary gate (i.e., $\phi_i \in \{0,1\}$). Our re-matching can be uniformly expressed by the above equation. We will include this insight in the revision.
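
The gated averaging in A8 can be illustrated with a short sketch. It assumes that inference with each task's parameters yields a distribution over all classes, leaves the scores unnormalized (which does not change the argmax), and uses made-up numbers; `gated_scores` is a hypothetical helper, not code from the paper.

```python
from typing import List, Sequence


def gated_scores(
    class_probs_per_task: Sequence[Sequence[float]],
    task_probs: Sequence[float],
    gates: Sequence[int],
) -> List[float]:
    """Unnormalized class scores: sum over tasks of class prob * task prob * gate."""
    n_classes = len(class_probs_per_task[0])
    scores = [0.0] * n_classes
    for probs, p_task, phi in zip(class_probs_per_task, task_probs, gates):
        for c in range(n_classes):
            scores[c] += probs[c] * p_task * phi
    return scores


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# Made-up numbers: 2 tasks, 4 classes; row i is the class distribution under p_i.
CLASS_PROBS = [[0.4, 0.3, 0.2, 0.1], [0.05, 0.05, 0.1, 0.8]]
TASK_PROBS = [0.6, 0.4]

pred_initial = argmax(gated_scores(CLASS_PROBS, TASK_PROBS, gates=[1, 0]))    # keep task 0
pred_rematched = argmax(gated_scores(CLASS_PROBS, TASK_PROBS, gates=[0, 1]))  # switch to task 1
pred_averaged = argmax(gated_scores(CLASS_PROBS, TASK_PROBS, gates=[1, 1]))   # plain averaging
```

A binary gate selects exactly one task's prediction, whereas setting every gate to 1 recovers plain model averaging over tasks.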


Comment

Dear Reviewer 4szz,

Please consider providing your valuable feedback to the authors, which will be truly helpful.

Thanks

Review
Rating: 4

This paper introduces HRM-PET, a novel replay-free continual learning method using Parameter-Efficient Tuning. To enhance inference-time parameter matching, HRM-PET proposes a Hybrid Re-matching mechanism that intelligently utilizes the model’s prediction distribution. It further integrates Cross-Task Instance Relation Distillation to cultivate task-agnostic knowledge and align feature representations. Comprehensive experiments demonstrate HRM-PET’s superior, state-of-the-art performance with minimal additional inference overhead.

Strengths and Weaknesses

Strengths

  1. The paper proposes an innovative and effective Hybrid Re-matching strategy (DRM & CRM) that cleverly leverages the prediction distribution to address the critical inference-time parameter matching inconsistency in PET methods.
  2. The paper provides comprehensive experimental validation and extensive ablations.
  3. The code is available

Weaknesses

  1. The paper lacks precise explanations for subscript notations in equations (3), (5), and (7), leading to comprehension difficulties. Specifically, $\tilde{d}(x)$ appears to denote a class-level distribution, while $\hat{d}(x)$ seems to represent a task-level distribution. However, the subscript representations in equations (5) and (7) vary, requiring explicit clarification of the meaning of each subscript.
  2. It is unclear which prompt/parameter (e.g., $\hat{t}_f$ or $\hat{t}_s$) is utilized to obtain $\tilde{d}(x)$ in equation (3). A clear statement specifying the parameter source for this prediction distribution is essential for reproducibility and understanding.
  3. While inference time is discussed, the paper lacks a comprehensive analysis of training time, which appears to significantly increase due to multiple additional forward passes.
  4. The ablation study does not sufficiently clarify the independent contributions and necessity of each re-matching module.
  5. Given that CRM appears to be an independent module from DRM, the rationale for performing DRM prior to CRM, rather than directly applying CRM for all mismatches, remains unclear and warrants further explanation or empirical validation.
  6. The paper exhibits several technical and typographical errors that detract from its formal presentation and readability. For instance, each equation should conclude with appropriate punctuation (e.g., a period or comma). Additionally, on line 115, "(LoRA)" appears to be incorrect. The phrase "Forfair, LoRA is employed as PET in HiDe-Prompt i.e. HiDe-LoRA and our method." on line 193 is unclear and grammatically awkward, requiring rephrasing for clarity.

Questions

See weaknesses.

Limitations

The proposed method may incur substantial computational overhead.

Final Justification

I thank the authors for their detailed rebuttal, which has effectively addressed my primary concerns. While I believe the paper could be further strengthened with more robust theoretical validation and additional refinement of the writing, the clarifications provided are sufficient. Accordingly, I have raised my score to 4.

Formatting Issues

No major formatting issues were observed in the paper.

Author Response

Thank you for your insightful comments and questions.

Q1: Explanations for subscript notations and distribution

A1: Clarification of subscript: $\hat{d}(x)$ and $\tilde{d}(x)$ are both class-level distributions. $\mathcal{T}$ is used to convert a class in $\hat{d}(x)$ into its task identity in Equations 5 and 7.

Clarification of distribution: $j$ in Equation 3 and $c_j$ in Equations 5 and 7 all represent the index of the category. We will unify the notation in the revision.


Q2: A clear statement specifying the parameter source for this prediction distribution

A2: In fact, we utilize $\hat{t}_{s}$ from DRM to obtain $\tilde{d}(x)$ in Equation 3. Since DRM has already corrected part of the task identities, using $\hat{t}_{s}$ reduces subsequent repeated computation.


Q3: A comprehensive analysis of training time

A3: During training, for each new task $t$, we compute the feature embedding $e_k(x)$ for each sample under task $t$ using the top-$K$ ($K=5$) previous task-specific parameters. Specifically, we first utilize the pre-trained task classifier $g_{\omega}$ to obtain the top-$K$ old task identities with the highest confidence scores for all samples. The time cost of a forward pass through the classifier is negligible.

Subsequently, before training, each sample undergoes $K$ inference passes to obtain $K$ features for knowledge distillation. Compared to training, the inference-only forward passes introduce minimal additional time overhead.

Experimentally, we measured the training time of baseline and HRM-PET on ImageNet-R on an RTX 3090 with a batch size of 64. As shown in the table, our method incurs only an additional 4.4% (i.e., 0.30 hours) of time overhead, which is acceptable. We will add the discussion in the revision.

| Method | Baseline | HRM-PET (ours) |
|---|---|---|
| Training Time | 6.75 hours | 7.05 hours |
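
The quoted overhead follows directly from the two training times. The snippet below simply reproduces that arithmetic; the function name is illustrative.

```python
def relative_overhead(baseline_hours: float, method_hours: float) -> float:
    """Extra time as a fraction of the baseline time."""
    return (method_hours - baseline_hours) / baseline_hours


extra_hours = 7.05 - 6.75                 # about 0.30 hours
overhead = relative_overhead(6.75, 7.05)  # about 0.044, i.e. roughly 4.4%
```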

Q4: The independent contributions and necessity of each re-matching module in the ablation study.

A4: As shown in the table, we conduct more detailed ablations on ImageNet-R. We have the following observations:

  1. Compared with the baseline, each re-matching module brings improvements, which demonstrates independence.

  2. Combining CRM and DRM achieves better performance compared with either CRM or DRM, which shows the necessity of each re-matching module.

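| Method | Sup-21K | iBOT-21K | iBOT-1K | DINO-1K | MoCo-1K |
|---|---|---|---|---|---|
| Baseline | 71.52 | 73.60 | 72.98 | 70.43 | 67.98 |
| Baseline+DRM | 72.60 | 74.09 | 73.58 | 71.12 | 68.69 |
| Baseline+CRM | 72.80 | 74.35 | 74.01 | 71.32 | 68.80 |
| Baseline+CTIRD | 72.18 | 74.08 | 73.64 | 71.04 | 68.12 |
| Baseline+DRM+CRM | 73.40 | 74.75 | 74.18 | 71.85 | 69.10 |
| Baseline+DRM+CRM+CTIRD | 73.86 | 75.23 | 74.64 | 72.32 | 69.32 |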

Q5: The rationale for performing DRM prior to CRM

A5: As shown in the table, we conduct ablations on the relationship between CRM and DRM. We make the following validation:

  1. Combining CRM and DRM achieves better performance than using either DRM or CRM alone, demonstrating their complementarity. In Figure 1(b), the confidence distributions of correctly and incorrectly matched samples partially overlap. DRM compensates for the limitations of CRM on these samples, so combining CRM and DRM further improves the matching.

  2. When CRM and DRM are used together, the combined gain is smaller than the sum of their individual gains relative to the baseline. This suggests a degree of overlap in the sets of samples selected for re-matching by the two methods. To minimize redundant computation, we apply CRM after DRM in our pipeline.

  3. The model's performance is not sensitive to the execution order of CRM and DRM, further supporting their complementarity and necessity.

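| Method | Sup-21K | iBOT-21K | iBOT-1K | DINO-1K | MoCo-1K |
|---|---|---|---|---|---|
| Baseline | 71.52 | 73.60 | 72.98 | 70.43 | 67.98 |
| Baseline+DRM | 72.60 | 74.09 | 73.58 | 71.12 | 68.69 |
| Baseline+CRM | 72.80 | 74.35 | 74.01 | 71.32 | 68.80 |
| Baseline+DRM+CRM | 73.40 | 74.75 | 74.18 | 71.85 | 69.10 |
| Baseline+CRM+DRM | 73.61 | 75.10 | 74.22 | 71.80 | 69.18 |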
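
The order-sensitivity claim can be checked against the last two rows of the table. The snippet below only subtracts the reported accuracies; the variable names are illustrative.

```python
BACKBONES = ["Sup-21K", "iBOT-21K", "iBOT-1K", "DINO-1K", "MoCo-1K"]
DRM_THEN_CRM = [73.40, 74.75, 74.18, 71.85, 69.10]  # row "Baseline+DRM+CRM"
CRM_THEN_DRM = [73.61, 75.10, 74.22, 71.80, 69.18]  # row "Baseline+CRM+DRM"

# Signed difference (CRM first minus DRM first), in accuracy points.
order_gap = {
    name: round(b - a, 2)
    for name, a, b in zip(BACKBONES, DRM_THEN_CRM, CRM_THEN_DRM)
}
largest_gap = max(abs(v) for v in order_gap.values())
```

Across the five pre-trained settings, the two orderings differ by at most 0.35 accuracy points, which is smaller than the gain either ordering shows over the baseline.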

Q6: The technical and typographical errors

A6: Thanks for the helpful comment. Line 193 states: For fairness, both our method and HiDe-Prompt use LoRA as the PET, and therefore HiDe-Prompt is referred to as HiDe-LoRA in the table. We promise to correct grammar, formulas, unclear expressions, and other technical and typographical errors in the revision.


Comment

Thank you very much for your kind feedback. I deeply appreciate thoughtful feedback. However, I still have a few concerns about the following:

1. The proposed method appears to share its core concept with a previously published approach [1], which raises a concern about its novelty. Furthermore, since models for subsequent tasks are not trained on samples from previous tasks, it is unclear how they maintain valid prediction distributions for these past tasks. Prior works [1,2] address this by modeling feature distributions to synthesize prototypes. Could the authors clarify the specific mechanism in their method that mitigates this issue?

[1] Sun, Hai-Long, et al. "Mos: Model surgery for pre-trained model-based class-incremental learning." AAAI. 2025.

[2] Zhou, Da-Wei, et al. "Expandable subspace ensemble for pre-trained model-based class-incremental learning." CVPR. 2024.

2. The motivation behind the CRM design requires further explanation. My understanding is that its effectiveness relies on both $E(\hat{d}(x))$ and $E(\tilde{d}(x))$ being low. Is this interpretation correct? It is unclear how the method handles cases where one metric is high and the other is low. The design seems to focus primarily on $E(\tilde{d}(x))$, which could lead to incorrect mismatch identification when $E(\hat{d}(x))$ is high. Could the authors elaborate on this design choice and its justification?

3. Could the authors comment on the scalability of the proposed method? Specifically, does the inference time increase with the number of incremental tasks? It appears that the re-matching process for a growing number of past tasks could significantly increase computational costs.

Comment

Thank you for your insightful and timely feedback.

Q1: Difference from MOS and the mechanism for maintaining valid prediction distributions for past tasks

A1:

Difference from MOS: Although MOS [1] also addresses the problem of inaccurate matching through its Self-Refined Retrieval Mechanism, its strategy and assumptions are fundamentally different from ours. MOS assumes that a sample is mismatched when the task identity $\hat{t}_{s}$ of the final predicted class differs from the task identity $\hat{t}_{f}$ corresponding to the selected PET parameters. Specifically, MOS performs inference with different task identities until $\hat{t}_{s} = \hat{t}_{f}$. However, not all mismatched samples satisfy this assumption. In the incremental process, each task-specific PET parameter and the final class classifier are trained jointly. When an input is inferred with the PET parameters of an incorrect task identity, the final classifier is likely to activate classes belonging to that same (erroneous) task. Therefore, even when $\hat{t}_{f}$ and $\hat{t}_{s}$ are the same, both may be wrong. For example, among all mismatched samples of ImageNet-R, 55.71% satisfy $\hat{t}_{f} \neq \hat{t}_{s}$. In contrast, our HRM-PET adopts a confidence-based re-matching, which relies on a different assumption: it detects and corrects mismatches based on prediction confidence rather than task identity consistency.

The mechanism for maintaining valid prediction distributions for past tasks: We follow the advanced PET-based method HiDe-Prompt to maintain valid prediction distributions for past tasks by modeling the categorical features as a Gaussian distribution and storing them. We will add these details and cite [1] and [2] as related works in the revision.


Q2: The motivation behind the CRM.

A2: Thank you for your helpful comments. We clarify the motivation of CRM as follows:

  1. CRM is divided into two stages: detecting mismatched samples, and finding the correct task identity for them. During the detection stage, a sample whose $E(\tilde{d}(x))$ meets the threshold is determined to be mismatched. This depends only on whether the task identity used in inference is correct, regardless of whether $E(\hat{d}(x))$ is high or low.

  2. When finding the correct task identity, we obtain the top-$N$ candidate task identities from $\hat{d}(x)$ based on the predicted probability, so $E(\hat{d}(x))$ potentially affects the recall of the correct task identity. On the one hand, we can increase $N$ to enlarge the candidate range. On the other hand, in Equation 5, we compare the confidence of the distributions obtained by inference with all candidate task identities to avoid selecting a wrong candidate identity.


Q3: Discussion of scalability

A3: To discuss the resource consumption on longer task sequences, we conduct extra experiments with 20 tasks on ImageNet-R, as shown in the table below. Since $N$ in CRM is set to 2, the inference cost of our method depends only on the number of samples flagged as mismatched, i.e., the number of samples that meet the confidence threshold. When the number of tasks increases to 20, the matching error rate is higher, which inevitably leads to an increase in inference time. However, we improve the utilization of inference resources: our method achieves a higher gain in accuracy per millisecond of increased inference time. The inference time per image increases by only 0.57 ms, while the accuracy increases by 3.1%. Certainly, further reducing the absolute inference cost remains a promising direction.

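| Method | $A_N \uparrow$ (10 tasks) | Time (ms) $\downarrow$ (10 tasks) | $A_N \uparrow$ (20 tasks) | Time (ms) $\downarrow$ (20 tasks) |
|---|---|---|---|---|
| Baseline | 71.60 | 2.81 | 68.03 | 2.81 |
| HRM-PET | 73.86 | 3.24 | 71.13 | 3.38 |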
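
The "accuracy per millisecond" statement can be reproduced from the table values. The helper below is an illustrative calculation, not part of the authors' code.

```python
def gain_per_ms(acc_base: float, acc_new: float, ms_base: float, ms_new: float) -> float:
    """Accuracy points gained per extra millisecond of inference time."""
    return (acc_new - acc_base) / (ms_new - ms_base)


gain_10_tasks = gain_per_ms(71.60, 73.86, 2.81, 3.24)  # 2.26 points for 0.43 ms
gain_20_tasks = gain_per_ms(68.03, 71.13, 2.81, 3.38)  # 3.10 points for 0.57 ms
```

Both settings come out a little above 5 accuracy points per extra millisecond, with the 20-task setting slightly higher, which is consistent with the statement in A3.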

Comment

I thank the authors for answering my comments.

I believe that the authors' responses can solve most of my concerns of this paper, and I would like to re-rate this paper to 4.

Comment

Dear reviewer b4of,

Thank you for kindly providing invaluable suggestions on this work. We authors greatly appreciate the efforts you have made to improve our manuscript. In the revised manuscript, we will add detailed discussions as suggested. If accepted, we will include b4of in our acknowledgments.

Best Regards, Authors of paper 16355

Review
Rating: 5

This manuscript focused on the problem of continual learning with pre-trained models (CL-PTM). Specifically, the authors focused on the performance of task identity matching during the inference process, in which the task identity was predicted for each test sample, and then the task-specific parameters (e.g., prompts) were selected to make the final classification. The authors observed that the task-identity match between test samples and task-specific parameters was still challenging in the existing methods, and usually suffered from a high mismatching rate. To address this issue, the authors proposed HRM-PET, a parameter-efficient tuning method based on a re-matching strategy, to improve the matching accuracy of the existing methods. Specifically, the predicted distribution obtained with the parameters selected through the initial matching was further used to improve the matching results, once by direct re-matching and once by confidence-based re-matching. Besides, the authors further introduced cross-task instance relationship distillation to better acquire the task-invariant knowledge.

Strengths and Weaknesses

Strengths

  1. This study investigated the core and long-existing problem in CL-PTM, the potentially inaccurate task-parameter matching process during the test phase, which is crucial in this area.
  2. The motivation of the proposed method was clear and evidently supported by empirical observations (e.g., Figure 1-a).
  3. The experimental part is relatively complete. Most of the recent baseline methods and the commonly used benchmarks are included. The ablation studies were clear and supportive. The results succeed in supporting the motivation of the proposed method.

Weaknesses

  1. One intrinsic trait of the proposed method is the need for extra inference compared to existing methods, which use only the initial matching. However, according to the empirical analysis, the extra inference process does not seem to introduce much computational overhead.
  2. The expressions of some detailed steps are confusing. More clarification can be made to help the readers better understand the technical details. See the Questions part for more details.

Questions

  1. About the content between Lines 148–156: I find Section 3.2 clear until Eq. (5), but I struggle to grasp the intuition and derivation behind it. Could the authors elaborate on this equation in more detail? I have checked the supplementary materials but did not find a thorough explanation.
  2. Missing training details: Some training details are lacking, which makes it difficult to evaluate the computational complexity. For example, in Section 3.3, a subset $\mathbf{s}_i$ is selected for $x_i$. However, the computation of this subset, as well as the subsequent steps in Eq. (8), seems to rely on $e_k(x)$, i.e., inference using previous task parameters. Does this mean that for each new task $t$, we need to compute the feature embedding $e_k(x)$ for each sample under task $t$ using all previous task-specific parameters $p_k$? If so, this may introduce additional training overhead. Could the authors provide clarification and, if possible, quantitative evidence? It would also be helpful to include an algorithmic summary in the supplementary material.
  3. Minor expression improvements: In the first line of Eq. (7), I suggest using $\hat{d}_{c_j}(x_i)$ instead of $\hat{d}_{c_j}$ to improve clarity.

Overall, I do not have major concerns about this work. I look forward to the authors’ clarifications, which will inform my evaluation during the rebuttal phase and further reviewer discussions.

Limitations

N/A

Final Justification

I appreciate the effort from the authors during the rebuttal period. After carefully reading the responses from the authors and the reviews from my colleagues, I decided to raise my rating to "Accept".

Formatting Issues

N/A

Author Response

Thank you for your insightful comments and questions.

Q1: Clarification of inference time

A1: The minimal additional inference time overhead is mainly attributed to the following three factors:

  1. Not all samples require additional inference in the re-matching. For DRM, we perform extra inference only on samples where $\hat{t}_{f}$ is not equal to $\hat{t}_{s}$. For CRM, re-matching is conducted on samples whose confidence scores satisfy $\tau$. Hence, the average inference time per image increases only marginally. In the table below, we provide the ratio of samples undergoing re-matching under each strategy for clearer interpretation.

     | Method | $A_N \uparrow$ | Time (ms) $\downarrow$ | Ratio (%) |
     |---|---|---|---|
     | Baseline | 71.60 | 2.81 | - |
     | Baseline+DRM | 72.60 | 2.91 | 12.2 |
     | Baseline+CRM+DRM | 73.86 | 3.24 | 30.3 |

  2. In CRM, $N$ is set to 2, so only one additional forward pass through the ViT is required.
  3. In the baseline, each sample requires two forward passes through the ViT: one for task identity prediction and another for class prediction. The re-matching process involves only a single additional forward pass through the ViT, so the inference time does not double relative to the baseline.
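
The three factors above suggest a back-of-envelope estimate of the average inference time. The sketch below rests on assumptions that are not stated in the rebuttal: the two baseline passes are taken to cost the same, each re-matched sample is charged exactly one extra pass, and the Ratio column is read as a percentage of samples.

```python
def estimated_time_ms(baseline_ms: float, rematch_ratio: float, baseline_passes: int = 2) -> float:
    """Baseline time plus one extra pass for the re-matched share of samples."""
    per_pass_ms = baseline_ms / baseline_passes  # assumption: every pass costs the same
    return baseline_ms + rematch_ratio * per_pass_ms


est_drm = estimated_time_ms(2.81, 0.122)      # reported: 2.91 ms
est_crm_drm = estimated_time_ms(2.81, 0.303)  # reported: 3.24 ms
```

Under these assumptions the estimates land within about 0.1 ms of the reported times, which fits the explanation that only the re-matched fraction of samples pays for an extra pass.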

Q2: Explanation of Eq. (5)

A2: Eq. (5) describes how to find a more appropriate task identity when the sample $x$ is mismatched. Specifically, we first obtain the top-$N$ classes with the highest probabilities from the initial matching prediction distribution $\hat{d}(x)$:

$$\Gamma = \underset{\{c_j\}_{j=1}^{N}}{\arg\max}\ \hat{d}_{c_j}(x)$$

where $\Gamma$ is the set of the top-$N$ classes. Candidate task identities are obtained by converting all classes in $\Gamma$ to task identities with $\mathcal{T}$. Then, we compare the confidence scores $E$ of the final prediction distributions generated by inference with the parameters $p$ corresponding to all candidate task identities. The task identity with the highest confidence is selected as the corrected task identity $\hat{t}_{s}$:

$$\hat{t}_{s} = \underset{i \in \Gamma}{\arg\max}\ E\big(g(h(x; p_{\mathcal{T}(i)}, \theta_{ptm}); \theta_{g})\big)$$

The final class prediction $\hat{y}_{CRM}$ is derived from the prediction distribution corresponding to $\hat{t}_{s}$:

$$\hat{y}_{CRM} = \underset{i}{\arg\max}\ g(h(x; p_{\hat{t}_{s}}, \theta_{ptm}); \theta_{g})$$
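
The procedure described for Eq. (5) might look roughly like the following sketch. Several points are assumptions rather than details from the paper: a sample counts as flagged when its confidence is below the threshold, the distribution already computed for the current task identity is reused instead of recomputed, and `predict_classes`, `class_to_task`, and `confidence` are hypothetical stand-ins (the maximum probability is used as the confidence in the toy run).

```python
from typing import Callable, Dict, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def confidence_rematch(
    x: object,
    d_hat: Sequence[float],    # class distribution from the initial matching
    d_tilde: Sequence[float],  # class distribution under the current task identity
    t_current: int,
    predict_classes: Callable[[object, int], Sequence[float]],
    class_to_task: Dict[int, int],
    confidence: Callable[[Sequence[float]], float],
    tau: float,
    top_n: int = 2,
) -> Tuple[int, int]:
    """Return (task identity, class prediction) after a toy confidence-based re-match."""
    if confidence(d_tilde) >= tau:  # assumption: confident samples keep their match
        return t_current, argmax(d_tilde)
    ranked = sorted(range(len(d_hat)), key=d_hat.__getitem__, reverse=True)
    candidates = sorted({class_to_task[c] for c in ranked[:top_n]})
    best_task, best_probs, best_conf = t_current, d_tilde, float("-inf")
    for t in candidates:
        # Reuse the distribution that is already available for the current identity.
        probs = d_tilde if t == t_current else predict_classes(x, t)
        conf = confidence(probs)
        if conf > best_conf:
            best_task, best_probs, best_conf = t, probs, conf
    return best_task, argmax(best_probs)


# Hypothetical toy setup: classes 0-1 belong to task 0, classes 2-3 to task 1.
CLASS_TO_TASK = {0: 0, 1: 0, 2: 1, 3: 1}
TOY_OUTPUTS = {0: [0.35, 0.25, 0.30, 0.10], 1: [0.05, 0.05, 0.10, 0.80]}

task_id, y_crm = confidence_rematch(
    x=None,
    d_hat=TOY_OUTPUTS[0],
    d_tilde=TOY_OUTPUTS[0],
    t_current=0,
    predict_classes=lambda x, t: TOY_OUTPUTS[t],
    class_to_task=CLASS_TO_TASK,
    confidence=max,
    tau=0.5,
)
```

In the toy run the top-2 classes map to tasks 0 and 1, so only one extra inference (with the task-1 parameters) is needed, and the more confident task-1 prediction is selected.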


Q3: Missing training details

A3: In fact, for each new task $t$, we compute the feature embedding $e_k(x)$ for each sample under task $t$ using the top-$K$ ($K=5$) previous task-specific parameters, rather than all old parameters. Specifically, we first utilize the pre-trained task classifier $g_{\omega}$ to obtain the top-$K$ old task identities with the highest confidence scores for all samples. The time cost of a forward pass through the classifier is negligible. Subsequently, before training, each sample undergoes $K$ inference passes to obtain $K$ features for knowledge distillation. Compared to training, the inference-only forward passes introduce minimal additional time overhead.
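
The top-K selection described above could be sketched as follows. The helper names, the score values, and the dummy feature extractor are hypothetical; only the idea of running one inference-only pass per selected old task, before training, is taken from the response.

```python
from typing import Callable, Dict, List, Sequence


def top_k_old_tasks(task_scores: Sequence[float], k: int = 5) -> List[int]:
    """Indices of the k old tasks with the highest task-classifier scores."""
    ranked = sorted(range(len(task_scores)), key=task_scores.__getitem__, reverse=True)
    return ranked[:k]


def old_task_features(
    x: object,
    task_scores: Sequence[float],
    extract_features: Callable[[object, int], Sequence[float]],
    k: int = 5,
) -> Dict[int, Sequence[float]]:
    """One inference-only pass per selected old task, run once before training."""
    return {t: extract_features(x, t) for t in top_k_old_tasks(task_scores, k)}


# Made-up task-classifier scores over 8 old tasks and a dummy feature extractor.
scores = [0.05, 0.30, 0.02, 0.20, 0.10, 0.08, 0.15, 0.10]
features = old_task_features(None, scores, lambda x, t: [float(t)], k=5)
```

Because K is fixed, the number of extra passes per sample stays constant as the number of old tasks grows.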

Experimentally, we measure the training time of baseline and HRM-PET on ImageNet-R on an RTX 3090 with a batch size of 64. As shown in the table, our method incurs only an additional 4.4% (i.e., 0.30 hours) of time overhead, which is acceptable. We will add the discussion in the revision.

| Method | Baseline | HRM-PET (ours) |
|---|---|---|
| Training Time | 6.75 hours | 7.05 hours |

Q4: The expressions of some detailed steps are confusing

A4: Thanks for the helpful suggestion. We promise to fix the confusing expression in the revision.


Comment

I appreciate the response from the authors. After reading it, my questions have been answered. Before the internal discussion period, I decided to maintain my current positive rating for this manuscript.

Review
Rating: 3

This paper studies how to perform continual learning with a pretrained model using Parameter-Efficient Tuning (PET) modules. It proposes some techniques (direct re-matching + confidence-based re-matching) to improve the task identity prediction when selecting the task-specific PET module during inference. It also proposes a distillation loss for training to encourage the learning of shared knowledge among the PET modules of different tasks.

The paper presents an extensive evaluation of the proposed method, compared against several baseline methods, on four datasets and different pretrained models. It also presents ablation studies to examine the effectiveness of each algorithm component as well as key hyperparameters.

Strengths and Weaknesses

Strengths

  • The idea of direct rematching (section 3.2.1) is simple and effective in improving task identification prediction.
  • The paper includes an extensive comparison with baseline methods on four datasets with different pretrain models.
  • The paper provides ablation studies to investigate the effectiveness of the three proposed components.
  • The paper analyzes the effect of some of the key hyperparameters.

Weaknesses

  • Confidence plays an important role in the proposed method, as it is used in both re-matching and distillation. The authors mention the use of generalized entropy to compute confidence. But in the appendix, it seems that different confidence functions are used for different datasets (MaxLogit for CIFAR-100). It is unclear how to select the confidence measure for a given dataset. It would be better to include a systematic study of the effect of the confidence measure.

  • The proposed method noticeably increases inference time. Although the authors mention that the absolute increase is small (0.4 ms), the relative increase is about 15% (from 2.8 to 3.2 ms). Inference cost is already considered a big and important problem when employing foundation models in practice.

  • The proposed method introduces many hyperparameters, e.g., $\gamma$, $M$, $\tau$, $N$, $K$, $\lambda$. Selecting a suitable combination of these hyperparameters for a dataset is non-trivial.

  • There is no error bar or confidence interval or standard deviation in the ablation studies, e.g. Table 3, 4 and Figure 6.

Questions

  1. Confidence is a key component of the proposed method and is used in both the re-matching step and the distillation step (e.g., Eq. 3 and 7). To compute confidence, the paper mentions the use of the generalized entropy function with additional hyperparameters $\gamma$ and $M$. Why not use other hyperparameter-free confidence measures, like predicted probability or entropy? How are $\gamma$ and $M$ selected? What is the interplay between the generalized entropy hyperparameters and the threshold parameter?

  2. The paper mentioned that the proposed method can be applied to different PETs. Any results to support this claim?

  3. As the proposed method involves many algorithm-specific hyperparameters, how are these algorithm-specific hyperparameters selected? What about baseline methods?

Limitations

yes, but a bit short.

Final Justification

The paper is technically sound. I acknowledge the strengths of the paper mentioned by other reviewers. However, due to the increased inference time, the limited theoretical depth, and the introduction of algorithm-specific hyperparameters, I maintain my initial rating of 3 at this stage.

Formatting Issues

No.

Author Response

Thank you for your insightful comments and questions.

Q1: A systematic study on the effect of the confidence measure

A1: To explore the effect of the confidence measure, we conduct ablation experiments on different post-hoc confidence measures in the table.

  1. Overall, our method is not sensitive to different confidence calculation methods.

  2. Generalized entropy (GEN) performs best across most settings. This is attributed to its design for semantic-shift scenarios, i.e., detecting inputs whose semantic categories are absent from the training set [48]. In CL, the PET parameters for task $t$ are trained only on the semantic categories of task $t$. When the predicted task identity $\hat{t}$ does not match the true task $t$ for a sample $x$ from task $t$, the semantic class of $x$ is absent from the training set of task $\hat{t}$. Therefore, GEN is more advantageous in detecting mismatched samples.

  3. On CIFAR-100, performance is near the upper bound with few mismatched samples. MaxLogit with a high threshold effectively filters mismatches, so we adopt it for better performance.

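| Confidence | Sup-21K (CIFAR-100) | iBOT-21K (CIFAR-100) | Sup-21K (ImageNet-R) | iBOT-21K (ImageNet-R) | Sup-21K (ImageNet-A) | iBOT-21K (ImageNet-A) |
| --- | --- | --- | --- | --- | --- | --- |
| MSP | 89.23 | 89.20 | 73.73 | 74.65 | 44.12 | 40.59 |
| MaxLogit | 89.45 | 89.70 | 73.77 | 74.97 | 44.20 | 40.60 |
| Energy | 89.24 | 89.29 | 73.60 | 75.12 | 44.14 | 40.73 |
| GEN | 89.35 | 89.55 | 73.86 | 75.23 | 44.28 | 40.88 |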
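
For readers unfamiliar with the four post-hoc scores compared above, the following is a minimal sketch of their commonly used definitions, computed from a single logit vector. It is not the authors' code: the value `gamma=0.1` is an assumption (the rebuttal only states that $\gamma$ lies in $(0, 1)$), while `top_m=20` follows the value of $M$ stated in the rebuttal. In every case a higher score means higher confidence.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(logits: Sequence[float]) -> float:
    """Maximum softmax probability."""
    return max(softmax(logits))


def max_logit(logits: Sequence[float]) -> float:
    """Largest raw logit."""
    return max(logits)


def energy(logits: Sequence[float]) -> float:
    """Negative free energy, i.e. log-sum-exp of the logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def gen(logits: Sequence[float], gamma: float = 0.1, top_m: int = 20) -> float:
    """Negative generalized entropy over the top-M probabilities.

    gamma=0.1 is an assumed value; top_m=20 follows the rebuttal.
    """
    probs = sorted(softmax(logits), reverse=True)[:top_m]
    return -sum((p ** gamma) * ((1.0 - p) ** gamma) for p in probs)


# A peaked prediction should score higher than a flat one under all four.
peaked = [10.0, 0.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0]
for score in (msp, max_logit, energy, gen):
    print(score.__name__, score(peaked) > score(flat))
```

Because the four scores live on different scales, a threshold tuned for one of them does not transfer to another.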

Q2: Increasing inference time

A2: Although inference time increases in relative terms, we achieve state-of-the-art performance across multiple datasets and pretraining settings. We acknowledge the practical significance of inference cost and investigate the accuracy-efficiency trade-off in Section 4 of the supplementary material. Further improving efficiency remains a promising direction for future work.


Q3&Q7: Selection of hyperparameters

A3&A7: Our method is robust to most hyperparameters, and their selection is generally straightforward:

  1. $\gamma$ and $M$ come from generalized entropy (GEN). Following the GEN paper, we set $\gamma$ in $(0, 1)$. $M$ denotes the number of highest probabilities in the distribution that are used (top-$M$). Given that the minimum number of classes encountered during the incremental process is 20 (in the second task) for all datasets, we set $M$ to 20.
  2. Given the inference cost, $N$ is set to 2 for all datasets. As shown in Section 4 of the supplementary material, $N$ governs the trade-off between accuracy and efficiency and can be adjusted based on available deployment resources.
  3. $K$ and $\lambda$ in CTIRD are 5 and 0.2 for all datasets. We use a common setting for $K$, i.e., top-5. In Eq. 8, $\mathcal{L}_{\mathrm{CTIRD}}$ is the sum of $K$ KL-divergence losses in knowledge distillation. Drawing from prior work such as IRD [7], where the coefficient for a single distillation term is typically set to 1, we scale the overall loss by $1/K$, resulting in $\lambda = 1/K = 0.2$. The experiment in Fig. 6 verifies the effectiveness of our choice.
  4. $\tau$ affects the accuracy of mismatch detection. We search for the optimal value on the validation set. Overall, our method is robust to $\tau$. As shown in Fig. 6 (c), treating all samples as mismatches with $\tau=10^3$ still yields an improvement of 0.77 compared to selecting none ($\tau=-10^3$).
  5. For the baseline methods, we employ the same selection strategy with identical hyperparameters to ensure a fair comparison.
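
To make the roles of $\tau$ and $N$ concrete, here is a minimal sketch of the inference-time decision as it can be read from this discussion: direct re-matching (DRM) adopts the second task prediction when it disagrees with the first, and confidence-based re-matching (CRM) flags a sample as mismatched when its confidence falls below $\tau$, then picks the most confident identity among the top-$N$ candidates. The function name, its signature, and the `confidence` callable are hypothetical, and the paper's actual procedure may differ in detail.

```python
from typing import Callable, Sequence


def hybrid_rematch(t_first: int,
                   t_second: int,
                   top_n_tasks: Sequence[int],
                   confidence: Callable[[int], float],
                   tau: float) -> int:
    """Return the task identity used for the final prediction.

    t_first:     initial task identity
    t_second:    task identity implied by the class predicted with the
                 PET module of t_first
    top_n_tasks: N candidate task identities
    confidence:  assumed score of the prediction made under a given task
    tau:         mismatch-detection threshold
    """
    # DRM: the two predictions disagree, so adopt the second one.
    if t_second != t_first:
        return t_second
    # CRM step 1: a sufficiently confident sample is treated as matched.
    if confidence(t_first) >= tau:
        return t_first
    # CRM step 2: among the candidates (including the current identity),
    # keep the one with the highest confidence.
    candidates = list(top_n_tasks)
    if t_first not in candidates:
        candidates.append(t_first)
    return max(candidates, key=confidence)


# Hypothetical per-task confidences for one test sample.
conf = {0: 0.2, 1: 0.9, 2: 0.5}
print(hybrid_rematch(0, 2, [0, 1], conf.__getitem__, tau=0.6))  # 2
print(hybrid_rematch(0, 0, [0, 1], conf.__getitem__, tau=0.6))  # 1
```

In this sketch a very large $\tau$ flags every sample and a very negative $\tau$ flags none, which matches the two extremes quoted for Fig. 6 (c). Each extra candidate costs one additional forward pass, which is why $N$ trades accuracy against inference time.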

Q4: The standard deviation in the ablation studies

A4: As shown in the three tables below, we supplement the standard deviations of three random seeds for some experiments in Tab. 3, 4, and Fig. 6 (a). The conclusions drawn remain consistent with those in the initial draft. We will include all standard deviations in the revision.

Tab. 3

| Method | Sup-21K | iBOT-21K |
| --- | --- | --- |
| Baseline | 71.47 ± 0.26 | 73.49 ± 0.31 |
| Baseline + DRM | 72.55 ± 0.10 | 74.11 ± 0.22 |
| Baseline + DRM + CRM | 73.38 ± 0.21 | 74.80 ± 0.15 |
| Baseline + DRM + CRM + CTIRD | 73.86 ± 0.14 | 75.23 ± 0.21 |

Tab. 4

| Distillation | CIFAR-100 (Sup-21K) | CIFAR-100 (iBOT-21K) | ImageNet-R (Sup-21K) | ImageNet-R (iBOT-21K) |
| --- | --- | --- | --- | --- |
| Logits | 87.70 ± 0.14 | 88.10 ± 0.20 | 72.79 ± 0.28 | 74.32 ± 0.33 |
| Features | 87.29 ± 0.22 | 83.78 ± 0.18 | 71.19 ± 0.16 | 71.66 ± 0.13 |
| IRD* | 88.61 ± 0.15 | 88.86 ± 0.26 | 73.22 ± 0.18 | 74.68 ± 0.30 |
| CTIRD | 89.45 ± 0.23 | 89.70 ± 0.13 | 73.86 ± 0.14 | 75.23 ± 0.21 |

Fig. 6 (a)

| PTM | 0.0 | 0.1 | 0.2 | 0.3 |
| --- | --- | --- | --- | --- |
| Sup-21K | 73.38 ± 0.21 | 73.40 ± 0.19 | 73.86 ± 0.14 | 73.18 ± 0.23 |
| iBOT-21K | 74.80 ± 0.15 | 75.08 ± 0.20 | 75.23 ± 0.21 | 74.96 ± 0.17 |
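
As a small illustration of how "mean ± standard deviation" entries like those above are typically produced from per-seed accuracies, consider the sketch below. The per-seed values are hypothetical, and whether the authors report the sample or the population standard deviation is not stated; the sketch assumes the sample version.

```python
import statistics
from typing import Sequence


def summarize(per_seed_acc: Sequence[float]) -> str:
    """Format per-seed accuracies as 'mean ± std' with two decimals."""
    mean = statistics.mean(per_seed_acc)
    # Sample standard deviation (divides by n - 1); statistics.pstdev
    # would give the population version instead.
    std = statistics.stdev(per_seed_acc)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical accuracies from three random seeds.
runs = [73.70, 73.90, 74.00]
print(summarize(runs))
```

With only three seeds the two conventions differ noticeably, so stating which one is used would help readers compare the reported deviations.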

Q5: The hyperparameters of generalized entropy

A5: Considering its strong performance and its advantages in detecting semantic drift, we choose GEN as the main confidence function. Following the GEN paper, we set $\gamma$ in $(0, 1)$. $M$ denotes the number of highest probabilities in the distribution that are used (top-$M$). Given that the minimum number of classes encountered during the incremental process is 20 (in the second task), we set $M$ to 20. GEN's hyperparameters influence the range and distribution of confidence scores, which in turn affect the optimal threshold. By fixing $\gamma$ and $M$, we keep the confidence scale consistent, enabling a more stable threshold.


Q6: Results of different PETs

A6: As shown in the table, using prompts as the PET module, we report $A_N$ with two different pre-trained weights on the ImageNet-R dataset, compared with state-of-the-art prompt-based methods. Our method HRM-PET still achieves state-of-the-art performance.

| Method | iBOT-21K | DINO-1K |
| --- | --- | --- |
| CPrompt | 64.64 ± 0.87 | 68.25 ± 1.43 |
| HiDe-Prompt | 70.83 ± 0.17 | 68.11 ± 0.18 |
| HRM-PET (Prompt version) | 72.11 ± 0.19 | 69.55 ± 0.28 |

Comment

Thank you for your detailed rebuttal. The analysis of the confidence measures and hyperparameters is particularly reassuring.

After revisiting the paper and considering the other reviewers' comments, I’d appreciate the authors' insights on the following:

  • Blurry Class-Incremental Learning: Could the algorithm be adapted to a blurry class-incremental setting where class labels may overlap across tasks?

  • Label Noise Sensitivity: How does label noise in the training data affect performance? Could the re-matching scheme potentially amplify the impact of label noise?

In addition, regarding the impact of the proposed method, it would be better to mention explicitly in the paper that there is a line of PET-based CL works (e.g. LAE, inferLoRA) that do not maintain task-specific parameters and thus do not require task identification in the first place.

Comment

We sincerely appreciate your insightful comments and will add the following discussion to the revision.

Q1: Blurry Class-Incremental Learning

A1: Our algorithm can be applied to a blurry class-incremental learning scenario with some minor modifications:

  1. In the blurry class-incremental learning scenario, some classes are associated with multiple task identities. However, in the initial task identity prediction of the baseline we follow, the model first predicts the class and then maps the class to a task identity with $\mathcal{T}$, which conflicts with the blurry class-incremental learning setting. To address this issue, we revise the initial task identity prediction by adopting the key-value mechanism from DualPrompt. Specifically, during training, a learnable key is maintained for each task, and the input image features are used as a query to retrieve the corresponding task identity through key-value matching.

  2. Based on the key-value mechanism, we make the following modifications to our algorithm: for DRM, we use the final predicted features as the query to retrieve $\hat{t}_s$, and then check whether $\hat{t}_s$ equals $\hat{t}_f$. For CRM and CTIRD, key-value similarity is used to obtain the top-$N$ or top-$K$ task identities in Equation 5 and Equation 7.
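
A minimal sketch of the key-value retrieval described in the two steps above, assuming one key vector per task and cosine similarity as the matching score. The similarity choice, the function names, and the toy vectors are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_tasks(query: Sequence[float],
                   task_keys: Sequence[Sequence[float]],
                   top_n: int = 1) -> List[int]:
    """Return the top_n task identities whose keys best match the query."""
    sims = [cosine(query, key) for key in task_keys]
    ranked = sorted(range(len(task_keys)),
                    key=lambda t: sims[t], reverse=True)
    return ranked[:top_n]


# Toy example: three task keys and one query feature vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.9, 0.1]
print(retrieve_tasks(query, keys, top_n=1))  # [0]
print(retrieve_tasks(query, keys, top_n=2))  # [0, 2]
```

With `top_n=1` this plays the role of the initial task identity prediction, while larger values supply the candidate lists that CRM and CTIRD need. A class shared by several tasks no longer has to be mapped to a single task identity.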

As shown in the table, we present an $A_N$ comparison of our method with the adapted version of HiDe-LoRA under the blurry class-incremental learning scenario. We adopt the dataset split from [1] and the iBOT-21K pretraining setting on ImageNet-R. * indicates that the key-value mechanism has been incorporated. As can be observed, our method also achieves the best performance.

| Method | HiDe-LoRA* | HRM-PET* (ours) |
| --- | --- | --- |
| $A_N$ | 64.98 | 66.17 |

[1] Moon, Jun-Yeong, et al. "Online class incremental learning on stochastic blurry task boundary via mask and visual prompt tuning." Proceedings of the IEEE/CVF international conference on computer vision. 2023.


Q2: Label Noise Sensitivity

A2: As shown in the table below, we conduct experiments on ImageNet-R with iBOT-21K by injecting label noise at different ratios. The noise reassigns {0%, 10%, 20%} of the samples in the dataset to other labels with uniform probability. Our method still achieves the best performance under all noise ratios. This benefits from comparing confidence scores in CRM to decide whether a detected mismatched sample actually requires task identity replacement, which alleviates the interference of noise on threshold-based mismatch detection. Moreover, instance relationship distillation in CTIRD is more robust to noise than logit or feature distillation, and effectively learns valuable knowledge from previous tasks.

| Method | 0% | 10% | 20% |
| --- | --- | --- | --- |
| HiDe-LoRA | 73.40 | 68.70 | 63.55 |
| HRM-PET (ours) | 75.23 | 70.13 | 64.81 |
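
The noise protocol described above (a fixed fraction of samples reassigned to a different label chosen uniformly at random) can be sketched as follows. This is an illustrative implementation under that reading, not the authors' script; the function name and the seed handling are assumptions.

```python
import random
from typing import List, Sequence


def inject_label_noise(labels: Sequence[int],
                       num_classes: int,
                       noise_ratio: float,
                       seed: int = 0) -> List[int]:
    """Flip a fraction of labels to a different class, chosen uniformly.

    Requires num_classes >= 2 so that an alternative label exists.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_ratio * len(noisy))
    # Pick distinct sample indices so exactly n_flip labels change.
    for i in rng.sample(range(len(noisy)), n_flip):
        others = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy


# Toy dataset: 100 samples over 10 classes, 20% symmetric label noise.
clean = [i % 10 for i in range(100)]
noisy = inject_label_noise(clean, num_classes=10, noise_ratio=0.2)
print(sum(a != b for a, b in zip(clean, noisy)))  # 20
```

Excluding the original class from the candidates guarantees that the realized noise ratio equals the nominal one.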

Q3: Mention of PET-based CL works that do not maintain task-specific parameters

A3: Thank you for your helpful suggestion. We will add the necessary details in the revision.


Comment

Dear reviewer,

Thank you once again for your insightful and constructive feedback on our manuscript and rebuttal. We truly appreciate the time and effort you have dedicated to reviewing our work and engaging in the discussion.

We have carefully considered each of your latest comments and have provided detailed responses. Following our discussion, reviewer b4of has raised the score from 2 to 4. We sincerely hope that our efforts also adequately address your concerns and contribute positively to your evaluation.

Since the author-reviewer discussion period has been extended until Aug 8, we are fortunate to have sufficient time to collaboratively discuss and refine the manuscript. We would greatly appreciate any additional feedback you may have. If you have any additional questions or require any clarifications, please do not hesitate to reach out to us.

Comment

Dear authors,

Thank you for taking the time to address my questions and provide additional experimental results. I found the extension of your method to the blurry setting particularly interesting. The results under the noisy label setting also provide further evidence supporting the effectiveness of the proposed approach.

Overall, I am convinced that the method is technically sound, and I have no further technical questions at this stage.

However, due to the increased inference time, the introduction of several algorithm-specific hyperparameters, and the limited theoretical depth, I have decided to maintain my initial rating. That said, I acknowledge the strengths of the paper highlighted by other reviewers and would be comfortable with the paper going either way.

Comment

Dear Reviewers,

Thank you all for your efforts. Please check the authors' rebuttal to see whether your original concerns have been addressed, and whether you have any follow-up questions for the authors.

Dear Authors: Please engage with our Reviewers during this discussion period.

Thanks a lot.

Final Decision

This paper proposes Hybrid Re-matching with Parameter-Efficient Tuning (HRM-PET) for rehearsal-free continual class-incremental learning. The idea of integrating direct and confidence-based re-matching is technically sound and addresses an important challenge. Reviewers appreciated the clarity of the experiments, the practical utility of the approach, and the detailed rebuttal, which helped resolve most of the initial questions.

At the same time, some concerns remain: increased inference time, additional algorithmic hyperparameters, and the lack of deeper theoretical justification, as well as the writing that could be further refined. While the rebuttal alleviated several issues, these points were not fully resolved.

Overall, this paper received 1 Accept, 2 Borderline Accept, and 1 Borderline Reject. The reviews were split: some upgraded to an acceptance recommendation after rebuttal, while others maintained reservations. Balancing the strengths and remaining weaknesses, this paper makes a meaningful empirical contribution to continual learning, though with limitations. This meta-review agrees with both the positive and the negative aspects pointed out by the reviewers, and leans towards accepting this submission. The authors are encouraged to carefully address all concerns in their revision.