PaperHub

NeurIPS 2024 · Spotlight · 4 reviewers
Overall rating: 6.3/10 (lowest 5, highest 7, standard deviation 0.8)
Individual ratings: 7, 7, 5, 6
Confidence: 3.8 · Soundness: 2.8 · Contribution: 2.5 · Presentation: 3.0

Saliency-driven Experience Replay for Continual Learning

Links: OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-08

Abstract

Keywords
Continual Learning, Transfer Learning, Saliency Prediction

Reviews and Discussion

Review

Rating: 7

Inspired by neurophysiological evidence, the authors propose a novel method for online continual learning, dubbed SER, which utilizes visual saliency to modulate the classification network so as to alleviate catastrophic forgetting.

Specifically, the network architecture follows a dual-branch design where a saliency prediction network and a classification network are trained collaboratively during continual learning.

The saliency-driven modulation is simply implemented by an element-wise multiplication of intermediate outputs from two branches.
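
To make the modulation concrete, here is a minimal sketch (our illustration, not the authors' code), assuming the two branches emit feature maps of identical shape, as their shared architecture guarantees:

```python
import torch

def modulate(cls_feat: torch.Tensor, sal_feat: torch.Tensor) -> torch.Tensor:
    """Element-wise modulation of classification features by saliency features.

    Both tensors are assumed to be (B, C, H, W) outputs of the same stage of
    the two architecturally identical branches.
    """
    return cls_feat * sal_feat

# Toy example with ResNet-18-like stage outputs:
cls_feat = torch.randn(8, 128, 28, 28)  # classification branch
sal_feat = torch.randn(8, 128, 28, 28)  # saliency branch, same stage
out = modulate(cls_feat, sal_feat)      # fed to the next classification stage
```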

The comparison experiments demonstrate that the proposed SER module achieves SOTA results when combined with existing methods.

The ablation studies further investigate other modulation strategies and demonstrate the robustness of SER against spurious features and adversarial attacks.

Strengths

(1) The manuscript is well-organised and the writing is good, making the proposed method easy to understand.

(2) The proposed SER method is simple in design but effective under various OCL settings.

Weaknesses

(1) The novelty is incremental. A prior work [1] has introduced saliency into class-incremental learning and shares a similar motivation: low-level visual saliency is stable during continual learning, which could help reduce forgetting.

(2) The claim that “SER is model-agnostic and can be used in combination to any continual learning method” (page 2, line 70) may not be suitable. SER in its current version is not model-agnostic, since the saliency encoder has to maintain a homogeneous structure with the classification network. In addition, SER may not be compatible with some methods like L2P [2], so it is not appropriate to claim it can be combined with ‘any’ continual learning method.

(3) Some typos should be corrected. On page 8, line 306, the punctuation between ‘training data’ and ‘but’ should be a comma (‘,’), not a period (‘.’). In the appendix, on page 16, lines 566 and 567, a closing bracket ‘)’ is missing.

References

[1] Liu, X., Zhai, J.-T., Bagdanov, A. D., et al. Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23954-23963.

[2] Wang, Z., Zhang, Z., Lee, C.-Y., et al. Learning to Prompt for Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 139-149.

Questions

Based on the weaknesses of the manuscript, my questions are listed below:

Q1. What is the major difference in novelty, motivation, and contribution from prior efforts that also utilize saliency in continual learning?

Q2. Is there any evidence that SER remains effective when the backbone is changed to architectures other than ResNet-18?

Limitations

As the authors note, one limitation is that the saliency encoder has to maintain a homogeneous structure with the classification network, which may potentially hinder its real-world application.

Author Response

We thank the reviewer for their insight. We first address the major weaknesses (indicated with W1 and W2) identified by the reviewer and then respond, point by point, to their raised questions (indicated with Q1 and Q2). We will also review the whole paper to correct the identified typos.

W1 - Novelty

We acknowledge that prior work, specifically Liu et al. (2024), has introduced the use of saliency for exemplar-free class incremental learning. However, as mentioned in our paper (lines 96-105), Liu et al. (2024) employ a static, pre-trained saliency detector; thus, they do not demonstrate (hence do not use) the forgetting-free capabilities of saliency prediction, since it is not continuously trained. In contrast, our method continuously trains the visual saliency network, which reduces forgetting by adapting classification features to new data. Furthermore, SER provides a more flexible and generalizable saliency-classification paradigm that adapts to any dataset without external dependencies, as opposed to Liu et al. (2024), which requires a pre-trained saliency detector trained on the same data distribution as the target data. We believe these enhancements represent a significant contribution to ongoing research in class-incremental learning. Finally, Liu et al. (2024) is not suitable for online continual learning because it is trained for a large number of epochs (100) and uses 50% of the classes in the first task, distributing the remaining classes across subsequent tasks. In our online continual learning setting, where 100 classes are divided into 20 tasks, we found that it achieved very low performance (around 6%), as reported in Table R1 in the PDF attached to the global rebuttal.

W2 - Model agnostic

It is correct that SER requires the saliency encoder to share the same architecture as the classifier. However, we want to clarify that this requirement does not compromise the model-agnostic nature of our approach. Unlike methods that depend on pre-existing saliency predictors (such as the work by Liu mentioned by the reviewer), SER does not necessitate frozen saliency models, as it trains saliency prediction continuously alongside the classifier. This integrated learning strategy involves instantiating an additional instance of the classifier, attaching a decoder, and training the entire system (classifier and saliency predictor) jointly. This is possible given the forgetting-free nature of saliency prediction. This learning design ensures that SER remains adaptable to various model architectures, preserving its model-agnostic characteristic while providing an effective solution for simultaneous saliency prediction and classification.
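
As a rough illustration of this construction (a minimal sketch under our own naming and decoder design, not the authors' code), the saliency predictor can be assembled from a fresh copy of the classifier's backbone plus a small decoder:

```python
import copy
import torch.nn as nn

class SaliencyPredictor(nn.Module):
    """Sketch: saliency branch built from a copy of the classifier's backbone.

    The decoder below is a hypothetical placeholder; the point is only that
    the encoder is architecturally identical to the classifier, so the two
    branches produce feature maps with matching geometry.
    """
    def __init__(self, classifier_backbone: nn.Module, feat_channels: int):
        super().__init__()
        self.encoder = copy.deepcopy(classifier_backbone)  # homogeneous structure
        self.decoder = nn.Sequential(                      # regress a saliency map
            nn.Conv2d(feat_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```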

As for the compatibility of SER with various continual learning methods, the reviewer correctly points out that SER may not align well with some methods, like L2P, and this is a crucial distinction to address. Prompt learning approaches, such as L2P, operate under a different paradigm where the encoder remains static during continual learning. These methods leverage a pre-trained encoder's global knowledge of the visual world, using learnable prompts to identify the relevant regions of the large latent space that address specific tasks. This is fundamentally different from the approach taken by SER, which is tailored for scenarios where models undergo continuous training and learn from scratch. SER's design ensures that classification features are dynamically modulated by saliency features throughout the learning process. Such dynamic modulation is essential for the model to adapt continually to new tasks and visual stimuli, which is critical for maintaining high performance and robustness in ever-changing environments. Training from scratch allows the model to develop hierarchical feature representations from the ground up, closely mimicking the visual cortex's processing mechanisms.

Thus, while the reviewer is correct that SER cannot be applied to every continual learning method, it is indeed well-suited for models that are trained from scratch. Accordingly, we will reformulate the claims about the applicability of SER to every continual learning method.

Response to Questions:

Q1 - What is the novelty?

Please refer to our response to W1.

Q2 - Generalization of SER to architectures other than ResNet-18

We extended our experiments to include additional backbones such as ResNet50, MobileNet V2, and DenseNet-121 (see Table R2 of the attached pdf). In all cases, SER improves the performance of the baseline, whether the classifier C is trained from scratch or with the same weights as the encoder of the saliency predictor S (we replicate the same setting as in Table 1 of our original manuscript). This demonstrates that SER remains effective across different architectures.

Comment

After reading the rebuttal and its attachment, I find that my major concerns have been well addressed. The novelty over prior works has been clarified by the authors, and the additional experiments demonstrate that SER is compatible with more network architectures. As a result, I would like to raise my final rating to 'Accept'.

Review

Rating: 7

In this paper, the authors propose Saliency-driven Experience Replay (SER), a biologically plausible approach based on replicating human visual saliency to enhance classification models in continual learning settings. More concretely, they propose to employ auxiliary saliency prediction features as a modulation signal to drive and stabilize the learning of a sequence of non-i.i.d. classification tasks. SER is model-agnostic and can be used in combination with any continual learning method. The authors demonstrate that saliency modulation positively impacts classification performance in online continual learning settings, leading to a significant gain in accuracy (up to 20%). Furthermore, they show that saliency modulation leads to saliency-modulated features that are more robust to the presence of spurious features and to adversarial attacks.

Strengths

  • The paper is scientifically motivated, presenting a biologically plausible approach
  • The paper is clearly written and well documented
  • The related work section covers most of the relevant work
  • The experimental validation is extensive and it demonstrates the improvements introduced by the current approach

Weaknesses

  • Some aspects need to be explained more in detail

Questions

Here are my concerns:

  • In the following statement (lines 138-139): '... for instance, class labels from $D_i$ might be different from those from $D_j$, though both must belong to the same domain $Y$'. The most common setting in the continual learning literature considers the tasks $D_i$ and $D_j$ to be disjoint. According to your statement, it could be understood that they might not be. Only recently, in the reference below, did the authors consider a novel setting, 'Class-Incremental with Repetition', which opens up the possibility that the tasks are not disjoint. Do you also consider this possibility in your approach? Please provide more details on this aspect.

Hamed Hemati, Andrea Cossu, Antonio Carta, Julio Hurtado, Lorenzo Pellegrini, Davide Bacciu, Vincenzo Lomonaco, Damian Borth. Class-Incremental Learning with Repetition. CoLLAs 2023

  • Lines 323-326, where you discuss the robustness of your approach against adversarial perturbations: could you better explain the content of Figure 4? What does SAM-ER-ACE (orange line) mean? To which dataset do these curves belong: Split Mini-ImageNet or Split FG-ImageNet?
  • A similar question for Table 3: to which dataset do these values correspond: Split Mini-ImageNet or Split FG-ImageNet?

Limitations

One limitation identified in the paper relates to SER: although it is model-agnostic, its formulation requires that the saliency encoder and the classifier share identical architectures. The current work does not have any negative societal impact.

Author Response

We thank the reviewer for the provided comments. In the following we respond to the raised questions (indicated with Q1, Q2, and Q3).

Q1 - Class-Incremental with Repetition

To clarify, in our approach, the tasks $D_i$ and $D_j$ are indeed considered disjoint. We acknowledge that our current description could be interpreted as suggesting otherwise, which is not our intention. We will amend the paper to explicitly state that tasks $D_i$ and $D_j$ are mutually exclusive within the same domain $Y$. The 'Class-Incremental with Repetition' setting represents a different scenario and could potentially simplify the problem. We recognize this as an interesting direction for future testing of our SER method (and we will include it in our paper), but our current focus remains on the disjoint class-incremental setting, which poses a more challenging problem.
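
For completeness, the disjointness we commit to can be stated as follows (our formalization, not an equation from the paper, with $Y_i$ denoting the set of class labels appearing in task $D_i$):

```latex
% Disjoint class-incremental assumption (our notation).
\[
  Y_i \subseteq Y \ \ \forall i, \qquad Y_i \cap Y_j = \emptyset \quad \text{for } i \neq j .
\]
```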

Q2 - Clarification to Figure 4

We applied the Projected Gradient Descent attack ([38] in the manuscript) to the ER-ACE + SER model, referred to as SER-ER-ACE and represented by the orange line (the label SAM-ER-ACE was a typo and will be corrected in the final manuscript), on the Split Mini-ImageNet dataset. We compared its performance against the baseline ER-ACE (green line). The figure shows the accuracy drop (in %) compared to the standard training performance (values reported in Table 1 of the manuscript) as the attack intensity $\epsilon$ increases. It is evident that the model equipped with SER experiences a significantly smaller drop and better tolerates this attack.
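
For reference, a minimal sketch of this evaluation protocol (our reconstruction under standard $L_\infty$ PGD assumptions, not the authors' code; `model`, `x`, and `y` are placeholders):

```python
import torch

def pgd_attack(model, x, y, eps, alpha=None, steps=10):
    """L-inf PGD: maximize classification loss within an eps-ball around x."""
    if alpha is None:
        alpha = eps / 4.0                  # illustrative step size
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv

# The accuracy drop relative to clean accuracy is then recorded for
# increasing eps, which is what Figure 4 plots for ER-ACE and SER-ER-ACE.
```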

Q3 - Clarification to Table 3

The values presented in Table 3 of the manuscript correspond to our custom benchmark using MiniImageNet. Specifically, we crafted this benchmark to evaluate the robustness of the SER strategy against spurious features and adversarial attacks. In this benchmark, we selected the first ten classes from MiniImageNet, organized into 5 tasks with 2 classes per task, and introduced spurious features by modifying the brightness of the training images with a class-dependent offset. The test images remained unaltered.
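
A minimal sketch of such a class-dependent brightness shift (the offset magnitude below is illustrative; the exact values used in the benchmark are not reproduced here):

```python
import torch

def add_class_brightness(images: torch.Tensor, labels: torch.Tensor,
                         step: float = 0.05) -> torch.Tensor:
    """images: (B, C, H, W) in [0, 1]; labels: (B,) integer class ids 0..9.

    Each class receives a fixed brightness offset at training time only,
    creating a spurious brightness-class correlation; test images are
    left unaltered.
    """
    offsets = labels.float().view(-1, 1, 1, 1) * step  # class-dependent offset
    return (images + offsets).clamp(0.0, 1.0)
```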

Regarding the reviewer’s concern about the requirement for the saliency encoder and classifier to have the same architecture, we would like to clarify that this does not compromise the model-agnostic nature of our approach. The saliency predictor in the SER strategy does not need to be an external model; it can be built by creating an additional instance of the classifier, attaching a decoder, and jointly training it with the classifier. Thus, the architecture alignment is not a limitation, as the saliency predictor can be directly constructed from the continually trained classifier.

Comment

I want to thank the authors for addressing all my concerns.

Review

Rating: 5

This paper draws inspiration from neurophysiological evidence and proposes a biologically inspired saliency-driven modulation strategy named SER to mitigate catastrophic forgetting in online continual learning. SER works by regularising classification features via predicted saliency and comprises a classification encoder and a saliency encoder deployed in parallel. The proposed method produces superior results on image classification in both class-incremental and task-incremental settings compared to other biologically inspired or attention-based solutions. Besides, it has also been shown to be more robust to spurious features and adversarial attacks.

Strengths

  1. The paper is well-written and easy to follow, with well-justified motivation and nice visuals.
  2. The idea of using forgetting-free saliency to combat forgetting in online continual learning is very simple and easy to implement. It has also been shown to be effective, as the proposed method improves different existing methods by notable margins under different settings.
  3. In addition to task performance, the authors examined the proposed method's computational cost, which suggests that SER is deployment-friendly.
  4. The authors also made a good attempt to delineate their motivation from a biological and neurophysiological perspective.

Weaknesses

  1. The authors conduct all experiments on ResNet-18 models. It is not clear if the proposed method generalizes to other models as well. Since the authors claimed their method's generalization property as an advantage, experiments on different datasets, tasks (e.g., segmentation, detection), and, most importantly, different architectures (such as ViT) are missing.
  2. Given 1), the authors fail to present results or investigation on cases where a) both the saliency encoder and the classifier encoder are ViTs, or b) one of them is CNN-based and the other is transformer-based. This weakness is also acknowledged in the last sections of the manuscript.
  3. The method can be explained in greater detail. Currently, the authors spend about 1 page describing their own method out of 9 pages of the main text. While this is also due to the method's simplicity, I would suggest expanding this part by adding more detailed descriptions.

Questions

  1. In Figure 3, it is not clear what “SAM” refers to. Also, in Figure 4, why is the proposed method dubbed “SAM”?
  2. How sensitive is SER’s performance to other hyperparameters involved? More ablative experiments on the hyperparameters are beneficial.
  3. The authors argued that SER generalizes to networks of different architectures. It would be interesting and perhaps important to back their claim with experiments on Transformer models.
  4. The authors limited the scope of their problem from the very beginning of the paper to image classification tasks. The authors are encouraged to explain whether their method works for tasks such as semantic segmentation and object detection and whether these tasks are considered in prior works.

Limitations

Yes

Author Response

We thank the reviewer for their insight. We first address the weaknesses (indicated with W1, W2, and W3) identified by the reviewer and then respond, point by point, to their raised questions (indicated with Q1, Q2, Q3, and Q4).

W1 - Generalization to other models: To address this concern, we have extended our experiments to include additional backbones beyond ResNet-18. Specifically, we have tested our SER strategy with ResNet-50, MobileNet V2, and DenseNet-121. As reported in Table R2 in the PDF attached to the global rebuttal, in all cases, our SER approach leads to improved performance, thereby demonstrating its effectiveness across various architectures. Regarding the ViT backbone, please refer to our next response (to W2) where we provide both conceptual and practical reasons for excluding ViT from our analysis.

W2 - Generalization to transformer models: SER focuses on emulating visual cortex mechanisms, particularly object recognition via selective attention, which aligns more closely with Convolutional Neural Networks (CNNs) due to their similarity to the visual cortex’s hierarchical and localized processing. CNNs have local receptive fields and hierarchical stages similar to those in the primate visual cortex, as detailed by Hubel et al. (1962) and DiCarlo et al. (2007). Empirical studies show CNNs’ effectiveness in predicting neural responses and modeling visual encoding in primates (Yamins et al., 2016; Cadena et al., 2019), reinforcing their suitability for selective attention and low-level visual processing.

While Vision Transformer (ViT) excels in capturing global context and dynamic attention, it does not align with the hierarchical, localized processing of CNNs. Additionally, ViT requires large amounts of data and extensive training. Our experiments showed that ViT achieved only 1% accuracy on the Split-MiniImageNet benchmark, regardless of whether trained for 1 epoch or 50 epochs, underscoring its limitations in the context of online continual learning.

Hybrid solutions that combine CNN-based and Transformer-based models pose significant challenges due to the need for aligned semantic representation of features. SER relies on modulating classification features with saliency features. This modulation requires that the semantic representation of these features be coherent and aligned. Using hybrid solutions can disrupt this alignment, as CNNs and Transformers process and represent features differently.

Thus, our preference for CNNs over ViT and hybrid models is driven by the need to closely mimic visual cortex processing and by the empirical evidence of ViT's low performance in OCL. We will revise our claims on SER's applicability to continual learning methods.

  • Hubel et al., 1962. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” Journal of Physiology.
  • DiCarlo et al., 2007. “Untangling invariant object recognition,” Trends in Cognitive Sciences.
  • Yamins et al., 2016. “Using goal-driven deep learning models to understand sensory cortex,” Nature Neuroscience.
  • Cadena et al., 2019. “Deep convolutional models improve predictions of macaque V1 responses to natural images,” PLoS Computational Biology.

W3 - More detailed description of the method: We acknowledge the request for a more detailed description and will provide additional details in the revised manuscript.


Responses to Questions:

Q1 - SAM: It was a typo that will be corrected in the final version of the paper.

Q2 - Sensitivity of SER to hyperparameters: The SER hyperparameters include: 1) the $\lambda$ term that weights the loss component in Equation 3 of the manuscript, 2) the various saliency modulation strategies (ranging from saliency provided as input to the model, to layerwise summation, to layerwise multiplication), and 3) the specific layers to which SER modulation is applied. The latter two aspects have been thoroughly evaluated in the paper (see Fig. 3 and Table 2). Regarding the $\lambda$ term, we tested several values ($\lambda = 0.1, 0.5, 1, 1.5, 2$). Although the model's performance was relatively stable across these values, the best results were achieved with $\lambda = 1$. Additionally, the PDF attached to the global rebuttal details the various tested hyperparameters (Table R3) for the methods reported in Table 1 of the manuscript.
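
For reference, the sweep above implies a standard weighted-sum objective of the following generic shape ($\mathcal{L}_{\text{sal}}$ is our placeholder name for the saliency loss term; the exact definition is Equation 3 of the manuscript, not reproduced here):

```latex
% Generic shape of the joint objective implied by the lambda sweep.
\[
  \mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda \, \mathcal{L}_{\text{sal}},
  \qquad \lambda \in \{0.1,\ 0.5,\ 1,\ 1.5,\ 2\}.
\]
```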

Q3 - Generalization to transformer models: Please refer to our response to W2.

Q4 - Limited scope to image classification: Our approach specifically addresses the problem of class-incremental learning (CIL) that targets image classification within continual learning settings. This particular focus aligns with a substantial body of existing literature and remains an unsolved challenge in the field. Indeed, for class-incremental continual learning, there are well-established frameworks such as Mammoth and Avalanche and widely recognized benchmarks, which we have employed in our work.

While dealing with continuous training for tasks like semantic segmentation and object detection is an important and related area of research, it falls under other broader categories (e.g., incremental object detection - Liu et al. 2023 - and incremental semantic segmentation - Michieli et al. 2019). These tasks have different baselines, methods, and benchmarks. Expanding our approach to include these other tasks would require a separate and comprehensive investigation.

However, class-incremental learning can be adapted to object detection, for instance, in the region proposal step, while it is less trivial to use in semantic segmentation. Nonetheless, we acknowledge the importance of extending continual learning methodologies to other tasks and consider it a valuable direction for future research.

  • Liu et al., 2023. Continual Detection Transformer for Incremental Object Detection. CVPR 2023.
  • Michieli et al., 2019. Incremental learning techniques for semantic segmentation. ICCV 2019.

Review

Rating: 6

The paper proposes to use saliency prediction features as guidance to stabilize training in online continual learning settings. The method is motivated by the observation that saliency detection remains stable as training proceeds continually over new tasks. The proposed SER method is model-agnostic and improves significantly over baseline methods. Interestingly, the method enables training of models that are more robust to adversarial attacks.

Strengths

  1. The paper has insightful experiments to form the motivation. The distinction between saliency maps and attention maps is helpful.
  2. The method explanation and illustration are clear and to the point.
  3. The experiments (combining with different methods) and extensive ablations are appreciable.
  4. The analysis showing robustness of the model adds more value to the method.

Weaknesses

  1. Fig. 1 caption: “activation maximization maps via GradCAM, which are prone to catastrophic forgetting due to their dependence on the classifier”: I am not sure this statement is true; it should be discussed in more detail. Existing attribution methods like GradCAM and Integrated Gradients can also be computed for different layers of the network, not only for the classifier. It would be good to investigate how methods like LayerIntegratedGradients [1] behave here; are they still dependent on the classifier? Layer attributions have also been explored for continual learning [2] recently. While the authors discuss methods using attention maps for future replay, it is also relevant to discuss methods like [2], which uses attributions for weight transfer/transfer learning.

  2. It would be good to include more recent methods in the comparison. The compared OCL methods, although important baselines, are not very recent ones.

[1] https://captum.ai/api/layer.html

[2] Goswami, Dipam, et al. "Attribution-aware weight transfer: A warm-start initialization for class-incremental semantic segmentation." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.

Questions

See weaknesses

Limitations

Addressed in paper.

Author Response

We thank the reviewer for the provided comments. We here address point by point the weaknesses (indicated with W1 and W2) identified by the reviewer.

W1 - Activation maps per layer

We have tested GradCAM at different layers of the network, as well as LayerIntegratedGradients, as suggested. Our results, shown in Figures R1 and R2 of the attached PDF to the global rebuttal, reveal that activation maps from deeper classification layers, as well as intermediate and lower layers, degrade as the model is trained continuously. This indicates that catastrophic forgetting affects not just the classifier, but multiple layers of the network. Using LayerIntegratedGradients, we observed that the features learned by the network tend to change from task to task. This change leads to the forgetting of initially learned classes and concepts, demonstrating that this method also depends on the classifier and is susceptible to forgetting.
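
For concreteness, a per-layer attribution probe of this kind can be sketched with Captum as follows (our illustration with a stand-in model and batch; the actual continual-learning checkpoints and data are those of the rebuttal PDF):

```python
import torch
from torchvision.models import resnet18
from captum.attr import LayerIntegratedGradients

model = resnet18(num_classes=100).eval()  # stand-in for a continually trained net
inputs = torch.randn(4, 3, 224, 224)      # placeholder image batch
labels = torch.randint(0, 100, (4,))

# Attribute predictions to each residual stage; comparing these maps across
# task checkpoints reveals whether intermediate features drift (forgetting).
for name in ["layer1", "layer2", "layer3", "layer4"]:
    lig = LayerIntegratedGradients(model, getattr(model, name))
    attrs = lig.attribute(inputs, target=labels, n_steps=50)
    print(name, attrs.shape, attrs.abs().mean().item())
```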

As for the significance of layer attributions in continual learning, we would like to clarify the differences between our method and the one suggested (and indicated with [2]) by the reviewer. SER is specifically designed for class-incremental learning in image classification. It employs saliency prediction techniques to modulate classification features during training. We have demonstrated that saliency prediction remains robust during training and is less prone to degradation, ensuring better performance and stability over time.

In contrast, the method described in [2] addresses class-incremental semantic segmentation (CISS) and focuses on forgetting in semantic background shift. Their approach introduces a novel classifier initialization technique that uses gradient-based attributions to identify and transfer relevant weights for new classes, specifically to address background shift in segmentation tasks.

The key difference lies in our use of saliency prediction techniques versus [2]’s use of gradient-based attribution for weight transfer. Our approach does not rely on attribution techniques, which we have shown to degrade during training, but rather on saliency prediction, which is resilient to forgetting.

However, we appreciate the opportunity to clarify these differences and we will include them in our revised manuscript, acknowledging [2]'s contribution to the field and discussing its relevance to our work.

W2 - Comparison with more recent OCL methods

We have extended our comparison to include more recent methods (see Table R1 of the PDF attached to the global rebuttal). Specifically, we have included several existing methods already integrated within the Mammoth continual learning framework (Boschini et al., 2022). This framework offers a common and consolidated benchmark for online and offline continual learning methods, which simplifies the comparison of performance and the assessment of each method's contribution.

Among the tested methods, there are very recent approaches specifically designed for Online Continual Learning (OCL), such as PEC ([R1] in the global rebuttal), and OnPro ([R13] in the global rebuttal). Results show that methods trained with the SER strategy outperform both their counterparts without SER and existing methods by several percentage points, demonstrating the effectiveness of our SER strategy.

  • Boschini et al., 2022. “Class-Incremental Continual Learning into the eXtended DER-verse,” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Comment

My concerns are addressed in the rebuttal. I appreciate the comprehensive response from the authors in clarifying the concerns and adding more recent baselines for a fair comparison. After reading all the reviews and the authors response, I improved my rating to 6.

Author Response (Global Rebuttal)

We appreciate the feedback from all reviewers and provide an overview of our responses to the major concerns raised. Detailed responses are addressed individually for each reviewer.

Reviewer U9b5

1. Activation Maps per Layer (W1): We have investigated per-layer activation maps using GradCAM and LayerIntegratedGradients. Our results, illustrated respectively in Figure R1 and Figure R2 of the attached PDF, show degradation in activation maps across layers as the model trains, indicating widespread catastrophic forgetting.

2. Comparison with Recent OCL Methods (W2): We have extended our comparison to include recent methods like PEC ([R1]) and OnPro ([R13]) within the Mammoth framework (Table R1 of the attached PDF). Results show that methods trained with the SER strategy outperform both their counterparts without SER (as already presented in the paper Table 1) and existing methods by several percentage points, validating the effectiveness of our approach also compared to recent OCL strategies.

Reviewer L6ph

1. Generalization to Other Models (W1): We extended experiments to ResNet-50, MobileNet, and DenseNet, showing that SER improves performance across multiple architectures. The exclusion of Vision Transformers (ViT) is due to its poor performance in online continual learning scenarios, when trained from scratch, where it achieved only chance-level accuracy.

2. Generalization to Transformer Models (W2): SER is designed to emulate visual cortex mechanisms, aligning with CNNs due to their hierarchical processing, supported by empirical studies. Vision Transformers, though very effective, do not fit our model's objectives due to their different processing characteristics and data needs. CNNs are more suitable for our goals of hierarchical, localized processing.

3. More Detailed Description of the Method (W3): We acknowledge the request for a more detailed description and will provide additional details in the revised manuscript.

Reviewer H211

1. Class-Incremental with Repetition (Q1): We clarify that tasks $D_i$ and $D_j$ are disjoint and will update the paper to reflect this explicitly. The 'Class-Incremental with Repetition' scenario is an interesting direction for future testing of SER, but is outside the scope of our current study, which instead targets a more challenging setting.

2. Clarification to Figure 4 and Table 3 (Q2 and Q3): Figure 4 shows the robustness of ER-ACE+SER to adversarial attacks on the Split Mini-ImageNet dataset. Table 3, instead, shows the robustness of ER-ACE+SER to spurious features using a custom benchmark constructed to enforce the presence of such spurious features.

3. Saliency Encoder and Classifier Architecture (Q4): The alignment of the saliency encoder with the classifier architecture does not compromise the model-agnostic nature of SER. The saliency predictor can be built as an instance of the classifier with an added decoder, allowing continuous training and adaptability of the two paired networks.

Reviewer oLZW

1. Novelty (W1): While Liu et al. (2024) use a static saliency detector, our method continuously trains the saliency network, which reduces forgetting and adapts to new data. SER's flexibility and generalizability, without reliance on pre-trained detectors, represent significant contributions. Moreover, Liu et al.’s method (TASS), trained for multiple epochs, did not perform well on our benchmarks (see Table R1 in the attached pdf).

2. Model Agnostic (W2): SER requires the saliency encoder to share the same architecture as the classifier, but this does not compromise its model-agnostic nature (please refer to our above response to Q4 by Reviewer H211). Unlike methods that depend on static encoders (e.g., L2P), SER dynamically trains both components, making it adaptable to various model architectures. We will adjust our claims to reflect SER's applicability to models trained from scratch.

In the following, we report the references for the methods used for comparison in Table R1 of the PDF attached to this rebuttal.

  • [R1] Zajac et al. Prediction error-based classification for class-incremental learning. ICLR 2024.
  • [R2] Liu et al. Task-adaptive saliency guidance for exemplar-free class incremental learning. CVPR 2024.
  • [R3] Riemer et al. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. ICLR 2019.
  • [R4] Chaudhry et al. Efficient Lifelong Learning with A-GEM. ICLRW 2019.
  • [R5] Wu et al. Large scale incremental learning. CVPR 2019.
  • [R6] Benjamin et al. Measuring and regularizing networks in function space. ICLRW 2019.
  • [R7] Lopez-Paz and Ranzato. Gradient episodic memory for continual learning. NIPS 2017.
  • [R8] Prabhu et al. GDumb: A simple approach that questions our progress in continual learning. ECCV 2020.
  • [R9] Aljundi et al. Gradient Based Sample Selection for Online Continual Learning. NIPS 2019.
  • [R10] Rebuffi et al. iCaRL: Incremental classifier and representation learning. CVPR 2017.
  • [R11] Hou et al. Learning a unified classifier incrementally via rebalancing. CVPR 2019.
  • [R12] Pernici et al. Class-incremental learning with preallocated fixed classifiers. CVPR 2021.
  • [R13] Wei et al. Online prototype learning for online continual learning. ICCV 2023.

All references to tables and figures come with the suffix R to avoid any confusion with those reported in the manuscript.

Final Decision

This paper proposes a saliency-guided approach to online continual learning. The authors propose to leverage saliency-based features that remain more stable across incremental learning tasks. Reviewer opinions were initially positive, although reviewers raised concerns over missing comparisons with state-of-the-art saliency-based approaches and the clarity/thoroughness of the technical presentation and descriptions. The authors provided a comprehensive and convincing rebuttal that addressed all major reviewer concerns. The authors are encouraged to incorporate the new comparisons and analyses provided in the rebuttal into the final version of this work.