Continual Learning in Open-vocabulary Classification with Complementary Memory Systems
A responsive, flexible, efficient complementary learning system for open-vocabulary continual learning
Abstract
Reviews and Discussion
The paper proposes a continual learning method for image classification. It addresses quite challenging setups: data-incremental, class-incremental, and task-incremental settings. The core of the method is a combination of CLIP embeddings for zero-shot prediction and a tree-based kNN exemplar model (TreeProbe). The paper presents results on a large variety of datasets and compares with several well-cited continual learning methods from the past decade. CLIP enables open-vocabulary learning (within the scope of CLIP's training set, of course), and the suggested TreeProbe offers quick training and inference. The two-model combination is inspired by a famous idea in continual learning, the "complementary learning system", which suggests that the human brain has two types of memory: a fast episodic memory (hippocampus) and a slow consolidating memory (neocortex). It has been exploited in continual learning approaches before with limited success.
Strengths
I find this paper quite strong. It tackles the very difficult problem of open-world image classification in incremental learning. Neural nets typically suffer from catastrophic forgetting, which pretty much makes incremental training pointless, and many attempts have been made to solve this problem.
Overall, the results look very promising. I am happy not to see permuted MNIST in the benchmarks.
The paper is well-written and easy to read. It achieves SOTA results, although the gap with the other CLIP-based model is pretty small. I found it convincing enough and sufficiently novel.
Weaknesses
Using CLIP for incremental learning isn't novel. TreeProbe seems like quite a simple approach; I can't believe it wasn't described before, but I couldn't find a reference.
The main drawback is that the method still relies on CLIP. While we don't know exactly what CLIP was trained on, I find it easy to believe that the 400M-sample dataset contains everything the authors used for evaluation, so in a way this is not really continual learning as we would like it to be.
Questions
The method contains a lot of moving parts around merging CLIP with the exemplar model and the TreeProbe implementation. I encourage the authors to release their code to facilitate future research.
I also suggest clearly presenting the accuracy of supervised baselines for every dataset used. While these are certainly not what the method should be compared against, it is useful to know "are we there yet?" in terms of how practical continual learning has become. It would also be useful to know the total number of classes obtained after merging all datasets.
We appreciate the encouraging comments that the paper is quite strong in tackling a very challenging problem, with very promising results and realistic experiments. We are also glad that you find the current version well-written and easy to read, though we will improve it further based on your and the other reviewers' comments. Thank you for the opportunity to respond to your comments and concerns.
References to TreeProbe: We also could not find specific references to TreeProbe, though again we note its strong relationship to the well-established idea of Lazy Learning. It seems an idea worth discovering or resurfacing, as the case may be.
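For intuition, here is a minimal sketch of the lazy-learning idea behind a tree-probe-style exemplar model: partition stored exemplar embeddings, fit a small linear probe per partition, and route each query to its nearest partition, so that incorporating new examples only requires refitting one local probe. This sketch uses generic scikit-learn components (a flat k-means partition and logistic-regression probes) purely for illustration and is not our actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

class LocalProbeSketch:
    """Lazy-learning sketch: small local probes over clustered frozen embeddings."""

    def __init__(self, n_leaves=8):
        self.n_leaves = n_leaves

    def fit(self, embeddings, labels):
        # embeddings: (N, D) frozen image features; labels: (N,) integer class ids.
        # Assumes each leaf ends up containing examples of at least two classes.
        self.partition = KMeans(n_clusters=self.n_leaves, n_init=10).fit(embeddings)
        self.probes = {}
        for leaf in range(self.n_leaves):
            idx = np.where(self.partition.labels_ == leaf)[0]
            self.probes[leaf] = LogisticRegression(max_iter=1000).fit(
                embeddings[idx], labels[idx])
        return self

    def predict_proba(self, query):
        # Route the query embedding to its nearest leaf and use that leaf's probe.
        leaf = int(self.partition.predict(query[None])[0])
        probe = self.probes[leaf]
        return dict(zip(probe.classes_, probe.predict_proba(query[None])[0]))
```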
Dependence on CLIP: We agree that "zero-shot" loses some of its original meaning when using models that have been trained on Internet-scale data. However, the ability to handle arbitrary label sets using language-image embeddings is certainly useful, and the Radford et al. CLIP paper investigates (Sec. 5) and demonstrates (Fig. 17) that overlap with datasets accounts for less than 1% of zero-shot test accuracy. The relative performance between the original CLIP model(s) and our method is informative and controls for any data leakage that CLIP may exploit.
Code release: If accepted, we will release our code for our method and evaluation well before the conference.
Supervised baselines: We will extend Table 2 in the supplementary material with per-task linear-probe and fine-tuned baselines. We will also add the total number of classes (1034) and the performance of a model fine-tuned on all data.
Dear Reviewer,
The author has provided responses to your questions and concerns. Could you please read their responses and ask any follow-up questions, if any?
Thank you!
Thank you for your response; these are fair points. It is plausible that the CLIP dataset doesn't affect zero-shot performance that much, although Fig. 17 shows that it does for some datasets.
The approach enables adaptable and efficient continual learning for open-vocabulary image classification, drawing inspiration from the complementary learning systems of human cognition. In this work, the authors merge predictions from a CLIP zero-shot model and an exemplar-based model, weighting them by the zero-shot estimated probability that a sample's class is within the exemplar classes. Inspired by lazy learning principles, the authors introduce a "tree probe" method, which facilitates rapid learning from new examples with accuracy comparable to batch-trained linear models.
Strengths
[1] The problem is interesting: predicting over an open-set vocabulary while following the continual learning setting.
[2] Results are evaluated over various datasets in diverse settings.
[3] The "tree probe" is interesting, as it balances rapid learning and performance.
Weaknesses
[1] There are various settings in continual learning (task-incremental, class-incremental, data-incremental, etc.) and zero-shot learning (generalized/non-generalized). The exact experimental setting, evaluation strategy, and motivation of each setting are not clear. It is difficult to follow Section 4.2; there should be a better illustration of, and discrimination between, the various evaluation scenarios.
[2] A few recent works [1,2] follow a similar setting. These works can be considered as baselines along with the CLIP zero-shot model.
[3] I believe that adding the problem setting before Section 3 (Method), with proper notation, will increase readability.
[4] Recent prompting-based continual learning approaches [3,4] leverage strong pretrained models and show promising results for continual learning without a complementary memory system. If the model leverages a prompting-based approach instead of exemplar storage, how does it behave?
References:
[1] Unseen Classes at a Later Time? No Problem, CVPR 2022.
[2] Meta-Learned Attribute Self-Gating for Continual Generalized Zero-Shot Learning, arXiv 2021.
[3] DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning, ECCV 2022.
[4] CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning, CVPR 2023.
Questions
Please refer to the weakness section.
We appreciate the encouraging comments that the problem and the tree-probe approach are interesting, and that the experimental validation is diverse. Thank you for the opportunity to address your concerns.
Experimental settings: We agree to clarify and add detail about the experimental settings in the main paper (Sec 4.1, 4.2) and the supplemental material (S-3). We also clarify below and in our response to Reviewer ShHL.
- Our experimental settings are designed to represent different practical requirements of continual learning. In each case, labeled training examples are received in batches.
- Data incremental (Fig 2a): the batches are uniformly randomly sampled, reflecting a case when data is received, e.g. from users, and labeled.
- Class incremental (Fig 2b): each batch contains all samples of a particular category, reflecting a case where the developer intentionally finds and labels examples of new classes to increase capability.
- For data- and class-incremental, experiments are done individually for each task, and accuracy is averaged across tasks. The candidate label set for a task is constant, so accuracy is measured over all classes, even for classes that do not yet have training data.
- Task incremental (Fig 2c): A single system is trained for each task sequentially, and test accuracy is averaged across all tasks at each stage.
- Flexible inference (Fig 2d): In the task incremental setting, we also measure performance when the candidate label set does not match provided examples (zero-shot, averaging over three held-out tasks), or is the union of all tasks’ candidate labels and labels from zero-shot tasks (union + zero-shot), or is a subset of all candidate labels and labels from zero-shot tasks (mix+zero-shot). These are highly challenging scenarios that require retaining the ability to predict over arbitrary label sets, as well as the label sets that correspond to training examples.
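For concreteness, the sketch below shows one way the candidate label sets for these flexible-inference evaluations could be assembled. The function and parameter names (e.g. `candidate_labels`, `subset_frac`) are illustrative and not taken from our actual code.

```python
import random

def candidate_labels(setting, task_labels, zero_shot_labels, task=None,
                     subset_frac=0.5, seed=0):
    """Assemble one candidate label set for a flexible-inference evaluation.

    task_labels:      dict mapping each trained task name -> list of its labels
    zero_shot_labels: dict mapping each held-out task name -> list of its labels
    """
    rng = random.Random(seed)
    zs = [l for labels in zero_shot_labels.values() for l in labels]
    union = [l for labels in task_labels.values() for l in labels]
    if setting == "zero-shot":          # labels of one held-out task only
        return list(zero_shot_labels[task])
    if setting == "union+zero-shot":    # all trained labels plus zero-shot labels
        return union + zs
    if setting == "mix+zero-shot":      # a subset of trained labels plus zero-shot labels
        return rng.sample(union, int(subset_frac * len(union))) + zs
    raise ValueError(f"unknown setting: {setting}")
```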
Paper organization: We will consider moving the problem definition before Sec 3, as suggested.
References: Thank you for these references, which we will incorporate. Although [1,2] have similar goals, they use attribute-based representations, while we build on language-image representations. As mentioned to Reviewer ShHL, reported results in those papers and in the Radford et al. paper show that CLIP achieves much higher zero-shot accuracy than these methods reach even after training, which we confirmed in early testing as well: under the generalized zero-shot learning setting, which requires mapping learned representations to a union of seen and unseen labels, CLIP obtains seen and unseen accuracies of 91.7 and 94.7 on the AwA2 dataset, respectively, largely surpassing the results (seen: 60.2, unseen: 77.1) obtained by [B] after training.
Discussion of prompt-based methods: We agree that prompt-based methods are an important direction to explore in future work. Currently, such methods are suited mainly to the task-incremental setting due to the large computation required. More generally, our AIM approach to combining systems can apply to any specialized model, so larger and more heavily tuned models, such as prompt-based models, will predictably improve accuracy (at the cost of slower training). Table 2 shows that using a large CLIP model, for instance, increases target-task accuracy from 80.5 to 90.2.
Dear Reviewer,
The author has provided responses to your questions and concerns. Could you please read their responses and ask any follow-up questions, if any?
Thank you!
The paper tackles the new problem of continual learning in open-vocabulary classification, where the model can update its knowledge based on incoming new samples while preserving zero-shot capability. To achieve this, this work proposes to combine a CLIP model, which has zero-shot ability, with exemplar-based learning, which stores additional training samples as exemplars for continual learning. For the exemplar-based learner, the authors introduce a tree-probe algorithm that improves upon kNN to increase accuracy. The paper provides an interesting analogy between CLIP & instance-based learning and the fast & slow learning systems in human development. To combine these two different learning paradigms, the proposed work designs a fusing prediction formula that averages the probability or embedding predictions of the two models. Finally, an Adaptive Instance Marginalization module is trained to estimate the probability of a test sample belonging to the exemplar set, to further boost performance. The paper conducts experiments on CIFAR100, SUN397, FGVCAircraft, EuroSAT, OxfordIIITPets, StanfordCars, Food101, Flowers102, ImageNet, UCF101, and DTD.
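Conceptually, the fusion described in this summary can be sketched as a mixture of the two models' class probabilities weighted by the estimated probability that the test sample's class is covered by the exemplar set. This is an illustrative sketch only; the paper's exact formula and the trained AIM module are not reproduced here.

```python
import numpy as np

def fuse_predictions(zeroshot_probs, exemplar_probs, p_in_exemplar):
    """Blend zero-shot and exemplar-model class probabilities.

    zeroshot_probs: (num_candidates,) probabilities from the CLIP zero-shot model.
    exemplar_probs: (num_candidates,) probabilities from the exemplar model,
                    with zeros for candidate labels it has never seen.
    p_in_exemplar:  scalar estimate that the true class is within the exemplar
                    classes (the quantity the AIM module is trained to predict).
    """
    fused = p_in_exemplar * exemplar_probs + (1.0 - p_in_exemplar) * zeroshot_probs
    return fused / fused.sum()
```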
Strengths
- Using the fast & slow learning systems from human learning to describe the combination of zero-shot and exemplar-based learning is inspiring.
- The problem of continual learning in open-vocabulary classification is interesting and can have practical applications.
- The paper provides detailed experiments on multiple datasets.
Weaknesses
- The paper is hard to follow. Specifically, the proposed setup is vaguely described. For example, in the literature on open-vocabulary learning, there are only base and target classes [A,B]. However, the paper keeps mentioning zero-shot and target tasks without a clear explanation of what a zero-shot class is (in open-vocabulary learning).
- The proposed idea is highly similar to continual zero-shot learning work, where the goal is also to update the model with new samples while maintaining zero-shot performance [B]. It seems that the main difference is the use of the CLIP encoder, which boosts the model's zero-shot capability, but the core idea of continuously updating a zero-shot model is similar. However, there is no discussion of or comparison with these prior works.
- The reviewer also has doubts about the effectiveness of the proposed method, as on average task performance the model only improves by 0.5% over the strong baseline ZSCL (as reported in Table 1 of the main paper). Moreover, based on Table 4, it appears that the proposed method doesn't perform well on the fine-grained classification tasks of Flowers, Cars, and EuroSAT. Thus, the reviewer is not confident that the proposed method advances the continual open-vocabulary classification task.
[A] Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling, Huynh et al., CVPR 2022.
[B] Class Normalization for (Continual)? Generalized Zero-Shot Learning, Skorokhodov et al., ICLR 2021.
Questions
- Can the authors clearly specify the continual learning setup as well as the terminology (zero-shot tasks) in the manuscript? This would significantly improve the paper's readability.
- Sufficient discussion should be provided between the proposed work and the continual zero-shot learning literature.
- Can the authors justify the modest improvement of 0.5% over the SOTA?
We appreciate the encouraging comments that the direction is inspiring, the problem is interesting and practical, and the experiments are detailed. Thank you for the opportunity to address your concerns.
Experimental settings: We agree to clarify and add detail in the main paper (Sec 4.1, 4.2) and supplemental (S-3) about experimental settings. There are two branches of "open vocabulary" in the literature, one based on language embeddings and one based on attribute-category relationships. We clarify our meaning below and in response to Reviewer aTSX.
- Task: A task is defined by an image distribution and a set of candidate labels. The model’s input is an image and a set of candidate labels (without an explicit task identifier), and the output is one of the candidate labels.
- Target task: A task for which at least some training examples are received. We aim to improve performance on the original CLIP model for these tasks.
- Zero-shot task: A task for which no training examples are received. Prediction on zero-shot tasks is enabled using vision-language embeddings. We aim to retain performance from the original CLIP model on these tasks.
References: Thank you for these references. We'll incorporate them and others to improve our related work section. A major difference from many "continual zero-shot learning" or "generalized zero-shot learning" methods is that they, such as [B], rely on attribute-based representations, while we use image and language embeddings to predict novel/unseen labels. In reported results and our own tests, CLIP has much higher zero-shot accuracy (e.g., for the aPY and SUN datasets) than the continual attribute-based methods achieve even after training, supporting the idea of building on vision-language models for open-vocabulary image classification. In our own test, under the generalized zero-shot learning setting, which requires mapping learned representations to a union of seen and unseen labels, CLIP obtains seen and unseen accuracies of 91.7 and 94.7 on the AwA2 dataset, respectively, largely surpassing the results (seen: 60.2, unseen: 77.1) obtained by [B] after training.
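For reference, a simplified sketch of this check is shown below: scoring CLIP zero-shot predictions against the union of seen and unseen AwA2 class names. It assumes the public openai/CLIP package and a generic prompt template; dataset loading and the seen/unseen split are omitted, and the variable names are illustrative rather than from our code.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_accuracy(images, labels, class_names):
    # images: list of PIL images; labels: indices into class_names (the seen+unseen union)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        correct = 0
        for img, y in zip(images, labels):
            img_feat = model.encode_image(preprocess(img).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            correct += int((img_feat @ text_feat.T).argmax(dim=-1).item() == y)
    return correct / len(images)

# seen_acc   = zero_shot_accuracy(seen_images,   seen_labels,   seen_names + unseen_names)
# unseen_acc = zero_shot_accuracy(unseen_images, unseen_labels, seen_names + unseen_names)
```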
Comparison to ZSCL: We consider our performance with respect to the most closely related SotA ZSCL to be highly encouraging. Here’s why:
- As shown in Table 3, our linear probe outperforms by 1.2% in Transfer, 1.7% in Avg, and 2.4% in Last performance. Our TreeProbe method outperforms by 1.2% in Transfer, 0.5% in Avg, and 1.9% in Last.
- The Transfer performance reflects nearly perfect retention of zero-shot capability, which is also demonstrated in other experiments, while the forgetting experienced by ZSCL and WiSE-FT will tend to grow as more is learned due to their weight tuning strategies.
- Simplicity/robustness: Our method outperforms ZSCL on their own test setup, despite the following:
- Their method fine-tunes; ours does not fine-tune.
- They tune learning rate parameters for each task individually; we do no hyperparameter tuning.
- Their method requires a large auxiliary dataset (ImageNet + ConceptualCaptions); we require no additional data.
Performance on some tasks: For particular tasks that especially benefit from fine-tuning, our system may underperform, but this can likely be mitigated in future work that explores developing stronger exemplar-based models while still limiting the computation required to incorporate new examples.
Dear Reviewer,
The author has provided responses to your questions and concerns. Could you please read their responses and ask any follow-up questions, if any?
Thank you!
The paper proposes a method for continual learning in an open-vocabulary setting for image classification. The method combines a complementary learning system with a tree-probe approach to enable fast learning. Given that major concerns from reviewers remain unresolved even after the rebuttal, the AC recommends rejection. Strengths and weaknesses are summarized below.
Strengths:
- The paper is well-organized.
- Comprehensive experiments are conducted on a wide range of datasets.
Weaknesses:
- The method uses a CLIP model pre-trained on internet-scale data. This defeats the purpose of continual learning, as it remains unclear whether the supposedly unseen data was already learned by CLIP beforehand. Though Reviewer ijNK gives a score of 8, the reviewer's concerns remain.
- Reviewer ATsK points out the lack of comparisons with existing prompt-based continual learning methods, which also use pre-trained models.
- Reviewer SHhL highlights that the difference between zero-shot continual learning and open-vocabulary continual learning is not clear-cut. The motivation for open-vocabulary continual learning could be made clearer.
Why Not a Higher Score
See the weaknesses above.
Why Not a Lower Score
NA
Reject