Bridging Inter-task Gap of Continual Self-supervised Learning with External Data
Abstract
Reviews and Discussion
This paper focuses on continual contrastive self-supervised learning (CCSSL), highlighting that the absence of inter-task data results in sub-optimal discrimination in continual learning. The authors propose a method that performs contrastive learning on external data as a bridge between continual learning tasks. The proposed method achieves some improvements in a plug-in manner.
Strengths
- This paper is basically well-organized and easy to follow.
- I appreciate the idea that continual learning should consider inter-task discrimination, which is limited by historical samples and underexplored in the literature. This results in a gap between ideal continual learning performance and joint training performance.
- The proposed method seems to provide plug-in improvements over continual learning baselines.
Weaknesses
- The proposed method is essentially a straightforward extension of contrastive learning with external data, which limits its novelty and technical contribution.
- As acknowledged by the authors, the similarity of the external data to the continual learning tasks is highly relevant to the performance improvements. Using relatively different/OOD data tends to provide smaller improvements. Compared with the large amount of external data in use, such improvements may not be significant enough.
- The employed external data consists of public datasets with careful pre-processing. In realistic applications, external data in the wild (i.e., without such pre-processing) may introduce additional differences and thus further limit the applicability of the proposed method.
Questions
I appreciate the motivation and key argument of this paper. However, my major concerns lie in the technical contribution and the applicability in realistic applications. Please refer to the Weaknesses.
Limitations
The authors have discussed their limitations and societal impact.
Thank you for your valuable comments. We address your questions or concerns below.
Q1: Novelty and technical contributions
A1: Thank you for your comments. However, we would like to clarify that our work is not a simple combination of contrastive learning and external data; it brings new insights to the CCSSL field. Our novelty and contributions are mainly:
- We point out the problem of ignored inter-task comparisons in CCSSL and demonstrate through many experiments (Fig. 1 Right, Table 7 in Appendix A.2.1) that this problem can exacerbate the model's confusion of inter-task classes. As reviewers oaUg and hAsz said, this problem is "significant but often overlooked" and "The finding that existing regularization-based CCSSL methods overlook inter-task discrimination is novel"; we believe that this problem is inspiring and novel. Although it has often been overlooked in previous works, it should be considered in any future work on CCSSL.
- We demonstrate the soundness and effectiveness of using external data in CCSSL, which is novel in this field. As reviewers oaUg and hAsz said, “the proposed method leveraging external data is novel”, “BGE offers a creative solution”. For a long time, there has always been a gap between replay-based and regularization-based approaches (i.e., replay-based approaches generally perform better than regularization-based approaches, but have limited applicability due to privacy, safety, and other concerns). Considering the characteristics of contrastive learning (refer to Appendix A.2.4), we propose that using external data can also play a role in improving the performance of existing regularization-based approaches while avoiding the aforementioned potential concerns.
Q2: The performance improvements in realistic applications
A2: Admittedly, the performance improvement of BGE correlates with the quality of the external datasets. However, in realistic applications, we do not have to use wild data as external data. Instead, we can choose high-quality general datasets such as ImageNet, which are already capable of handling most realistic tasks.
In our paper's experiments (Tables 1, 2, 5, 6), we artificially added a lot of OOD data to the external data to test the ability of our method in more challenging scenarios. Such scenarios are not common in real-world situations, as most domains have established relatively well-developed public datasets nowadays; therefore, our method is applicable in most real-world scenarios. In addition, even in extreme cases (where the collected external data contains many OOD samples), our OPO sampling algorithm can select external data that is more suitable for training, and the results in Table 3 of our paper demonstrate the performance improvement of the sampling algorithm when the external data contains a variety of OOD data.
Q3: More detailed discussion of external data quality
A3: To clarify the quality requirements for external data, we conduct experiments using various datasets as the external dataset. In Table 6, we show results using Internet data (CC3M), generated data (GenImage), and fine-grained data (CUB-200) as the external dataset. Although these datasets have different characteristics, they all still improve the baseline method. One of them, CUB-200, contains only various classes of birds; it can still be utilized by BGE even though it is of very low quality for our task (due to its lack of diversity). To simulate using external data in the wild, we use CC3M as the external dataset, whose images are harvested from the Internet with very simple filtering (and thus represent a wider variety of styles). Nevertheless, BGE still enhances the baseline method. We believe that even if we truly used external data in the wild with only a little filtering, the results would probably be in line with using CC3M.
In addition, the sampling algorithm we designed can also filter out OOD data to further improve the quality of external datasets. Therefore, concerns about applicability from the perspective of external data quality are less significant.
I thank the authors for their rebuttal. I carefully read all reviewers' comments, the corresponding rebuttal, and the original paper. I understand the idea of using external data to improve inter-task representation in continual learning. However, this idea is not completely novel (a lot of papers have explored this idea). The proposed method is relatively straightforward and only leads to moderate improvements. Therefore, I cannot be more positive at the current stage.
Thank you for taking the time to review our rebuttal and for your detailed assessment of our work. We appreciate your recognition of our approach to using external data to enhance inter-task representation in continual learning.
While we value your feedback, we respectfully disagree with the statement that "a lot of papers have explored this idea." To the best of our knowledge, our work is the first to investigate this approach within the CCSSL context. If there are other papers that have addressed this concept in a similar way, we would greatly appreciate it if you could provide references so we can better understand and compare our work with theirs.
Thank you again for your valuable insights. We remain committed to refining our research based on the feedback provided and hope you will reconsider our work in a positive light.
This paper finds that existing methods in continual contrastive self-supervised learning (CCSSL), a class-incremental learning scenario where the data is unlabeled, overlook contrasting data from different tasks, leading to inferior performance compared to the joint training upper bound. The authors propose to sample external data that are similar to each of the learned tasks to augment learning the current task. The self-supervised learning (SSL) objective on the union of the selected external data and the current-task data encourages the model to distinguish the current task and the learned tasks better.
The authors perform experiments with ResNet-18 and (mostly) BarlowTwins on CIFAR-100 and ImageNet-100, with a mix of other datasets as the external data. The authors find that their method, BGE, consistently improves existing CCSSL methods that do not perform inter-task discrimination. In contrast, the joint training model does not benefit from external data.
Strengths
- The finding that existing regularization-based CCSSL methods overlook inter-task discrimination and the proposed method leveraging external data are novel.
- The experiments performed in the analysis section (Sec. 4.3) provide insights into why the method works and are interesting to me, especially on whether the benefit of external data comes from positives or negatives.
Weaknesses
- The SSL method used is mainly BarlowTwins (except in Table 4, where SimCLR is used), which is not usually considered a contrastive learning method because it does not contrast the anchor with negatives. I wonder if (a) providing preliminaries in the contrastive sense (Sec. 3.1), (b) including "contrastive" in the setting name (CCSSL), and (c) arguing that OPO enforces diversity because of findings based on contrastive learning (L#191) are misleading. I think there also needs to be some intuition on how non-contrastive SSL methods like BarlowTwins help distinguish inter-task data since BarlowTwins is the one used in the experiments, and such an analysis could be very interesting.
- My general feeling about the writing is that, although the main ideas are conveyed clearly, some claims require justification, and the writing can be improved. Besides some big words ("much more meaningful" in L#69, "widely agree" in L#138, "extremely low" in L#167, etc.), please see the questions below for concerns regarding specific reasonings.
Questions
- (L#157) Could the authors provide references for privacy concerns? I wonder if the model can still encode information about the (potentially private) training data into its parameters, even without storing data in a memory buffer. External data can also be private, and the proposed method stores it in a large memory buffer (10K in the experiments). Since replay-based methods do perform inter-task discrimination by minimizing the loss over the replay buffer, I think there needs to be more justification regarding why they are not suitable here.
- (Eq. 3) Is this a weighted sampling? I assume the external buffer is much smaller than the accumulated task data and the ratio changes as you see more tasks, so a uniform sampling would oversample the current task.
- (Sec 3.3) Is diversity encouraged here because you sample the current task data uniformly one by one when selecting the closest external data? Also, why is diversity a "proxy for...future task data?" It seems to me that diversity just means that the external data estimates the in-task data distribution well.
Minor:
- (Eq. 1) It is a bit weird to say that this equation is the objective of the CCSSL setting. I think the objective is always to minimize the loss when the expectation is taken over the global distribution (i.e., not over each task's distribution). Maybe a better way is to say that it is an approximation of the objective of existing regularization-based CCSSL methods.
- (L#138-143) I think inter-task discrimination is not specific to CCSSL. In supervised continual learning (CL), there are works that discuss this problem, e.g., [1, 2]. The writing here suggests that lacking inter-task discrimination stems from the absence of labels, but it seems to me that it stems from CL itself.
- (L#52) Why is the proposed method "more generalizable and robust to OOD data?" I thought the experiments (e.g., in Tables 1 and 2) show that ID data is always the best choice for external data.
- (L#54) Why does the proposed method "not require extensive external data?"
- (L#59) "enables" -> "It enables".
- (Eq. 2) Should there be coefficients before each loss term for the equality to hold?
- (L#283) Which result is this paragraph referring to? In Table 3, I find some improvements on ImageNet100 also quite small (<1% in some cases).
- (L#496) SGD with a learning rate of 0.4 seems a bit high to me. How did the authors perform the hyperparameter search?
- [3] also finds that existing CCSSL methods do not perform inter-task discrimination (which they call "cross-task consolidation"). They also propose an optimization objective (their Eq. 8) similar to Eq. 2 in this paper. This is concurrent work, but I wonder if the authors should discuss it due to the similarity.
[1] Learning a unified classifier incrementally via rebalancing. Hou et al. CVPR 2019.
[2] A theoretical study on solving continual learning. Kim et al. NeurIPS 2022.
[3] Integrating present and past in unsupervised continual learning. Zhang et al. CoLLAs 2024.
Limitations
The authors mention that their method uses external data, which preserves privacy. One concern is that when the external data is not curated (e.g., scraped from the internet), there is a risk that it contains private or harmful information that can be learned by the model.
Another point is that the findings are limited to BarlowTwins (and SimCLR in one experiment) and regularization-based CL methods, and may not generalize.
Thank you for your valuable comments. We address your questions or concerns below.
Q1: Analysis of how BarlowTwins “contrast” during learning
A1: We treat BarlowTwins here as a generalized contrastive method. Although it does not directly contrast the anchor with negatives, [1] shows that its loss can be approximately decomposed into keeping positive pairs aligned (alignment) and pushing negative pairs apart (divergence). Specifically, after computing the cross-correlation matrix of the two views' features, BarlowTwins constrains the diagonal elements to be close to 1 (first loss term) and the off-diagonal elements to be close to 0 (second loss term). [1] indicates that the first term can be formulated as an upper bound on alignment, and the second term as an upper bound on keeping class centers apart (approximate divergence). Thus, its loss optimization has an implicit contrast effect. Therefore, BarlowTwins can also be explained by contrastive learning theory, justifying the use of the word "contrastive" in the paper. We will clarify this in future editions to avoid misunderstanding. In addition, we have evaluated some other contrastive methods, including SimCLR (Table 4, PDF Table R3Q5(a)), BYOL (Table 8), and VICReg (PDF Table R3Q5(b)).
[1] Towards the generalization of contrastive self-supervised learning. Huang W, et al. ICLR 2023.
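For concreteness, here is a minimal sketch of the BarlowTwins objective showing the two terms discussed above (our own illustrative PyTorch code, not the paper's implementation; `lam` is the usual trade-off weight):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    # z1, z2: (N, D) projector outputs of two augmented views of the same batch
    z1 = (z1 - z1.mean(0)) / z1.std(0)  # standardize each feature dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.T @ z2) / z1.shape[0]       # (D, D) cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()               # pull diagonal toward 1 (alignment)
    off_diag = (c - torch.diag(diag)).pow(2).sum()  # push off-diagonal toward 0 (divergence)
    return on_diag + lam * off_diag
```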
Q2: More explanation about privacy concerns
A2: Please see the common response.
Q3: Is this a weighted sampling?
A3: No, it is uniform sampling. We conduct a weighted sampling experiment based on "PFR, CIFAR100 4 tasks, with CIFAR10 as external data", which ensures that the data in each epoch are well-balanced across tasks. However, the result is 62.69, lower than the 64.37 achieved by uniform sampling.
We believe the reasons may be: 1. the external buffer has a limited size, so it cannot provide sufficiently diverse external data as more tasks are seen; 2. since the current task has not been learned before, learning the in-task data is crucial for the performance of the current task.
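As an illustration, uniform sampling here simply means drawing batches from the concatenation of the two data sources; the sizes below mirror the numbers mentioned in this thread (~13,000 in-task samples, 10K external budget), and the tensors are placeholders rather than our actual pipeline:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder tensors standing in for the current-task data and the
# sampled external buffer.
current_task_data = TensorDataset(torch.randn(13000, 3, 32, 32))
external_buffer = TensorDataset(torch.randn(10000, 3, 32, 32))

# Uniform sampling over the union: every sample, in-task or external,
# is drawn with equal probability (no per-task reweighting).
loader = DataLoader(ConcatDataset([current_task_data, external_buffer]),
                    batch_size=256, shuffle=True)
```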
Q4: Is diversity encouraged here because you sample the current task data uniformly one by one when selecting the closest external data?
A4: Yes, we select one closest external sample for each current-task sample, for diversity purposes.
External data do not necessarily belong to the same classes as in-task data, and their classes may also be similar to future classes. Since the feature distribution learned by contrastive learning is relatively uniform, the distribution of external data obtained by letting each in-task sample select its nearest external sample is also relatively uniform, enhancing diversity. The understanding that the sampled external data estimates the in-task distribution well is correct, but the in-task feature distribution is itself relatively uniform. If the goal of sampling were merely to proxy the in-task data, we would select the data with the most distinctive features of each class (usually selected through class prototypes in supervised learning). In contrast, the goal of diversity helps in selecting external data that may resemble future tasks, rather than exclusively best matching the old class features.
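To make this one-per-proposal selection concrete, here is a sketch (assumptions: features are precomputed and L2-normalized; the actual OPO implementation in the paper may differ in batching and tie-breaking):

```python
import numpy as np

def opo_select(task_feats, ext_feats, budget, seed=0):
    # task_feats: (n, d), ext_feats: (m, d); both L2-normalized,
    # so the dot product below is cosine similarity.
    rng = np.random.default_rng(seed)
    sims = task_feats @ ext_feats.T            # (n, m) similarity matrix
    available = np.ones(ext_feats.shape[0], dtype=bool)
    selected = []
    for i in rng.permutation(task_feats.shape[0]):
        if len(selected) == budget:
            break
        row = np.where(available, sims[i], -np.inf)
        j = int(row.argmax())                  # nearest still-available external sample
        selected.append(j)
        available[j] = False                   # each external sample is picked at most once
    return np.array(selected)
```

Because each in-task sample contributes at most one pick and picks cannot repeat, the selected set inherits the relatively uniform spread of the in-task features instead of collapsing onto a few prototypical external samples.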
Q5: Experiments with more contrastive learning paradigms
A5: We use BarlowTwins for most of our experiments because most prior works commonly include it, which makes comparison convenient. We also show results using another self-supervised learning method, BYOL, in Appendix A.2.2.
To further exhibit the generalizability of BGE, we extend BGE to SimCLR (ImageNet100 benchmarks, ImageNet-900 as the external dataset) and VICReg (CIFAR100 benchmarks, CIFAR10 as the external dataset) based on CaSSLe and FT; the experimental results are shown in PDF Tables R3Q5(a) and R3Q5(b).
Q6: Responses to minor questions
(Eq. 1, L#59, Eq. 2) Presentation questions: Thank you for your suggestions; we will revise them carefully.
(L#138-143) Presentation of inter-task discrimination: Inter-task confusion is not specific to CCSSL; it is partly caused by catastrophic forgetting, which is common to all continual learning scenarios. Our point here is that due to the lack of labels, CCSSL needs to learn by inter-sample comparisons, and the absence of inter-task comparisons exacerbates inter-task confusion.
(L#52, L#54) Presentation of our method's advantages: In the paragraph at L#52, we compare BGE with prior continual learning methods that use external data, so "more generalizable and robust to OOD data" and "not require extensive external data" are our advantages over them. Some prior methods use external data in a supervised manner and need to generate pseudo-labels, which degrade in quality with OOD data, affecting performance. Other methods require extensive data to stabilize the feature space during training; compared with them, our method requires less external data.
(L#283) OPO sampling algorithm analysis: This paragraph refers to Table 3. It is true that in some cases the improvement is not obvious.
We apologize for the confusion around the minor questions above. We will revise them in the future.
(L#496) Hyperparameter search: The learning rate in the CaSSLe and PFR code is 0.3. We follow them and search hyperparameters around that value.
Discussion of the concurrent work: [1] discovers the lack of inter-task comparisons in existing CCSSL methods concurrently with us, but solves it by simply saving old task data for comparison with new task data. We argue that saving old task data may be impractical due to privacy concerns and inaccessibility, so we propose using external data to compensate for the inter-task comparisons. Our work not only resolves inter-task confusion but also provides a new performance-improvement scheme for CCSSL that does not save old task data, based on the properties of contrastive learning (see Appendix A.2.4).
[1] Integrating present and past in unsupervised continual learning. Zhang et al. CoLLAs 2024.
I appreciate the authors' rebuttal as well as the reference to Huang et al.
I have read through the other reviews and the corresponding rebuttals. Most of my concerns are resolved, but I'm still not entirely convinced by the arguments around privacy concerns/motivation. This is mainly due to the lack of a concrete practical scenario where (i) we cannot store past training images, (ii) the external dataset is different enough from the training data (and thus not private), and (iii) using this external dataset helps performance. Unless based on a concrete scenario and experiment, I find it hard for people to agree on (ii).
Therefore I tend to keep my score of weak accept.
Thank you for your thorough review and valuable feedback!
Regarding privacy concerns, we can provide several concrete scenarios where our method would be particularly applicable. For instance, in the context of remote sensing or surveillance images, storing past training images could raise significant security and privacy issues. As a result, it would not be feasible to retain these images directly. However, our method allows us to leverage in-task data to sample relevant external data from similar publicly available datasets [1][2] after the current training stage. These external datasets can be stored and used for subsequent training in a privacy-preserving manner. Given their relevance to the tasks at hand, it is reasonable to expect that using this external data can improve performance.
[1] Sun X et al., "FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery," ISPRS 2022.
[2] Li D et al., "A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios," TIP 2018.
We hope this clarification addresses your concerns. Thank you again for your insightful comments.
The paper introduces BGE, a novel approach to address the challenge of inter-task data comparison in Continual Contrastive Self-Supervised Learning (CCSSL). BGE incorporates external data to bridge the gap between tasks, facilitating implicit comparisons and improving feature discriminability. The paper also presents the One-Propose-One (OPO) sampling algorithm to select relevant and diverse external data efficiently. Experiments demonstrate BGE's effectiveness in enhancing classification results across various datasets and its seamless integration with existing CCSSL methods.
Strengths
1. BGE offers a creative solution to a significant but often overlooked problem in CCSSL, enhancing the feature learning process through external data.
2. The paper provides extensive experimental results that validate the effectiveness of BGE in improving classification performance across different datasets.
Weaknesses
1. The introduction of external data may increase the computational cost and training time, which could be a limitation for resource-constrained environments. The authors could provide more analysis of the extra time consumption.
2. While BGE shows promising results, the paper could provide more insight into how the method scales with the size of the external datasets, which is crucial for very large-scale problems.
Questions
How does BGE handle the potential privacy concerns that may arise from using external data, especially if the data contains sensitive information?
Limitations
The paper acknowledges the increased computational cost due to the use of external data. However, it could further discuss the trade-off between performance improvement and time consumption.
Thank you for your valuable comments. We address your questions or concerns below.
Q1: The trade-off between performance improvement and time consumption
A1: The additional computational consumption of BGE comes mainly from the additional amount of data. We therefore report the performance and the corresponding computational cost ratios, based on PFR with 4 tasks, as the external data budget increases:
| Budget | 0 | 2000 | 5000 | 10000 | 20000 | 30000 |
|---|---|---|---|---|---|---|
| Performance | 60.92 | 61.96 | 62.79 | 64.37 | 65.43 | 65.55 |
| Cost | 1 | 1.12 | 1.3 | 1.6 | 2.2 | 2.8 |
The cost represents the ratio of the number of iterations to the baseline. As the budget increases, the improvement of BGE over the baseline method becomes more significant, until the amount of external data reaches approximately 20,000, at which point it becomes difficult to improve performance by further increasing the budget. Considering the trade-off between performance improvement and time consumption, we set a budget of 10K in our method.
At the same time, to validate the effectiveness of BGE under constrained computational resources, we match the computational consumption of BGE to that of the baseline method in the following experiments. All experiments are conducted under the "PFR+BGE, CIFAR100 4 tasks, CIFAR10 as external dataset" setting.
- Train for fewer epochs. In this case, the model can only be trained for fewer epochs when more external data is used; the results of BGE with different external data budgets are shown below:

| Budget | 0 | 2000 | 5000 | 10000 | 20000 | 30000 |
|---|---|---|---|---|---|---|
| Epoch | 500 | 431 | 357 | 277 | 192 | 147 |
| Performance | 60.92 | 61.47 | 61.99 | 62.87 | 62.59 | 63.13 |

The second row of the table indicates the number of epochs that can be trained at each budget. When the budget is large (>10000), even though fewer epochs can be trained, the performance is still better than the baseline.
- Incorporating external data by mixup. In this case, instead of adding external data directly to the training, we mix each in-task sample with a randomly selected external sample during training, making the number of data iterations the same as the baseline. The only additional cost introduced here is the mixup operation, which is negligible (see the sketch after the table below).

| Method | PFR | PFR+BGE (mixup) | PFR+BGE |
|---|---|---|---|
| Performance | 60.92 | 62.57 | 64.37 |
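A sketch of this mixup variant (the Beta-distributed mixing coefficient is our assumption for illustration; the exact mixing scheme is not specified above):

```python
import torch

def mix_with_external(task_x, ext_x, alpha=1.0):
    # task_x: (B, C, H, W) in-task batch; ext_x: (M, C, H, W) external buffer.
    # Each in-task image is blended with one randomly drawn external image,
    # so the number of training iterations matches the baseline.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randint(0, ext_x.shape[0], (task_x.shape[0],))
    return lam * task_x + (1 - lam) * ext_x[idx]
```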
Q2: How does BGE scale with the size of the external datasets?
A2: As shown in the first table of A1, as the size of the external dataset increases, BGE provides larger improvements over the baseline method, until the budget reaches about 20,000, beyond which further increases in scale no longer improve performance. At that point, the budget is slightly larger than the scale of our in-task data. We therefore observe that the scale of the external data should roughly match the scale of the in-task data in our method.
Q3: How does BGE handle the potential privacy concerns that may arise from using external data, especially if the data contains sensitive information?
A3: We can limit the selection of external data to publicly available datasets on the Internet and refer to the license and other statements of each dataset to select data that are allowed to be publicly used; such data usually carry no privacy risks.
If, in some special cases, publicly available datasets used as external data cannot yield the desired results and we therefore have to collect data ourselves, we can introduce techniques from other domains (e.g., privacy filtering, differential privacy, machine unlearning) that are complementary to our method to address privacy concerns. Since the number of candidate external data is almost infinite, some useful external data can still be retained even after filtering.
Another perspective on privacy is data accessibility: we may not have access to old task data. For example, some existing models on the Internet may only release their checkpoints but not the training data for privacy reasons. If we choose to use publicly available external datasets, there are no accessibility concerns.
The authors:
- argue that an optimal model for continual contrastive self-supervised learning should perform as well as a model trained with contrastive learning on the whole set of data, including negative samples taken between different temporal slices of the dataset, not just within the same temporal slice
- propose a method for using pre-existing external data to augment the temporally constrained dataset
For context on my background, I am very familiar with SSL literature, only loosely familiar with continual learning, and have never heard of continual contrastive learning before.
Strengths
Researchers measure the performance of their technique on top of several existing techniques for preventing catastrophic forgetting, showing the performance of the combination of methods. It is valuable to know this.
The authors make comparisons against some baselines: not just training on the joint data from scratch (the non-continual-learning paradigm), but also with external data added. They also demonstrate the performance with random sampling of the external dataset vs smart sampling with their algorithm, and investigate some ablations. These results help to inform where their approach provides value.
The discrepancy between the performance of a model trained with negative samples between subdatasets and with negative samples only taken within subdatasets appears to be a noteworthy observation and one which should be discussed within the continual contrastive learning community. To my understanding, the joint task as the authors suggest sounds appropriate. However, this change could be construed as changing the task that is posed to the model and moving the goal-posts (defining an easier task than is used in the literature at present), indeed as could incorporating external data. The question is in some sense what the goal of CCSSL truly is. In standard continual learning, the goal is to retain performance on previous tasks in the face of training on new tasks. This is often simulated by having classes within a dataset arrive in staggered batches. However, in contrastive continual learning there appears to be only one task (contrastive learning) and the data on which that one task is trained is merely staggered. Does the authors' approach break down an artificial barrier that was in place to simulate a harder task? Or a barrier that should not have been present in the first place and is a vestigial barrier inherited from continual learning? This is unclear to me.
The work is generally well presented.
Weaknesses
My understanding of continual learning is that one experimental paradigm retains previously presented subdatasets/subtasks and prevents catastrophic forgetting by including the old tasks in the mix while introducing a new task (e.g., Robins, 2001; Aljundi, 2019). This setup is not considered in the paper, but it seems it would address a significant fraction of the issues the method is attempting to address with regard to the joint vs intra-only training configurations. I imagine results may still be improved by incorporating external data in the "imaginative" capacity before the full dataset has "arrived", even in this scenario. It is unclear why the authors retain this barrier (refusing to continue training on past datasets) even whilst changing the task from a series of isolated contrastive learning tasks to a joint contrastive learning task where data arrives at staggered intervals.
The paper is missing comparison to some additional baselines which would be useful to see:
- What is the performance for a model trained solely on external data, without using the continual learning dataset?
- The Joint+ED configuration uses a static subset of the external data. One could also consider using Joint plus a subset of size K of the external data that changes every epoch, so the model potentially eventually sees all samples from ED and not just a subset.
- What would the performance be if instead of finding external data proxies for the existing data, you simply retained the previous D_i datasets from previous tasks without discarding them?
Statistical significance
There is no evaluation of whether the differences in results are statistically significant. This could be done over repeated runs with different seeds; experiments appear to only be performed with a single random seed. The authors do have repeated runs across different experimental paradigms, which could be combined into a test for differences without needing to perform experiments with multiple seeds; however, there may be correlated randomness between the experiments (i.e., it would be better if experiments were not all performed with the same seed, nor the same ordering of tasks; these should be held constant between comparators and varied between runs to eliminate the effect of these hidden variables on the findings).
Figures
Fig 1: tSNE has parameters that need to be tuned correctly (perplexity in particular) in accordance with the scale of the features, whereas the more recent technique PaCMAP doesn't and typically produces better results without tuning. The lack of tuning of tSNE may impact the distribution seen in the figure, resulting in one method appearing better than the other by chance where a different choice of perplexity may have resulted in different findings. It is not clear whether the classes were cherry-picked to give favourable results for the authors' method and bad for existing methods. (I am not asserting that they were cherry-picked, but it is not indicated how the classes were selected in the paper so it is not possible to know whether they were or if these results are representative.) These points are not so important as the figure is more illustrative than quantitative anyway.
Fig 2: Font size is too small; to maintain legibility, figure fonts should be no smaller than ~70% the font size of the main text.
Tables
Table 5: It is not clear why this experiment was performed with PFR only. The experiment does not necessarily need to be run with FT and CaSSLe too, but the authors should say on what basis PFR was selected (e.g., that it performs better than FT and CaSSLe).
Table 6: Not specified which method was used (FT, CaSSLe, PFR)
Table captions should indicate what the initialisms (CP, CPI, IN, INP, IND) stand for, so readers don't have to look in a distant part of the text to find out. In general, these initialisms are not intuitive: the characters are all run together, the number of characters coming from a dataset in the group is sometimes 1, 2, or 5, and "I" and "P" cannot be intuitive when there are multiple datasets in use that start with these characters. This makes it hard to follow the results. The table headings could be restructured to make this clearer, e.g., instead of CIFAR, CP, CPI, use headings C-10, +P365, +IN-R, which are immediately readable and convey the differences between the columns succinctly.
Tables would be more readable if you used \cmidrule to indicate the groupings that the headings apply to, instead of having a rule across the whole table.
Typographical
- L59 Missing word "with them. [This] enables the"
- L68 "performance doesn't improve even sometimes decreases."
- L89 "Since no labeling requirement, incorporating"
- L239 sentence is not written correctly
Citations
Casing of initialisms is wrong on numerous citations, e.g.
- [2] vit
- [15] Pathnet
- [40] icarl
- [47] t-sne
- [48] caltech-ucsd birds
Some citations provide no location at which the paper being cited can be found, e.g.
- [48]
Some citations cite arXiv versions of papers instead of peer-reviewed versions, e.g.
Questions
What is the anticipated scenario where this experimental paradigm would be utilized? Without an understanding of the intended usage, it is unclear whether linear probing is a sufficient evaluation, or if others such as fine-tuning (advocated by e.g. MAE), kNN (advocated by e.g. DINOv2), or clustering (advocated by e.g. ZSC) should be considered in addition to LP.
The size of the external dataset being added, K, does not appear to be indicated in the paper. What is this value and how was it selected? What is the impact of the choice of K?
Are the groupings of classes the same or randomized between experiments?
The sweep to add new data to the externally derived dataset includes a loop within a loop, comparing every sample in the current subdataset against every sample in the external dataset(s). How expensive is this step? There are presumably trade-offs that could be considered in practice, such as how large the pool of external data should be and whether to prune it in advance in order to minimize the cost of the external data selection sweep, and how often the external data can be swept over (in this paper the sweeps are locked to the data arrival/departure times, but in a more general continual learning framework it would be important to know how often to update the subset of the external dataset being used for training).
Limitations
The motivation for the method is a niche of a niche. I cannot see the union of these restrictions being a scenario encountered in practice. The requirements for the paradigm are:
- A large repository of unlabelled training data for this task does not yet exist to train the model on.
- A continual stream of training data for the task will become available over the course of the period of time where the model is trained (and the model subsequently refined as more data becomes available).
- A very large repository of publicly available data that is near-OOD to the domain of the task does exist.
- There is domain-shift in the continual stream of incoming data that is of a magnitude comparable to the domain shift between the stream of data and the pre-existing external data.
- Although it is fine to train our model on the continual stream of data when it arrives, for privacy reasons we want to periodically destroy the in-domain data we have collected.
This set of restrictions seems unlikely to occur in practice:
- For modalities other than vision, contrastive learning is often challenging to deploy due to its reliance on a robust, manually-curated, augmentation stack.
- For photographs of objects in the world, large datasets already exist (such as is used in the paper).
- For medical images, large near-OOD datasets are not available; furthermore, if you have the rights to train the model on data in a way that is secure and retains the privacy of that data, you do not lose those rights to access the data, so you can keep training on previously collected data.
- For personal images that are requested to be deleted from the company's database by the owner, models may be required to forget the personal images, in which case catastrophic forgetting is advantageous! These requirements have created the nascent field of machine unlearning [1], [2], [3].
Thank you for your valuable comments. We address your questions or concerns below.
Q1: The goal and setups of CCSSL
A1: In self-supervised learning, we expect the model to learn discriminative representations from unlabelled seen data. In Continual Self-Supervised Learning (CSSL), the data to be seen is set as a non-stationary data stream with task intervals, and we expect the model to accumulate discriminative representations of all seen data as tasks arrive. Continual Contrastive Self-Supervised Learning (CCSSL) focuses on contrastive self-supervised methods in continual learning. In CCSSL, the concept of "a task" represents "a stationary data distribution", and by learning this task, the model can discriminately represent the data from this distribution. In concrete experimental setups, we divide all the classes of the dataset equally among the tasks, and the classes of different tasks do not intersect.
CCSSL is crucial because it allows the model to retain the discriminative representation for old tasks while enhancing the representation for new tasks in continual training without training from scratch, enhancing the training efficiency in practical applications. Meanwhile, self-supervised learning offers more generalized representations than supervised learning, so SSL also has the advantage of less forgetting in continual learning research.
Q2: Additional baseline: compare with a model trained solely on external data
A2: We appreciate your comments about useful baseline settings. We conduct experiments training the model solely on external data; the results are shown in PDF Table R1Q2. Although all of the external data (CIFAR10, or a subset of CC3M with ~431K images) is jointly used to train the models, their performance is worse than our method's.
Q3: Additional baseline: change external data every epoch during Joint+ED experiment
A3: We also update the external data at each epoch in Joint+ED, achieving a result of 68.02, which is still close to Joint without ED (68.09). This indicates that BGE’s performance is not dependent on seeing all samples from ED.
Q4: Why can't we simply retain the previous tasks' datasets?
A4: Please see the common response.
Q5: The experimental anticipated scenario
A5: Our evaluation aims to measure how well different methods maintain stable discriminative representations in continual self-supervised learning, rather than pursuing the best performance on each downstream task. Therefore, we use linear probing as a straightforward tool for the classification task without updating the backbone. As the reviewer suggested, we also evaluate the kNN classification accuracy in PDF Table R1Q5, where incorporating BGE on top of PFR also improves the kNN accuracy. We will explore other evaluation methods in future work.
Q6: Related information about external data budget
A6: We report the budget value at L228 and L256 of our paper. We set it to 10K in the main experiments. We also conduct experiments as K changes, under the setting of CIFAR100 4 tasks, using CIFAR10 as the external dataset (see PDF Table R1Q6).
As K increases, the improvement of our method becomes more significant. However, it also introduces additional computational costs. Considering the trade-off between performance and efficiency, we choose the value 10K.
Q7: Are the groupings of classes the same or randomized between experiments?
A7: To ensure fairness, groupings of classes were the same for all experiments.
Q8: The cost, pruning, and sweep interval of external data selection sweep algorithm
A8: After extracting all data features, the 13,000 in-task samples sweep the external dataset at a rate of approximately 48 seconds per 100,000 external samples. Compared with the training time (typically many hours), this cost is negligible.
It is certainly possible to use a subset when the external dataset is too large. In our paper's Table 6, the experiment on CC3M (~3M in total) used a randomly chosen subset (~431K). To further validate this, we set the external data to ImageNet-1K (~1.3M) or a randomly selected subset of it (~100K); the results are shown in PDF Table R1Q8(a). These results demonstrate that using a subset of a large external dataset yields results comparable to using the full dataset.
We also explored the impact of updating external data at different intervals, setting the update interval to 100 epochs and 200 epochs, with results shown in the PDF Table R1Q8(b). The results indicate that more frequent updates do not improve BGE's performance. Considering that the external data must compensate for the inter-task comparison, we update external data at the end of each task's training.
Q9: Discuss the limitation
A9: Please see the common response.
Q10: Statistical significance
A10: We report the statistical significance of the primary experiments in Appendix A.2.6. Experiments in Tables 1 and 2 are all performed with the same seed and the same ordering of tasks to eliminate the effect of hidden variables.
Q11: t-SNE Figure
A11: To ensure fairness, we set all parameters of each t-SNE plot strictly consistent. We show the t-SNE plots to visualize the problem of inter-task confusion and the effect of our solution in an intuitive and illustrative manner, and to achieve complementarity with the quantitative results in the subsequent tables.
Q12: Writing comments
A12: We appreciate your valuable suggestions about figures, tables, typography, and citations, and we will revise them carefully.
Thanks to the authors for their response, and for running several initial experiments along the lines of the additional baselines I recommended. These are encouraging for the effectiveness of the proposed method. The additional results showing the effect of changing the external data budget are also useful to see. If the costs are low, as the authors indicate, then these appear to motivate an increase to K, in my opinion.
I note that the method proposed by the authors has two names in the paper, BGE and One-Propose-One. I think a more apt name might be "replay by proxy", or "surrogate replay", or something along these lines. When viewing the method as a replay-based method that uses a proxy sample instead of the original sample, the need for an "oracle" baseline that uses the original samples for replay instead is made more stark. Random should be a lower-bound to OPO, whilst a model that replays the original samples should serve as an upper-bound to OPO. This would be especially useful to have in order to indicate how far the performance is between these two bounds - are the proxies just as good as the originals?
Thank you for your valuable feedback!
Comparison of BGE and the "oracle" baseline
We appreciate your comments about the comparison with the oracle baseline. To validate the effectiveness of our method, we compared the results of BGE using CIFAR10 or ImageNet as external data with the oracle baseline on the "CIFAR100 PFR" setup, and the results are as follows:
CIFAR10 as external dataset:
4 tasks:
| Budget | 0 | 2000 | 5000 | 10000 |
|---|---|---|---|---|
| BGE | 60.92 | 61.96 | 62.79 | 64.37 |
| Oracle | 60.92 | 62.31 | 64.19 | 65.21 |
10 tasks:
| Budget | 0 | 2000 | 5000 | 10000 |
|---|---|---|---|---|
| BGE | 55.57 | 58.41 | 59.66 | 61.02 |
| Oracle | 55.57 | 59.16 | 60.66 | 61.75 |
ImageNet as external dataset (4 tasks):
| Budget | 0 | 10000 |
|---|---|---|
| BGE | 60.92 | 64.75 |
| Oracle | 60.92 | 65.21 |
In most cases, the difference in performance improvement between BGE and Oracle is relatively small (<1%), which proves that with adequate external data, BGE is capable of achieving performance improvement close to Oracle without preserving old data.
We have also demonstrated that our method outperforms the random baseline in the initial submission. We appreciate the suggestion that a more apt name might be "replay by proxy", "surrogate replay", or something along these lines. We will incorporate this idea into the final version by explaining our method's connections to both the random and oracle baselines.
Thanks to the authors for their response. I am happy to raise my score on the understanding that the authors revise the paper in congruence with the discussion and preliminary additional experiments shared by the authors:
- Provide the examples of remote sensing or surveillance applications in the introduction
- Highlight clearly the limitation of applications to domains (e.g. medical imagery) where there are no large public datasets that are suitable to use for the external data
- Add the replay baseline throughout all experiments in the paper (this is IMO the most important baseline to add)
- Add the rotating external data baseline
- Add the external-data-only baseline
- Add exploration of the effect of external data budget size (K)
- Add the fraction of DomainNet domains being selected
Also, one final thing to clarify.
Q11: t-SNE Figure A11: ... To ensure fairness, we set all parameters of each t-SNE plot strictly consistent....
Yes, but t-SNE is sensitive to the choice of the perplexity parameter in a way where the best perplexity parameter varies depending on the data that is being plotted. So just using the same perplexity for each method isn't necessarily going to work well. I realize that the figure is only for illustrative purposes, so it is only a minor point, but I wanted to let you know that t-SNE can be misleading and more recent techniques such as PaCMAP do not have the draw-back of being so sensitive to the choice of parameters.
Thank you for the insightful and comprehensive feedback. We have gained valuable insights from your comments and will revise our paper accordingly. Additionally, we will include PaCMAP as an alternative tool for visualization.
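For reference, this is the kind of call we would use (assuming the `pacmap` Python package; the feature array below is a placeholder):

```python
import numpy as np
import pacmap  # pip install pacmap

features = np.random.rand(5000, 512).astype(np.float32)  # placeholder for learned features

reducer = pacmap.PaCMAP(n_components=2)   # defaults typically need no tuning,
emb_2d = reducer.fit_transform(features)  # unlike t-SNE's perplexity
```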
We thank all reviewers for their thoughtful feedback. We have carefully read all the comments and summarize the recognition of our work as follows:
| Reviewers | Aspect | Comments |
|---|---|---|
| hAsz&oaUg&xWKh | Finding's Novelty | a significant but often overlooked problem in CCSSL; The finding that existing regularization-based CCSSL methods overlook inter-task discrimination is novel; I appreciate the idea... |
| hAsz&oaUg | Method's Novelty | ...a creative solution; the proposed method leveraging external data is novel. |
| gaeT&xWKh | Method's Description | The work is generally well presented; This paper is basically well-organized and easy to follow. |
| gaeT&hAsz&oaUg | Evaluation | These results help to inform where their approach provides value; ...extensive experimental results that validate the effectiveness of BGE; The experiments performed in the analysis section (Sec. 4.3)...are interesting to me. |
Due to the character limit, the table data of the responses to some reviewers (R-gaeT, R-oaUg) had to be put into the PDF. We apologize for the inconvenience.
To address common questions from multiple reviewers about privacy concerns and external data quality, we offer common responses below:
Common response to R-gaeT, R-hAsz, R-oaUg: More explanation about privacy concerns
Q1: Why can't we simply retain the previous tasks' datasets?
A1: Under the constraints of continual learning, we cannot save the entire old-task data. Some continual learning methods overcome catastrophic forgetting by saving a small subset of the old task data and replaying it in subsequent tasks (replay-based methods). However, even if only a subset of previous data is saved, storing the data may raise privacy concerns over sensitive information.
The privacy concerns we refer to in the paper arise primarily from the direct storage of in-task data. When data contain sensitive information, we do not want them to be saved and used continuously. If a replay-based method saves in-task data that happen to contain sensitive information, it will not achieve the desired result after that data is deleted.
Another perspective on privacy concerns lies in the accessibility of the data: we may not have access to previous task data. For example, when we continue to train a model based on someone else's open-source checkpoint, its training data may not be released together. In that case, it is not possible to continue training the model using replay-based methods, whereas BGE can still be used.
In summary, simply saving previous data cannot solve all the problems of continual learning. Studying continual learning methods under the restriction of not saving previous data makes sense, and the barrier of refusing to save any previous data is not maintained only in our work. In recent years, works [1][2][3] that do not save previous data have received more attention and have become a mainstream research paradigm.
[1] Self-sustaining representation expansion for non-exemplar class-incremental learning. K Zhu et al. CVPR 2022.
[2] Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. L Wang et al. NeurIPS 2023.
[3] Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning. X Liu et al. CVPR 2024.
Q2: How does BGE handle the potential privacy concerns that may arise from using external data?
A2: We can limit the selection of external data to publicly available datasets on the Internet and refer to the license and other statements of each dataset to select data that are allowed to be publicly used; such data usually carry no privacy risks.
If, in some special cases, publicly available datasets used as external data cannot yield the desired results and we therefore have to collect data ourselves, we can introduce techniques from other domains (e.g., privacy filtering, differential privacy, machine unlearning) that are complementary to our method to address privacy concerns. Since candidate external data are abundant, some useful external data can still be retained even after filtering.
From the perspective of data accessibility, if we choose to use publicly available external datasets, there are no accessibility concerns.
Indeed, during learning, the model encodes the learned knowledge into its parameters, which may also pose potential privacy concerns. However, this problem arises from the training process rather than from data storage, and therefore usually requires approaches from other domains, such as machine unlearning, to address.
Common response to R-gaeT, R-xWKh: Discuss the limitations of external datasets
Thanks for summarizing BGE's limitations, but we would like to clarify that BGE's requirements on external data are not that strict.
In terms of scale, external data comparable in size to the in-task data can yield promising results. In terms of data quality, BGE does not require all external data to be from the same domain as the in-task data, as shown in our ImageNet100 experiments using the DomainNet dataset as the external dataset. When the domains of the external data are complicated, the OPO sampling algorithm helps align the external data with the in-task domain. For tasks with scarce data (e.g., medical images), real external datasets are limited, but thanks to the development of image generation, methods for generating scarce data have been widely explored; we also show the compatibility of BGE with generated data in our paper's Table 6. Finally, since in real-world continual learning we cannot know the specific target of the next task, even if a large-scale image dataset already exists, training directly on it does not necessarily yield positive results on the task-specific target. Our work precisely helps utilize large-scale existing datasets effectively for continual learning.
I thank the authors for the response.
For example, when we continue to train a model based on other people's open-source checkpoint, its training data may not be released together. At this point, it is not possible to continue training the model using replay-based methods, whereas BGE can still be used.
Could you explain how BGE can be used in this case? I thought BGE required selecting external data based on their similarity with in-task training data.
Thank you for your valuable feedback!
We recognize that this is a challenging scenario. Our method usually starts by selecting external data using in-task data as a reference. However, when in-task data is unavailable, an alternative approach is required to adapt our method. One possible solution is to analyze the uncertainty in the model’s predictions across a broad range of external inputs. This allows us to pinpoint areas in the input space where the model shows high confidence, which likely reflects the original training data distribution. We can then use a subset of these high-confidence predictions as a proxy for in-task data, facilitating the selection of additional external data using our algorithm. We will incorporate these discussions into the paper and make the necessary revisions.
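To illustrate the direction we have in mind, here is a rough sketch (everything in it is an assumption: the probe head, the entropy criterion, and the keep fraction are illustrative, not an implemented part of BGE):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_confident_proxies(backbone, probe_head, ext_loader, keep_frac=0.1):
    # Score external samples by the predictive entropy of an auxiliary probe
    # head; low entropy is taken as a hint that a sample lies close to the
    # (inaccessible) training distribution of the released checkpoint.
    entropies, indices = [], []
    offset = 0
    for x in ext_loader:
        p = F.softmax(probe_head(backbone(x)), dim=1)
        entropies.append(-(p * p.clamp_min(1e-12).log()).sum(dim=1))
        indices.append(torch.arange(offset, offset + x.shape[0]))
        offset += x.shape[0]
    entropies, indices = torch.cat(entropies), torch.cat(indices)
    k = max(1, int(keep_frac * entropies.numel()))
    keep = entropies.topk(k, largest=False).indices  # most confident samples
    return indices[keep]  # candidate proxies for in-task data in OPO selection
```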
Thank you for the clarification.
For example, when we continue to train a model based on other people's open-source checkpoint, its training data may not be released together. At this point, it is not possible to continue training the model using replay-based methods, whereas BGE can still be used.
I was also confused by this and was going to write the same response as oaUg.
However, when in-task data is unavailable, an alternative approach is required to adapt our method. ...
Changing the model to work in such a scenario sounds highly involved, and so is not a feasible motivation for this work. I do recommend the authors explore the scenario of a public model with a private dataset further, as this is a plausible situation to be in, but do so in a future submission where they tailor their methods for this scenario.
Yet the paper certainly needs a better justification for the experimental paradigm, as this was queried by multiple reviewers. For the methods as they are in the paper, I recommend the authors focus on better motivating the need for the work from a privacy perspective instead.
In terms of data quality, BGE does not require all external data to be of the same domain as the in-task data, as shown in our ImageNet100 experiments using the DomainNet dataset as the external dataset.
There are 6 domains in DomainNet, and one of them is "real", which is the same domain as IN-100. Do the authors know (or can they check) what fraction of the images selected from this external dataset came from the "real" domain (and hence were in-distribution) vs other domains (and hence were OOD)?
For tasks with scarce data (e.g., medical images), the real external datasets are limited, but thanks to the development of image generation, methods for generating scarce data have been widely explored.
I see trends like this often, and I find such methods highly concerning for a couple of reasons. (1) Models trained on generated images have issues, as has been reported multiple times. Though the studies look at "model-inbreeding" from the perspective of training generative AI on the outputs of generative AI, rather than training representation learning on the outputs of generative AI, it is still cause for concern if the representation learning is to learn the biases of the generative model. (2) Using a generative model to create samples for representation learning can work via transfer learning from the knowledge in the generative model to the representation model. To do this, the generative model needs to understand the domain that is being generated and has its own representations inside the model. So instead of distilling the information via image space, you can adapt the generative model and avoid having to train a new model [1, 2]. But also, how was this generative model trained if there are no large datasets available for the domain of interest?
I remain concerned that there is a mismatch between the scenario being explored in the paper and any real-world scenario. Can the authors provide concrete examples of some tasks and data domains where they see their method being deployed? One where a dataset comes in continually and must be destroyed soon after training on it; one where there is no large publicly available in-domain dataset already available to train the model on. And exactly what external data would you propose to use?
[1] Mittal et al. "Diffusion Based Representation Learning", ICML 2023.
[2] Hudson et al. "SODA: Bottleneck Diffusion Models for Representation Learning", 2023.
Thank you for your valuable feedback!
Do the authors know (or can they check) what fraction of the images which were selected from this external dataset came from the "real" domain (and hence were in-distribution) vs other domains (and hence were OOD).
We check the fraction of images from each domain selected after the first task's training when using DomainNet as the external dataset. We set the sample budget to 10K; the results are as follows:
| Domain | real | clipart | infograph | painting | quickdraw | sketch |
|---|---|---|---|---|---|---|
| OPO | 4153 | 635 | 707 | 1734 | 1730 | 1041 |
| Random | 2942 | 819 | 892 | 1271 | 2894 | 1181 |
The second row displays the fraction of images from each domain sampled by OPO, while the third row shows the fraction of images sampled randomly. The results indicate that OPO sampling yields a higher fraction of images from the "real" domain and “painting” domain compared to other domains, demonstrating the effectiveness of the OPO algorithm. Additionally, the proportion of “quickdraw” images is largely reduced compared to random sampling. We will incorporate this interesting experiment into the final version of our paper.
Using AI-generated images
Thank you for your comments regarding the use of AI-generated images. In our paper, we presented AI-generated images as an alternative source of external data. While generative AI techniques have shown promise in many domains, we acknowledge that they may have limitations, as you pointed out.
We want to clarify that the use of generative external data was not the primary focus of our paper. Nonetheless, we recognize the importance of addressing the potential issues associated with generative data. We will incorporate a discussion of these concerns and their implications in the final version of the paper to provide a more comprehensive perspective.
Can the authors provide concrete examples of some tasks and data domains where they see their method being deployed?
We can provide several concrete scenarios where our method is particularly applicable. For instance, in remote sensing or surveillance applications, storing past training images can pose significant security and privacy issues, making it impractical to retain these images directly. Our method addresses this challenge by leveraging in-task data to sample relevant external data from similar publicly available datasets [1][2] after the training stage. These external datasets can be stored and used for subsequent training in a privacy-preserving manner. While our method may face limitations in scenarios where no publicly available in-task external data exists at all, we believe it remains effective in most cases where external sources are available. We will incorporate these discussions and possible limitations in the final version of the paper.
[1] Sun X et al., "FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery," ISPRS 2022.
[2] Li D et al., "A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios," TIP 2018.
The AC has carefully assessed the paper, all reviewing materials, rebuttals, and discussions. While the paper presents promising results in contrastive continual learning, there are significant issues that prevent it from being ready for publication. The primary concern is the incomplete nature of the experiments: several key baselines and important evaluation metrics needed to fully assess the impact of the proposed method were missing. The authors did make some efforts to add preliminary results during the rebuttal, but given the significant changes required to provide a thorough evaluation, the paper would require another cycle of review to ensure proper standards are met.
However, the AC encourages the authors to address these issues and resubmit the paper in a future cycle. The topic is compelling, and with further refinement and a more complete set of results, the work has the potential to make a significant impact in the field.