Multiple Modes for Continual Learning
Abstract
Reviews and Discussion
This paper analyzed the optimization behavior of multi-task learners and observed that learning all incremental tasks in a shared mode tends to suffer more interference. The authors then proposed training multiple parameter modes in parallel and optimizing task allocation per mode. The proposed method can improve the performance of regularization-based baselines such as EWC.
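For concreteness, here is a minimal sketch of how I read the proposed setup; the module definitions, the number of modes, the allocation rule, and the plain quadratic drift penalty (EWC would additionally weight it by a Fisher estimate) are my own assumptions, not the authors' implementation.

```python
# Minimal sketch (my reading, not the authors' code): K parameter "modes" trained
# in parallel; each incoming task is allocated to one mode, which is updated with
# a quadratic penalty that discourages drift from that mode's previous parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3  # number of modes (assumed)

def make_mode():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

modes = [make_mode() for _ in range(K)]
anchors = [{n: p.detach().clone() for n, p in m.named_parameters()} for m in modes]

def allocate(x, y):
    """Assign the task to the mode with the lowest current loss (one possible rule)."""
    with torch.no_grad():
        losses = torch.stack([F.cross_entropy(m(x), y) for m in modes])
    return int(torch.argmin(losses))

def train_task(x, y, lam=1.0, steps=100, lr=1e-2):
    k = allocate(x, y)
    mode, anchor = modes[k], anchors[k]
    opt = torch.optim.SGD(mode.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(mode(x), y)
        for name, p in mode.named_parameters():
            # drift penalty toward the mode's previous parameters
            # (EWC would weight each term by a Fisher-information estimate)
            loss = loss + lam * ((p - anchor[name]) ** 2).sum()
        loss.backward()
        opt.step()
    anchors[k] = {n: p.detach().clone() for n, p in mode.named_parameters()}

# toy usage: one "task" of random data
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
train_task(x, y)
```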
Strengths
- The empirical analysis of parameter modes is interesting. It provides a closer look at parameter drift in continual learning.
- The proposed method extends the previous use of ensemble models in continual learning.
Weaknesses
- The motivation for the model design is somewhat intuitive and based on toy empirical results.
- An important related work is missing. CoSCL [1] is a recent ensemble-based method that also trains multiple parameter modes in parallel and regularizes their diversity, implemented on top of regularization-based baselines. Many empirical analyses in this paper are similar to those in CoSCL, such as the trade-off between learner number and size (Figure 2) and the flatness of the loss landscape (Figure 3). To demonstrate the contributions of this paper, an in-depth comparison is strongly encouraged.
- Although the proposed method can improve EWC to some extent, the performance lead over the strongest baseline (i.e., WSN) seems limited and is demonstrated only on relatively simple benchmarks (i.e., Task-IL and CIFAR-100). Experiments on more complex benchmarks would be more supportive.
- All experiments use a checkpoint pre-trained on ImageNet, which overlaps with downstream datasets such as CIFAR-100 and Tiny-ImageNet.
[1] CoSCL: Cooperation of Small Continual Learners is Stronger than a Big One. ECCV 2022.
Questions
In addition to the major concerns in Weaknesses, I have two additional questions:
- How many times were the experiments run with different random seeds?
- The formatting of the references is confusing. The reference for EWC appears twice.
The manuscript presents a continual learning strategy based on an ensemble method, along with associated initialization and regularization methods, to update the models of the ensemble in a manner that mitigates the forgetting of past tasks.
Strengths
I find the idea of utilizing multiple 'modes', which are encouraged to develop different sets of parameters and are then updated appropriately and selectively, to be truly intriguing. Additionally, I believe the motivating comment on how most regularization methods are used to enforce model stability is quite sound.
Weaknesses
I believe that the primary issue with this work is its lack of clarity and organization. It is quite challenging to read, and I attribute this difficulty not to any inherent technical complexity in the proposed method but rather to its poor presentation. I find this to be particularly unfortunate and frustrating because, in my opinion, the work has the potential to make a valuable contribution. It's worth noting that the manuscript falls almost a page short of the total length allowed for this venue, and utilizing that additional space could have helped address some of the main concerns I will now list (in no particular order).
- In many parts of the manuscript, you refer to 'global geometry'. Sometimes you speak of the global geometry of the parameter space, while at other times of the loss landscape. Although these concepts may be related, they are not the same thing. Furthermore, there are instances where you use the term 'global geometry' generically. It would greatly enhance clarity if you provided a formal definition of what you mean by 'global geometry'.
- A complexity analysis is necessary to understand how the method scales. The method's update process involves a backtracking algorithm, and some of the loss calculations require computing distances between sets of weights from different models for each task; see the rough cost sketch after this list. A discussion of the computational complexity would provide valuable insights.
- I do not understand either the statement or the implications of Theorem 1; the statement contains undefined symbols and odd notation, e.g., a quantity that was previously defined as a natural number is treated as if it were a set, which it clearly is not (and similarly for another symbol).
- No explanation or justification is given for why these quantities are randomly sampled, nor for the specifics of their random generation. Why is this better than using the same fixed value for all of them, and how does this sampling affect performance?
- The authors choose to update the parameters of the various models so as to minimize the drift from previous parameter values. While I partially understand this approach, it seems to be precisely one of the criticisms the authors raise against EWC. A more comprehensive discussion of this point is needed.
- The paper's organization is suboptimal. Section 2 combines background information with motivations and results, while Section 3 presents crucial aspects of the proposed method more as documentation for the 'initialize_parameters' and 'update_parameters' methods than as an organic organization of concepts.
- The manuscript lacks a discussion of the signs of the relevant coefficients, even though this is crucial. Based on the pseudocode, if the coefficient is positive, the term effectively minimizes the distance, while the text suggests that it maximizes it (see the toy check after this list). A clarification regarding the sign and its implications is needed.
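To make the complexity concern above concrete, here is a rough, hypothetical illustration (not the authors' code) of the pairwise weight-distance cost: with K modes of P parameters each, computing all pairwise distances is on the order of K(K-1)/2 · P operations per task.

```python
# Hypothetical illustration of the pairwise weight-distance cost across K modes.
import itertools
import torch
import torch.nn as nn

K = 3
modes = [nn.Linear(32, 10) for _ in range(K)]  # stand-ins for K parameter modes

def pairwise_weight_distances(modes):
    # Flatten each mode's parameters once (O(K * P) memory), then compute
    # K*(K-1)/2 distances of O(P) each -> O(K^2 * P) work per call/task.
    flat = [torch.cat([p.detach().reshape(-1) for p in m.parameters()]) for m in modes]
    return {(i, j): torch.norm(flat[i] - flat[j]).item()
            for i, j in itertools.combinations(range(len(flat)), 2)}

print(pairwise_weight_distances(modes))
```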
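On the sign issue, the direction of the effect under gradient descent is easy to verify with a toy example (again hypothetical, not the paper's exact objective): adding the distance term with a positive coefficient shrinks the distance, while a negative coefficient grows it.

```python
# Toy check: gradient descent on lam * ||theta - theta_prev||^2.
# lam > 0 shrinks the distance to theta_prev; lam < 0 pushes theta away.
import torch

theta_prev = torch.zeros(5)
theta0 = torch.randn(5)

for lam in (+1.0, -1.0):
    theta = theta0.clone().requires_grad_(True)
    start = torch.norm(theta - theta_prev).item()
    opt = torch.optim.SGD([theta], lr=0.1)
    for _ in range(20):
        opt.zero_grad()
        loss = lam * ((theta - theta_prev) ** 2).sum()  # drift term only, for clarity
        loss.backward()
        opt.step()
    end = torch.norm(theta.detach() - theta_prev).item()
    print(f"lam={lam:+.0f}: distance {start:.3f} -> {end:.3f}")
```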
Questions
- Assuming we have a consistent definition of "global geometry", it is not clear to me how you can "leverage global geometry" while still working at the local level of the various tasks.
- You speak of "functional diversity"; what do you mean by this, precisely and mathematically?
- There are two figures with the same label (Figure 1).
- I think that on page 2, in the "Trade-off" subsection, Table 2 should be referenced instead of Table 3.
- Figures and tables are almost illegible.
- In mathematics, argmin is a set (the set of all points at which the function in question attains its minimum value); here it is treated as a vector (a single element of that set).
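For reference, the set-valued definition I have in mind (standard, not specific to this paper):

```latex
% argmin as a set: all points at which f attains its minimum over X
\operatorname*{arg\,min}_{x \in X} f(x)
  \;=\; \bigl\{\, x^{\star} \in X \;:\; f(x^{\star}) \le f(x) \ \text{for all } x \in X \,\bigr\}
```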
This paper proposes a continual learning method that uses multiple modes to prevent forgetting. The approach was inspired and supported by a theorem showing that multi-mode parameters are closer to the MTL parameters than single-mode parameters. The authors show the effectiveness of the proposed method in Instance-IL and Task-IL. The authors also visualize the loss landscape to demonstrate their idea.
Strengths
The proposed method is intuitive and shows some advantages in performance compared to the existing baselines.
Weaknesses
- The initial task seems to be important, as the modes are computed initially using the first task. Experiments on this should be included.
- The authors should report the training time of the proposed method as well as that of all the baselines. The proposed method seems computationally expensive, as it computes the gradient of the parameters of each mode for a task (line 7, Alg. 1) and another gradient for the combined parameters (line 12, Alg. 1).
- The task split is not clear. The authors state that "in instance-IL, each new task brings new instances from known classes", but Instance-IL Split-CIFAR100 is constructed by dividing 100 classes into 5 tasks with 20 coarse labels per task. From my understanding, the tasks are disjoint from each other.
- I personally think the main CL setups are class-incremental, task-incremental, and domain-incremental learning. If the proposed method targets a different learning scenario, the authors should have defined that scenario more clearly. From the current manuscript, it is not clear how Instance-IL is constructed.
- The authors say the method is online. However, according to the Experiment Configuration in the Appendix, the model is trained for 200 epochs. Is this online CL, where the model is trained for a single epoch per task, or offline, where the model is trained for multiple epochs?
- It is concerning that the authors use a network pre-trained on the full ImageNet dataset for learning CIFAR and Tiny-ImageNet. Many of the classes in ImageNet are very similar to those in the CL datasets CIFAR-100 and Tiny-ImageNet, so there could be information leakage from the pre-training classes to the CL classes. For this reason, many existing methods ensure that the pre-training and CL classes are different. For instance, [1] uses a model pre-trained on ImageNet after removing the classes similar or identical to those in the CL datasets, while [2, 3] pre-train on half of the classes (e.g., 50 for CIFAR-100) and continually train on the remaining classes.
- The experimental results are not surprising, and the Task-IL results seem a bit low: [4, 5] achieved more than 80% accuracy and [6] achieved more than 92% accuracy.
[1] Learnability and Algorithm for Continual Learning
[2] Prototype Augmentation and Self-Supervision for Incremental Learning
[3] Dynamically Expandable Representation for Class Incremental Learning
[4] Overcoming Catastrophic Forgetting with Hard Attention to the Task
[5] Supermasks in Superposition
[6] A Theoretical Study on Solving Continual Learning
Questions
In addition to the comments in Weaknesses, please answer the following questions.
- The sum in Theorem 1 is taken over the total number of parameters divided by N. Are the parameters of the different modes disjoint? If they are disjoint, how are they split? If they are not disjoint, please report the model size required for training.
This paper proposes a new approach to continual learning that applies multiple modes which are jointly updated after each task, a special ensemble-based approach. The work aims to maximize the difference between the modes on the initial task and then minimize the parameter drift on the following tasks. Experimental results on task- and instance-based incremental learning classification benchmarks show that this work can beat existing methods.
Strengths
- The idea of applying a mixture of sub-modes, instead of a single model with a large number of parameters, is intuitively reasonable, and not many works have focused on this parameter-based view;
- The authors specifically describe how to reduce parameter-space drift between different tasks, with a corresponding theoretical analysis. Experimental results also show the effectiveness of this approach.
Weaknesses
- In the experiments, the authors mainly compare with EWC. Recent work that focuses on a similar idea (not only parameter drift), e.g., ensembles of networks, should also be discussed and compared, for example Continual Learning Beyond a Single Model, dynamic network expansion, and so on [1, 2].
- Some descriptions are unclear. For example, what does it mean for a task to have a high level of certainty/uncertainty, and how can this degree be measured? In Sec. 3.1, when updating each mode with respect to the input loss, what is the input loss here, the cross-entropy?
- The experiments do not cover the commonly used Class-IL or Domain-IL settings. Also, the difference between Task-IL and Instance-IL is not discussed clearly.
[1] Efficient Continual Learning Ensembles in Neural Network Subspaces
[2] BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning. ICLR 2020
Questions
Please see Weaknesses.
Summary:
The paper introduces a novel continual learning strategy that employs multiple parameter modes to mitigate forgetting in multi-task learning environments. The authors propose training these modes in parallel and optimizing task allocation for each mode. The method is supported by an analysis of parameter drift in continual learning and empirically demonstrated to outperform regularization-based baselines like EWC in certain benchmarks.
Strengths:
- The concept of using multiple parameter modes in continual learning is novel and intriguing.
- The paper provides a detailed empirical analysis of parameter drift, offering valuable insights into the optimization behavior of multi-task learners.
Weaknesses:
- Reviewers noted issues with the paper's clarity and organization, making it challenging to follow.
- The paper misses some crucial recent related work, such as CoSCL, which employs a similar approach.
- The method is primarily tested on simpler benchmarks. Its effectiveness in more complex scenarios remains unexplored.
- The use of ImageNet pre-trained models, which may overlap with the continual learning datasets, raises concerns about information leakage.
Why not a higher score
The main weaknesses of the paper are its lack of clarity and organization, making it difficult to follow, and its insufficient comparative analysis with recent related works, particularly in terms of methodology and empirical validation.
Why not a lower score
N/A
Reject