Gradient-Guided Epsilon Constraint Method for Online Continual Learning
Abstract
Reviews and Discussion
The paper introduces the Gradient-Guided Epsilon Constraint (GEC) method, an experience replay-based online continual learning approach that formulates single-step parameter updates as an ε-constraint optimization problem. This enables the learning algorithm to dynamically adjust the strength of the replay buffer gradient during optimization to appropriately prioritize stability vs. plasticity. The paper attains strong results on various class-incremental benchmarks.
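In generic notation (the symbols below are illustrative placeholders, not necessarily the paper's own), the per-update problem the summary describes can be written as:

```latex
% Minimize the current-task loss subject to a stability constraint
% on the memory (replay buffer) loss, with slack tolerance \epsilon:
\min_{\theta}\; \mathcal{L}_{\mathrm{new}}(\theta)
\qquad \text{s.t.} \qquad \mathcal{L}_{\mathrm{mem}}(\theta) \le \epsilon
```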
Strengths and Weaknesses
Strengths
- The paper tackles the salient problem of dynamically balancing stability and plasticity during online continual learning, which is important in a field that generally focuses on one or the other.
- The paper's ε-constraint perspective on experience replay is not just novel but also useful, as it results in a strong method.
- The paper is very well-written and has strong empirical results on class-incremental OCL benchmarks.
- The experiment section is comprehensive, containing various benchmark results, breakdown of GEC's effects on stability and plasticity, computation details, and more.
Weaknesses
- The main text should refer to various appendix sections, as many useful details such as proofs and algorithms are present there but it is hard to reference them while reading the paper without direct pointers. In addition, a simplified related work section from the appendix should be included in the main text after the experiment section.
- The paper should discuss and, time permitting, compare against recent work that optimizes the experience replay objectives by formulating a single parameter update step as a constrained proximal update [1,2]. These methods also employ the experience replay loss and slow and/or accelerate optimization in parameter directions that do not interfere with model predictions on the replay buffer data. One advantage of GEC that stands out is that it does not involve preconditioning matrix inversion.
- MER baseline results are missing for Seq-CIFAR100 and Seq-TinyImageNet experiments in Tables 1 and 2.
- Nitpicks
- The paper refers to the term "posterior" on multiple occasions, and it does not seem to be related to the Bayesian posterior. Please define this.
- Figure 3c caption should be Seq-TinyImageNet Plasticity, not Seq-TinyImageNet Stability.
References
[1] Yoo, J., Liu, Y., Wood, F. & Pleiss, G.. (2024). Layerwise Proximal Replay: A Proximal Point Method for Online Continual Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:57199-57216.
[2] Urettini, E., & Carta, A. (2025). Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning. arXiv preprint arXiv:2502.01866.
Questions
- Would the theoretical findings hold when the gradient is approximated with Monte Carlo, as is the case in practice?
- What were the values of ε and the correction-strength coefficient for the experiments, and how should they be set generally?
- Do you have any idea why Figure 3a has GEC underperforming ER in plasticity early on for Seq-CIFAR100 while Figure 3c has GEC outperforming ER in plasticity for Seq-TinyImageNet?
- What do the dashed lines mean in Figure 1a?
Limitations
Yes
Final Justification
My concerns and questions have been adequately addressed.
Formatting Concerns
No concerns
We are grateful for your thorough assessment and encouraging feedback. We are delighted that you found our ε-constraint perspective "novel and useful," the paper "very well-written," and the empirical results "strong." Your insightful comments will help us improve the quality of our work. We address each of your points below.
W1: Appendix Referencing and Related Work
This is an excellent suggestion to improve the paper's readability and self-containment. We fully agree and will implement the following in the final version:
- We will add explicit pointers (e.g., "(see Appendix A.1 for proof)") throughout the main text to directly link to proofs, algorithms, and other details.
- Also, we will add a new "Related Work" section to the main paper before the Conclusion. It will distill the most critical comparisons and provide context, including the discussion of the works you suggested [1][2].
W2: Comparison with Constrained Proximal Update Methods [1, 2]
We sincerely thank you for bringing these recent papers to our attention. They represent an important and complementary direction in OCL, and a direct comparison helps clarify GEC's contributions. We will add a comparison in our new "Related Work" section:
- First, GEC stems from a multi-objective optimization (MOO) perspective, explicitly balancing plasticity and stability via a dynamic ε-constraint [3]. In contrast, LPR [1] and OCAR [2] approach the problem from an optimizer design perspective, using proximal methods and second-order information to stabilize the optimization process.
- Second, GEC's update is dynamic: the replay coefficient is recalculated at every step based on the current level of forgetting. When forgetting stays within the slack tolerance, the coefficient can be zero, focusing entirely on plasticity; when forgetting exceeds it, the coefficient increases to prioritize stability. In contrast, the preconditioning in LPR/OCAR remains fixed between updates.
- Third, as you pointed out, GEC is highly efficient: our update has a closed-form solution requiring no matrix inversion, a significant practical advantage over methods that must invert preconditioning matrices.
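To make the dynamic coefficient logic above concrete, here is a minimal, framework-free sketch. It is not the paper's exact Eq. (8)/(11); the names `g_new`, `g_mem`, and `violation` are hypothetical stand-ins, and the subproblem shown (nearest direction to the new-task gradient subject to one linear constraint) is a generic instance of the described behavior.

```python
# Illustrative sketch, NOT the paper's exact update rule.
# g_new: current-task gradient; g_mem: memory gradient;
# violation: how far forgetting exceeds the slack tolerance (<= 0 when satisfied).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gec_step_direction(g_new, g_mem, violation):
    """Closest direction to g_new whose step also reduces memory loss
    enough to offset `violation`.

    Solving  min_v 0.5*||v - g_new||^2  s.t.  <g_mem, v> >= violation
    gives the closed form  v = g_new + lam * g_mem  with
    lam = max(0, (violation - <g_mem, g_new>) / ||g_mem||^2),
    i.e. only vector operations, no matrix inversion.
    """
    lam = max(0.0, (violation - dot(g_mem, g_new)) / dot(g_mem, g_mem))
    return [gn + lam * gm for gn, gm in zip(g_new, g_mem)], lam

# Constraint already satisfied (violation <= 0): coefficient is zero,
# the update is pure plasticity.
v, lam = gec_step_direction([1.0, 0.0], [0.0, 1.0], violation=-1.0)
assert lam == 0.0 and v == [1.0, 0.0]

# Constraint violated: the coefficient switches on and tilts the
# update toward the memory gradient (stability).
v, lam = gec_step_direction([1.0, 0.0], [0.0, 1.0], violation=0.5)
assert lam == 0.5 and v == [1.0, 0.5]
```

The key property the sketch shows is the on/off behavior of the coefficient: it is exactly zero whenever the constraint is slack, and grows with the size of the violation otherwise.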
W3: Missing MER baseline results
The MER baseline was omitted for Seq-CIFAR100 and Seq-TinyImageNet due to its significant computational demands. As a MAML-based method, MER led to prohibitive training times and instability on these longer benchmarks. We will add a note in the experimental setup to clarify this decision, citing [4].
Nitpicks
- Thank you for this sharp-eyed catch of the use of "posterior". Our use was in the non-Bayesian sense of "after-the-fact." To avoid ambiguity, we will replace all instances with the more precise term "post-hoc".
- We are grateful for your meticulous reading of Figure 3c's caption. The caption is indeed incorrect and should read "Seq-TinyImageNet - Plasticity (Acc-P)". We will correct this.
Q1: Would the theoretical findings hold when the gradient is approximated with Monte Carlo?
This is a very insightful question. Our theoretical analysis is presented in a deterministic setting for motivational clarity. In practice, where gradients are Monte Carlo estimates, the key insight holds because the stochastic gradients are unbiased estimates of the true gradients [5,6]. Over iterations, we expect the optimization trajectory to converge in a way that respects the constraints on average. Our empirical results, obtained entirely with mini-batches, serve as a validation that GEC's core idea is effective and resilient to the stochasticity of the OCL setting.
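The unbiasedness argument above can be illustrated with a toy example. This is purely illustrative (it is not the paper's setting): for the quadratic loss L(theta) = mean_i (theta - x_i)^2, the minibatch gradient is an unbiased estimate of the full-batch gradient, so averaging many minibatch gradients recovers it.

```python
# Toy check that minibatch gradients are unbiased estimates of the
# full-batch gradient, for L(theta) = mean_i (theta - x_i)^2 with
# gradient 2*(theta - mean(x)). Illustrative only.
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
theta = 0.7

def grad(batch, theta):
    # Gradient of the mean squared loss over the given batch.
    return 2.0 * (theta - sum(batch) / len(batch))

full_grad = grad(data, theta)

# Average the gradient over many random minibatches of size 32.
est = sum(grad(random.sample(data, 32), theta) for _ in range(5000)) / 5000

# The Monte Carlo average converges to the deterministic gradient.
assert abs(est - full_grad) < 0.05
```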
Q2: What were the values of ε and the correction-strength coefficient, and how should they be set?
This is a very insightful question about the practical application of GEC. We used the same setting of ε and the correction-strength coefficient for all experiments, and it proved robust across all benchmarks. Here is a general guide to setting them:
- ε is the principal hyperparameter controlling the stability-plasticity trade-off. A smaller ε enforces stricter forgetting constraints (more stability), while a larger ε allows more flexibility to learn the new task (more plasticity). As illustrated in our Figure 4, varying ε allows a user to effectively trace a Pareto front of high-performing solutions to match their specific needs.
- The correction-strength coefficient controls the strength of the corrective action when the constraint is violated. We found that performance is quite robust to its choice. To demonstrate this empirically, we conducted a sensitivity analysis on Seq-CIFAR100.
Table 1: Sensitivity to the correction-strength coefficient on Seq-CIFAR100 (|M|=5k)
| Value | AAA | Acc |
|---|---|---|
| 0.2 | 41.95 | 36.80 |
| 0.5 | 42.47 | 37.12 |
| 0.8 | 42.11 | 36.91 |
As the table shows, performance remains high and very stable across a sensible range of values; the variation in both AAA and Acc is minimal. We will add this ablation study to the appendix.
Q3: Why does GEC's plasticity differ between Seq-CIFAR100 and Seq-TinyImageNet in Figure 3?
This is a good observation. Our hypothesis centers on the differing forgetting pressure of the two datasets [7]:
- On Seq-CIFAR100, with high inter-task confusion, the constraint is triggered more often (forgetting frequently exceeds the slack), especially early on. GEC correctly prioritizes stability, sometimes at the cost of short-term plasticity. This is GEC working as intended: defending against severe forgetting.
- On Seq-TinyImageNet, with more distinct classes, the constraint is more easily satisfied (forgetting stays within the slack). Here, GEC's advantage shifts to preventing overfitting to the memory buffer. Unlike ER's fixed weighting, GEC can dedicate its full update to the current task once the constraint is met, leading to superior plasticity.
Q4: What do the dashed lines mean in Figure 1a?
Thank you for seeking clarification. The dashed lines are iso-objective contours of ER's scalarized loss, i.e., lines along which the fixed weighted combination of the current-task and memory losses is constant; their slope is determined by the fixed trade-off weight. The optimal solution for a given weight is found where one of these lines is tangent to the achievable Pareto front. We will add this explanation to the figure caption to make it clear.
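For concreteness, using a generic fixed weight λ (our placeholder notation, not necessarily the paper's), a scalarized replay loss has iso-objective contours of the form:

```latex
\mathcal{L}_{\mathrm{new}} + \lambda\,\mathcal{L}_{\mathrm{mem}} = c
\quad\Longleftrightarrow\quad
\mathcal{L}_{\mathrm{new}} = c - \lambda\,\mathcal{L}_{\mathrm{mem}},
```

a family of parallel lines of slope -λ in the (L_mem, L_new) plane; minimizing the scalarized loss selects the contour tangent to the Pareto front.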
[1] Yoo, J., Liu, Y., Wood, F. & Pleiss, G.. (2024). Layerwise Proximal Replay: A Proximal Point Method for Online Continual Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:57199-57216.
[2] Urettini, E., & Carta, A. (2025). Online Curvature-Aware Replay: Leveraging 2nd Order Information for Online Continual Learning. arXiv preprint arXiv:2502.01866.
[3] Miettinen K. Nonlinear multiobjective optimization[M]. Springer Science & Business Media, 1999.
[4] Wu Y, Wang H, Zhao P, et al. Mitigating catastrophic forgetting in online continual learning by modeling previous task interrelations via pareto optimization[C]//Forty-first international conference on machine learning. 2024.
[5] Robbins H, Monro S. A stochastic approximation method[J]. The annals of mathematical statistics, 1951: 400-407.
[6] Bottou L. Stochastic gradient descent tricks[M]//Neural networks: tricks of the trade: second edition. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 421-436.
[7] Goldfarb D, Evron I, Weinberger N, et al. The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting--An Analytical Model[J]. arXiv preprint arXiv:2401.12617, 2024.
Thank you for the clarifications. I have no further questions.
Thank you for your confirmation. We appreciate your time and the insightful feedback you provided throughout the review process. Your comments have been instrumental in improving our work.
The paper reveals that experience replay (ER) in Online Continual Learning (OCL) implicitly employs a soft constraint with a fixed slack, thereby risking suboptimal updates and overfitting. We analyze ER as an ε-constraint optimization problem, revealing its limitations. We propose Gradient-Guided Epsilon Constraint (GEC), explicitly formulating OCL updates as an ε-constraint problem: minimize current loss while enforcing stability constraints on memory data. GEC dynamically adjusts the update direction using gradients: if memory loss exceeds a slack tolerance, it prioritizes constraint satisfaction; else, it focuses on the current task while controlling memory loss increase. Evaluations show GEC achieves a superior stability-plasticity trade-off, improving overall performance.
Strengths and Weaknesses
Strengths:
1. The paper is well-organized, with clear theoretical derivations, intuitive visualizations, and a logical flow from problem formulation to evaluation.
2. The paper provides a novel perspective by reinterpreting Experience Replay (ER) as an ε-constraint optimization problem. This insight bridges the gap between rehearsal-based methods and constraint-based approaches, offering a unified framework for understanding trade-offs in online continual learning.
3. The proposed Gradient-Guided Epsilon Constraint (GEC) method introduces a dynamic, gradient-aware mechanism to balance plasticity and stability. By adjusting the update direction based on constraint violation, GEC avoids overfitting to the memory buffer while mitigating catastrophic forgetting. This design is both theoretically grounded and empirically validated.
4. The experiments on standard OCL benchmarks demonstrate consistent improvements over state-of-the-art methods like ER, POCL, and CBA. The ablation studies and the analysis of generalization gain vs. forgetting measures strengthen the claims.
Weaknesses:
1. The quadratic programming (QP) formulation in GEC may introduce computational overhead compared to simpler methods like ER. The paper briefly mentions training time comparisons (Table 3) but does not analyze how this overhead scales with larger datasets or more complex models.
2. The analysis assumes convexity and differentiability of the loss functions. In real-world non-convex scenarios, the guarantees for constraint adherence and convergence may not hold.
Questions
1. How does the QP formulation scale with increasing task complexity or dataset size? Are there approximations (e.g., closed-form solutions) to reduce computational overhead?
2. How does GEC perform in non-convex scenarios? What modifications are needed to handle non-convexity?
Limitations
Yes.
Final Justification
My concerns are well addressed.
Formatting Concerns
No
We sincerely thank the reviewer for the insightful and positive feedback. We are encouraged that you recognized our paper's clear organization, the novelty of our ε-constraint perspective, and the strength of our proposed method and empirical results. You have raised two crucial points regarding computational overhead and the convexity assumption, which are vital for practical applications. We appreciate this opportunity to provide a detailed clarification.
W1/Q1: QP Computational Overhead and Scalability
This is a critical point regarding the potential computational overhead of the QP formulation. We would like to clarify and emphasize a key aspect of our method that directly addresses this concern: the specific QP problem in GEC, as defined in Eq. (8), admits a closed-form analytical solution, which we present in Eq. (11).
This means we do not need to run an iterative QP solver, and no further approximations that might compromise performance are necessary. The computation of the optimal update direction v_k involves only vector operations that scale linearly with the number of model parameters (d).
To further quantify this, we conducted a surgical experiment on Seq-CIFAR10 to precisely isolate the costs of GEC's components. We compared standard ER, our full GEC method, and a variant, GEC (w/o QP), that performs the dual forward/backward passes but uses a simple gradient addition instead of our dynamic QP logic.
Table 1: Ablation on Computational Cost of GEC Components on Seq-CIFAR10 (|M|=1k)
| Method | Pass | QP | Training Time (s) | Acc |
|---|---|---|---|---|
| ER | Single | No | 86.31 | 42.04 |
| GEC (w/o QP) | Dual | No | 132.89 | 45.17 |
| GEC (Ours) | Dual | Yes | 147.62 | 59.12 |
As the table shows, the primary increase in training time arises from the dual forward/backward passes, which are essential to compute the constraint violation. The additional cost of solving the QP via its closed-form solution is negligible. This efficiency is also reflected in Table 3 of our main paper, where GEC's training time is substantially lower than other advanced OCL methods like POCL and CBA. We will add more complexity analysis to the Appendix in the final manuscript.
W2/Q2: Convexity Assumption and Non-Convex Performance
This is another insightful point, and we appreciate the opportunity to clarify the precise and limited role of the convexity assumption in our work.
You are correct that our theoretical analysis uses a convexity assumption. However, it's crucial to distinguish where and why it is used:
- First, the convexity assumption is used unidirectionally in Proposition 2 to formally establish the equivalence from a standard ε-constraint problem back to ER's weighted-sum objective. This proof relies on the Karush-Kuhn-Tucker (KKT) conditions, where convexity is required for the conditions to be sufficient for global optimality [1]. This step serves to provide a rigorous theoretical grounding for ER's behavior within the established optimization literature.
- Second, the analysis of our proposed GEC method's properties (Section 5.1 and Appendix B.2) does NOT depend on the convexity of the loss functions. Our theoretical guarantees (Constraint Adherence, Objective Minimization) are derived from the KKT conditions of the local QP problem defined in Eq. (8). This QP, which seeks the optimal update direction at each step, is always convex by its own construction (a quadratic objective with a single linear constraint), irrespective of the global non-convexity of the underlying deep learning loss landscape. This analytical approach is a standard tool for studying the behavior of optimization algorithms in non-convex settings [2][3].
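To illustrate why such a per-step subproblem is convex regardless of the loss landscape, consider a direction-finding QP of the schematic form below (generic symbols, not the paper's exact Eq. (8)): the objective is quadratic and the constraint is linear in the variable v.

```latex
\min_{v}\ \tfrac{1}{2}\,\lVert v - g_{\mathrm{new}} \rVert^{2}
\qquad \text{s.t.} \qquad \langle g_{\mathrm{mem}},\, v \rangle \ge b
% KKT stationarity gives v^{\star} = g_{\mathrm{new}} + \lambda^{\star} g_{\mathrm{mem}},
% with \lambda^{\star} \ge 0 determined by complementary slackness.
```

Whatever the curvature of the underlying losses, this subproblem is convex in v by construction, which is what makes the KKT conditions both necessary and sufficient at each step.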
Therefore, GEC as presented is designed for and validated in the practical, non-convex setting of deep neural networks, and thus no specific modifications are needed. Moreover, the SOTA performance of GEC across all our experiments using the non-convex ResNet architecture (Tables 1, 2 and results in our response to Reviewer bPNr) serves as the validation of its effectiveness in these non-convex deep learning scenarios.
[1] Boyd S P, Vandenberghe L. Convex optimization[M]. Cambridge university press, 2004.
[2] Su W, Boyd S, Candes E J. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights[J]. Journal of Machine Learning Research, 2016, 17(153): 1-43.
[3] Miettinen K. Nonlinear multiobjective optimization[M]. Springer Science & Business Media, 1999.
Thanks for the response. I have no further questions.
Thank you for the confirmation. We are grateful for your constructive review, which has helped us to strengthen the manuscript.
This work investigates the online continual learning (OCL) problem and revisits the classical strategies of Experience Replay (ER) and constraint-based optimization methods such as GEM and A-GEM. The paper first shows that ER can be formulated as a soft-constraint optimization problem, revealing the underlying behavior behind its strong performance. From there, the authors propose Gradient-Guided Epsilon Constraint (GEC), which dynamically adjusts the slack variable within this optimization framework. Experiments on common OCL benchmarks show encouraging results for GEC.
Strengths and Weaknesses
Strengths
- I like the theoretical analysis in this work. Although not too complex, it is original, easy to follow, and provided useful insights.
- The proposed GEC algorithm is well-motivated and seems to work well on common benchmarks.
- The authors conducted extensive ablation studies and also reported the training complexity.
Weaknesses
- My major concern of this work is the benchmarks considered are quite outdated, and small in scale. It would be more interesting to validate the algorithm on more challenging scenarios, such as the long-tail setting [A], on larger datasets (e.g. ImageNet), or in NLP/RL/Robotics domains.
- How does GEC interact with the BatchNorm (BN) component during OCL? Specifically, the original ER mixes current and previous data samples in a mini-batch and performs a single forward pass per update. In contrast, GEC performs two forward passes, on the current-task data and the previous-task data, which results in different update dynamics for BN's global statistics than ER. A prior study [B] showed that this difference may impact OCL.
[A] Kim, Chris Dongjoo, Jinseo Jeong, and Gunhee Kim. "Imbalanced continual learning with partitioning reservoir sampling." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. Springer International Publishing, 2020.
[B] Pham, Quang, Chenghao Liu, and Steven Hoi. "Continual normalization: Rethinking batch normalization for online continual learning." ICLR, 2022.
Questions
- How would GEC perform on more challenging benchmarks?
- Could you clarify the interaction between GEC and BN and compared it to ER?
Limitations
yes
Final Justification
Overall a solid work, and the author addressed my concerns in the rebuttal. Therefore, I recommend acceptance.
Formatting Concerns
N/A
We sincerely thank the reviewer for providing valuable feedback. We are encouraged that the reviewer found our theoretical analysis "original, easy to follow, and provided useful insights" and the GEC algorithm well-motivated. We detail our response below point by point.
W1/Q1: Performance on more challenging benchmarks
We thank the reviewer for this insightful suggestion. To demonstrate GEC's robustness, we conducted extensive new experiments in three challenging settings:
- Following [4], we introduced long-tail distributions to Seq-CIFAR10 and Seq-CIFAR100 with varying imbalance factors (γ=2 and γ=5) and distribution orders (Normal, Reversed).
- Following [3], we introduced three types of label noise ('flip1', 'flip2', 'uniform') at different corruption ratios to evaluate robustness.
- Following [5], we re-split Seq-TinyImageNet from 20 tasks into a more challenging 40-task sequence, denoted Seq-Tiny-l.
We compare GEC against strong SOTA baselines (DER++[1], CLSER[2], POCL[3]).
Table 1: Performance on Long-Tail Distributions
| Setting (γ) | Dataset | Metric | DER++ | CLSER | POCL | GEC |
|---|---|---|---|---|---|---|
| γ=2 | Seq-C10, Normal | AAA | 61.97 | 61.87 | 66.87 | 67.11 |
| | | Acc | 44.04 | 48.04 | 54.82 | 55.43 |
| | Seq-C10, Reversed | AAA | 58.52 | 55.32 | 64.45 | 64.47 |
| | | Acc | 39.43 | 42.38 | 58.79 | 58.94 |
| | Seq-C100, Normal | AAA | 23.45 | 22.57 | 24.13 | 26.98 |
| | | Acc | 13.77 | 12.61 | 15.20 | 18.20 |
| | Seq-C100, Reversed | AAA | 21.19 | 16.41 | 21.58 | 23.41 |
| | | Acc | 14.54 | 11.47 | 14.16 | 15.43 |
| γ=5 | Seq-C10, Normal | AAA | 49.64 | 53.83 | 54.32 | 56.17 |
| | | Acc | 41.22 | 35.72 | 42.13 | 44.06 |
| | Seq-C10, Reversed | AAA | 49.16 | 51.37 | 52.52 | 54.15 |
| | | Acc | 39.42 | 40.19 | 36.10 | 44.72 |
| | Seq-C100, Normal | AAA | 22.84 | 19.63 | 21.70 | 24.99 |
| | | Acc | 16.17 | 9.65 | 14.67 | 17.12 |
| | Seq-C100, Reversed | AAA | 14.63 | 13.11 | 17.54 | 20.51 |
| | | Acc | 11.14 | 7.34 | 12.58 | 16.03 |
Table 2: Performance under Label Noise
| Noise Type | Dataset (Ratio) | Metric | DER++ | CLSER | POCL | GEC |
|---|---|---|---|---|---|---|
| flip1 | Seq-C10 (0.05) | AAA | 60.93 | 62.62 | 61.14 | 63.20 |
| | | Acc | 49.60 | 52.63 | 51.17 | 52.75 |
| | Seq-C10 (0.1) | AAA | 58.58 | 58.45 | 59.20 | 63.13 |
| | | Acc | 44.84 | 44.30 | 47.96 | 52.37 |
| | Seq-C100 (0.05) | AAA | 24.16 | 21.69 | 22.38 | 26.76 |
| | | Acc | 15.40 | 13.34 | 15.41 | 16.50 |
| | Seq-C100 (0.2) | AAA | 21.80 | 19.18 | 18.99 | 22.11 |
| | | Acc | 12.84 | 10.69 | 11.41 | 11.37 |
| flip2 | Seq-C10 (0.05) | AAA | 63.21 | 61.19 | 62.11 | 66.03 |
| | | Acc | 49.31 | 48.47 | 49.04 | 54.69 |
| | Seq-C10 (0.1) | AAA | 60.66 | 65.15 | 60.90 | 62.90 |
| | | Acc | 41.40 | 52.68 | 42.18 | 49.67 |
| | Seq-C100 (0.05) | AAA | 22.88 | 21.96 | 21.59 | 25.88 |
| | | Acc | 14.83 | 10.53 | 11.30 | 15.63 |
| | Seq-C100 (0.2) | AAA | 22.26 | 19.15 | 19.83 | 22.32 |
| | | Acc | 11.20 | 10.68 | 11.01 | 15.36 |
| uniform | Seq-C10 (0.05) | AAA | 63.34 | 63.19 | 62.88 | 64.94 |
| | | Acc | 49.18 | 50.87 | 49.21 | 51.72 |
| | Seq-C10 (0.1) | AAA | 56.56 | 59.63 | 58.42 | 62.83 |
| | | Acc | 45.00 | 45.37 | 43.11 | 48.12 |
| | Seq-C100 (0.05) | AAA | 24.37 | 22.38 | 22.43 | 25.73 |
| | | Acc | 15.22 | 12.00 | 12.87 | 17.30 |
| | Seq-C100 (0.2) | AAA | 20.94 | 19.62 | 18.82 | 21.05 |
| | | Acc | 12.28 | 10.47 | 10.06 | 11.53 |
Table 3: Performance on a Longer Task Sequence (Seq-Tiny-l, N=40)
| Method | AAA | Acc |
|---|---|---|
| SGD | 4.95±0.16 | 1.19±0.27 |
| ER | 13.75±0.12 | 6.82±0.11 |
| DER++ | 9.39±0.07 | 3.76±0.81 |
| CLSER | 14.93±0.36 | 7.74±0.91 |
| POCL | 18.41±0.14 | 10.64±0.33 |
| GEC | 25.37±0.16 | 11.89±0.31 |
- GEC's constraint is on the memory loss with an explicit slack tolerance. This design naturally dampens the impact of high-variance gradients from imbalanced or noisy samples in the buffer. Unlike fixed-weighting methods (e.g., ER), which can be easily misguided, GEC's updates are more stable and principled (Tables 1 & 2).
- The compelling lead in the 40-task scenario (Table 3) highlights GEC's key advantage. By explicitly monitoring and correcting for forgetting, GEC prevents the gradual error accumulation that plagues methods with fixed trade-offs over long horizons.
These results confirm that GEC is also effective in complex, realistic CL settings. We will include these results in the final paper.
W2/Q2: Interaction with BatchNorm (BN)
This is a sharp and important observation. To rigorously isolate GEC's algorithmic contribution from potential BN effects, we conducted two controlled experiments.
- We created a baseline, ER (Dual-Fwd), that mimics GEC's dual forward pass structure, ensuring identical BN update dynamics for a fair comparison.
- We also replaced BatchNorm with stateless GroupNorm (GN) for both ER and GEC, following [6].
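The statefulness issue the reviewer raised can be illustrated with a minimal, framework-free sketch (the toy classes below are hypothetical, not our actual model code): a BN-style layer keeps running statistics that advance on every forward pass, so two passes per step drift them differently than one mixed-batch pass, while a GN-style normalizer has no state at all.

```python
# Minimal sketch of stateful (BN-style) vs. stateless (GN-style)
# normalization dynamics. Hypothetical toy classes, illustration only.

class ToyBatchNorm:
    """Tracks a running mean, updated on every forward pass (BN-style)."""
    def __init__(self, momentum=0.1):
        self.running_mean = 0.0
        self.momentum = momentum

    def forward(self, batch):
        mean = sum(batch) / len(batch)
        # Running statistics advance on *every* forward pass.
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        return [x - mean for x in batch]

def stateless_norm(sample):
    # GroupNorm-style: statistics come from the sample itself; no state.
    mean = sum(sample) / len(sample)
    return [x - mean for x in sample]

current, memory = [1.0, 3.0], [10.0, 12.0]

# ER-style: one forward pass on the mixed batch.
bn_er = ToyBatchNorm()
bn_er.forward(current + memory)

# GEC-style: two forward passes (current batch, then memory batch).
bn_gec = ToyBatchNorm()
bn_gec.forward(current)
bn_gec.forward(memory)

# The running statistics end up different -> different BN dynamics.
assert bn_er.running_mean != bn_gec.running_mean

# A stateless normalizer is unaffected by how batches are split.
assert stateless_norm([1.0, 3.0]) == [-1.0, 1.0]
```

This is exactly why the ER (Dual-Fwd) baseline and the GroupNorm variant together isolate the normalization effect from GEC's algorithmic contribution.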
Table 4: Performance comparison isolating the effect of BN dynamics.
| Dataset | Metric | ER | ER (Dual-Fwd) | GEC |
|---|---|---|---|---|
| Seq-C10 (1k) | AAA | 54.91 | 55.70 | 69.42 |
| | Acc | 42.04 | 44.67 | 59.12 |
| Seq-C100 (5k) | AAA | 20.67 | 20.99 | 42.47 |
| | Acc | 14.87 | 15.59 | 37.12 |
| Seq-TinyNet (5k) | AAA | 16.10 | 16.88 | 31.36 |
| | Acc | 11.89 | 13.11 | 20.25 |
Table 5: Performance comparison with Group Normalization (GN) on Seq-CIFAR100 (|M|=5k).
| Method | AAA | Acc |
|---|---|---|
| ER-GN | 19.85 | 14.42 |
| GEC-GN | 41.03 | 35.88 |
- Table 4 shows that matching BN dynamics (ER (Dual-Fwd)) provides only marginal gains over standard ER, indicating that GEC's substantial performance leap stems from its core gradient-guided constraint mechanism.
- Table 5 further solidifies this. With GroupNorm, GEC still maintains its significant lead, demonstrating its effectiveness is independent of the normalization layer's statefulness.
We hope these experiments have fully addressed your concerns. Your feedback has been invaluable in strengthening our paper, and we will incorporate these results into the appendix.
[1] Dark experience for general continual learning: a strong, simple baseline. NeurIPS 2020
[2] Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System. ICLR 2022
[3] Mitigating Catastrophic Forgetting in Online Continual Learning by Modeling Previous Task Interrelations via Pareto Optimization. ICML 2024
[4] Imbalanced continual learning with partitioning reservoir sampling. ECCV 2020.
[5] Meta Continual Learning Revisited: Implicitly Enhancing Online Hessian Approximation via Variance Reduction. ICLR 2024
[6] Continual normalization: Rethinking batch normalization for online continual learning. ICLR 2022.
Thank you for the clarification. I have no further questions.
Thank you for your reply. We appreciate all your valuable feedback.
This paper proposes an experience replay-based online continual learning method (called Gradient-Guided Epsilon Constraint, GEC) that formulates single-step parameter updates as an ε-constraint optimization problem. I would suggest the authors take the time to incorporate the reviewers' feedback, together with the additional experiments on more challenging benchmarks, into the final version of the manuscript. Congratulations on getting your paper accepted.