PaperHub
Score: 6.3/10
Decision: Rejected · 3 reviewers
Ratings: 4, 3, 3 (min 3, max 4, std. dev. 0.5)
ICML 2025

Temporal-Difference Variational Continual Learning

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-06-18

Abstract

Keywords: continual learning, online variational inference, temporal-difference learning

Reviews and Discussion

Review (Rating: 4)

This paper proposes an n-step generalisation of the classical variational continual learning (VCL) framework, which aims to address the potential variability and compounding approximation error that arise from regularising only the KL-divergence between the current posterior approximation and the immediately preceding one. The authors present equivalent reformulations of the classical VCL objective by decomposing the one-step variational objective into a multi-step objective, leveraging the Bayesian recursion. They also present a TD version of n-step VCL that amplifies the regularisation from more recent posterior approximations. The resulting objectives are evaluated in the Bayesian deep learning setting, and the authors introduce three new, more challenging benchmarks for CL evaluation. The empirical results indicate that n-step and TD-VCL indeed improve CL performance on various benchmarks.
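For orientation, such a multi-step objective plausibly takes a form like the following (an illustrative sketch, not the paper's exact equation; the n-step objective may additionally include replayed likelihood terms for past datasets):

$$\mathcal{L}_n(\theta) = \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D}_t \mid \theta)\big] - \sum_{k=1}^{n} \beta_k\, \mathscr{D}_{KL}\big(q(\theta) \,\|\, q_{t-k}(\theta)\big)$$

where, in the TD variant, the weights $\beta_k$ decay for older posteriors.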

Questions for the Authors

See above.

Claims and Evidence

The claims are largely correct, though some statements are incorrect or unclear. Please find below some comments and questions.

  • The authors claim "Maximizing the objective in Equation 3 is equivalent to the optimization in Equation 2". This statement is incorrect: maximising the objective in Equation 3 is equivalent to maximising a lower bound of the objective presented in Equation 2. Moreover, this is not due to the approximation error in estimating the log-likelihood terms or the KL terms, but due to the explicit derivation of the lower bound via Jensen's inequality.
  • Posterior distributions under the classical VCL framework contain (implicit) deviation constraints from a sequence of past estimations, though perhaps not as explicit as in the TD-VCL proposed in the paper. However, from this perspective, the TD version of the posterior approximation deviates further from the true posterior, and this could easily be verified through continual learning on simple graphical models, such as a variational GMM. I am curious to find out whether the VCL posterior indeed "compounds approximation errors and deviates further from the true posterior", more so than TD-VCL.
  • Consider the broader CL setting in which tasks switch quickly; the aggregation of past posterior approximations might then correspond to some made-up task representation that matches none of the preceding tasks, and hence does not help resolve catastrophic forgetting. How would the authors propose to address this problem?
  • I agree posterior approximation error compounds with successive recursive updates, but this is true for both classical VCL and TD-VCL.
  • I am confused by the motivation behind TD-VCL. Intuitively, one would imagine that tasks presented earlier in time should lead to stronger catastrophic forgetting, hence requiring harsher constraining, whereas the motivation behind TD-VCL is quite the opposite. The recency effect should be implicitly preserved by standard training dynamics. I am curious to see what would happen if the weighting schedule were reversed.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I have checked all derivations in the appendix, and the proofs are easy to follow and error-free.

Experimental Design and Analysis

I checked the experimental details in the appendix and the design of the new CL benchmarks; I did not identify any issues with the implementation.

Supplementary Material

I did review the appendix, but I did not go through the code implementation.

Relation to Existing Literature

The paper centers around continual learning, which is an important and actively studied field of machine learning.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths of the paper:

  • The paper is clearly written, both the methods and the experiment sections are easy to follow.
  • Empirical evaluation indeed shows that the proposed models outperform competing baselines under the CL setup.
  • The three benchmarks introduced in the paper are potentially valuable to the field.

See weaknesses and questions in "Claims And Evidence".

Other Comments or Suggestions

See above.

Author Response

Thank you for the review! We appreciate that you found our work well-written and clear, the empirical results supportive, and our introduced benchmarks a potentially valuable contribution. You raised great questions, which we address below:

Q1 Are Eqs 2 and 3 equivalent? Is Eq 3 a lower bound of Eq 2? What is the source of approximation errors?

A1 We argue that Eq 3 is equivalent to maximizing a lower bound of the marginal likelihood (evidence), not a lower bound of Eq 2, and that the optimization w.r.t. $\theta$ for both objectives is equivalent. We show this with an ELBO derivation:

$$\underbrace{\mathscr{D}_{KL}\Big(q(\theta) \,\Big\|\, \tfrac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta)\Big)}_{L_2(\theta)} = \mathbb{E}_{q(\theta)}\Big[\log \tfrac{q(\theta)}{q_{t-1}(\theta)} - \log p(\mathcal{D}_t \mid \theta) + \log Z_t\Big] = \underbrace{\mathscr{D}_{KL}\big(q(\theta) \,\|\, q_{t-1}(\theta)\big) - \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D}_t \mid \theta)\big]}_{-L_3(\theta)} + \log Z_t$$

where $L_2$ and $L_3$ are the terms in Eqs. 2 and 3. Note that minimizing $L_2$ is equivalent to maximizing $L_3$, as $Z_t$ is constant w.r.t. the optimization parameters. Since $L_2 \geq 0$ (it is a KL divergence), $L_3 \leq \log Z_t$.

Eq 3 is effectively a lower bound on the evidence. Indeed, the approximation error is due to the lower bound, but the gap is exactly quantified by the KL term in Eq 2. We identify two error sources: computing the objective itself (due to sampling errors or biases from previous estimates) and the inherent choice of the variational family (which introduces biases for the subsequent steps). We note that deriving the ELBO with Jensen's inequality is equivalent, but the presented derivation directly quantifies the lower-bound gap.
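For concreteness, a minimal sketch of how the one-step objective $L_3$ above could be computed, assuming diagonal Gaussian variational posteriors; all names here are illustrative, not the authors' implementation:

```python
# Hedged sketch of the one-step VCL objective L3 (to be maximized), under a
# diagonal Gaussian mean-field assumption. log_lik_fn is a hypothetical
# callable returning log p(D_t | theta) for one parameter sample.
import torch
import torch.distributions as dist

def one_step_vcl_objective(q_mu, q_logvar, prev_mu, prev_logvar, log_lik_fn,
                           n_samples=8):
    """L3(theta) = E_q[log p(D_t | theta)] - KL(q || q_{t-1})."""
    q = dist.Normal(q_mu, (0.5 * q_logvar).exp())
    q_prev = dist.Normal(prev_mu, (0.5 * prev_logvar).exp())  # frozen q_{t-1}
    # Monte Carlo estimate of the expected log-likelihood: this is one of the
    # two error sources named above (sampling error).
    samples = q.rsample((n_samples,))
    exp_log_lik = torch.stack([log_lik_fn(s) for s in samples]).mean()
    # Closed-form KL between factorized Gaussians; log Z_t is dropped since it
    # is constant w.r.t. the variational parameters.
    kl = dist.kl_divergence(q, q_prev).sum()
    return exp_log_lik - kl
```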

Q2 Implicit (VCL) vs. explicit (TD-VCL) regularization toward previous posteriors and its effect on error compounding (Points 2 and 4)

A2 Indeed, both VCL and TD-VCL optimize the same KL divergence in Eq 2. The difference lies in how the approximation errors across successive steps interact: we refer to our geometric interpretation in Figure 1. If the errors do not follow a particular pattern in the parameter space, we argue that explicitly regularizing toward previous posteriors exerts a corrective influence ("canceling out" errors). Naturally, if all posterior estimates exhibit similar error patterns (intermediate errors in the same "direction"), then TD-VCL and VCL would behave and compound errors equivalently. We argue this requires adversarially picking a particular variational family and optimization algorithm. In practice, however, we observe that there is indeed a corrective influence, and the TD-VCL objectives work better, as shown by the presented experimental validation.

Q3 Setup where tasks change quickly

A3 We assume "tasks changing quickly" indicates few data points per task. This "low-data" regime should be handled by the Bayesian framework itself – the sequence of posteriors would be closer to the initial prior, regardless of the variational objective adopted. The prior encodes all knowledge we have a priori about the tasks. It is better to be closer to the prior, as overfitting to a particular task (as MLE would do) can be very detrimental for other tasks, harming plasticity. Thus, the posteriors might not lead to good downstream performance due to the lack of data, but they remain useful by preventing plasticity loss. The Bayesian model also expresses epistemic uncertainty, which gives another layer of interpretability to the predictions and may be leveraged to prevent a very uncertain model from making bad decisions in these scenarios.

Q4 Intuitively, tasks presented earlier should lead to stronger catastrophic forgetting. Then why does TD-VCL prioritize more recent posteriors?

A4 Your intuition is often correct and supported by the findings. Yet, constraining the current posterior to a past posterior $q_t$ goes beyond constraining to the knowledge about task $t$: it comprises the whole history of preceding tasks $t, t-1, \ldots$. The recursive property in Eq 1 allows information from all past tasks to flow into the posterior estimation. Constraining harder to an older posterior might help the corresponding task and previous ones, but it also disregards subsequent ones. Older posteriors are not "aware" of newer tasks, while recent ones are conditioned on a longer history. Thus, it makes sense to give more weight to recent estimations.

TD-VCL actually constrains toward many posteriors that are conditioned on an older task. At timestep $t+1$, Task $t$ is only accounted for in posterior $q_t$, while Task 0 is accounted for in all posteriors. This leads to stronger constraining on an older task without requiring stronger constraining toward the posterior of the corresponding timestep. This is a nice property that VCL does not enforce explicitly.
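To make the weighting intuition concrete, here is a hedged sketch of a regularizer over several frozen past posteriors, with geometrically decaying weights favoring recent estimates; the weighting scheme is an assumed illustration, not the paper's exact TD(λ)-VCL coefficients:

```python
# Illustrative multi-posterior regularizer. The geometric weights below are an
# assumption for illustration; the actual TD(lambda)-VCL coefficients are
# derived in the paper.
import torch
import torch.distributions as dist

def multi_posterior_regularizer(q, past_posteriors, lam=0.7):
    """past_posteriors: [q_{t-1}, q_{t-2}, ..., q_{t-n}], most recent first.

    Past posteriors are fixed distributions, so no gradient flows through
    them ("no backpropagation through time").
    """
    raw = torch.tensor([lam ** k for k in range(len(past_posteriors))])
    weights = raw / raw.sum()  # normalized; recent posteriors weigh more
    reg = torch.zeros(())
    for w, q_past in zip(weights, past_posteriors):
        reg = reg + w * dist.kl_divergence(q, q_past).sum()
    return reg  # subtracted from the expected log-likelihood term
```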

Review (Rating: 3)

The paper introduces a new variant of variational continual learning that integrates ideas from temporal-difference (TD) methods to mitigate error accumulation across tasks. Instead of regularizing solely against the immediately preceding posterior as in standard VCL, the proposed method uses multiple past posterior estimates (through n-step and TD(λ) formulations) to better balance plasticity and stability in a sequential, recursive update framework.

Questions for the Authors

  • How do you justify the omission of normalization constants in a sequential setting, and can you provide bounds on the potential bias introduced?
  • Can you quantify the approximation error introduced by the Gaussian mean-field assumption in your recursive framework?
  • What are the precise regularity conditions (e.g., differentiability, smoothness) required for your derivations, and how robust is your method if these conditions are violated?
  • Is it possible to extend the convergence properties or derive error bounds from classical TD learning to your variational objective, thereby providing a more rigorous theoretical foundation?

Claims and Evidence

The authors claim that their approach reduces catastrophic forgetting by addressing the compounding of approximation errors inherent in recursive updates. They support these claims with detailed theoretical derivations and experiments that demonstrate improved performance over standard VCL on MNIST-based benchmarks. However, the experimental evidence is limited in scope, mostly relying on relatively simple datasets, and some reported accuracy figures are lower than expected compared to prior work.

Methods and Evaluation Criteria

Methodologically, the paper extends the VCL framework by incorporating multiple past posteriors into the variational objective and drawing an analogy to TD learning. The evaluation criteria focus on improvements in average accuracy and the ability to mitigate forgetting on benchmarks such as PermutedMNIST and SplitMNIST. Nonetheless, the evaluation does not include more challenging or modern datasets, limiting the broader impact of the empirical results.

Theoretical Claims

On the theory side, the paper presents a series of derivations that reframe the VCL objective as a discounted sum of n-step TD targets. While the derivations are mathematically detailed, there are several concerns. First, the derivations omit normalization constants (e.g., Zₜ) by assuming they are independent of the variational parameters; this could lead to biases in the sequential updates if these constants actually have any parameter dependence. Second, the use of a Gaussian mean-field approximation is not accompanied by any quantification of the approximation error, leaving open questions about its impact on the overall theoretical guarantees. Third, key derivations rely on L’Hôpital’s rule without explicitly stating the required regularity conditions such as strict differentiability and smoothness; the paper does not discuss scenarios in which these conditions might fail. Finally, while the authors draw an analogy to TD learning, the variational objective is not exactly a Bellman equation, and there is no rigorous derivation that extends known convergence properties or error bounds from classical TD learning to this framework.

Experimental Design and Analysis

The experimental design primarily tests the proposed method on MNIST variants (e.g., PermutedMNIST and SplitMNIST), which are increasingly seen as toy problems. The analysis shows some improvement over baseline VCL methods, yet the overall accuracy levels are unexpectedly low when compared with previous VCL literature. This limitation raises concerns about both the practical viability of the approach and its scalability to more complex, real-world scenarios.

Supplementary Material

The supplementary material is extensive and includes detailed proofs, hyperparameter settings, and ablation studies.

Relation to Existing Literature

The paper builds upon established Bayesian continual learning literature, particularly VCL and its variants. However, it does not engage sufficiently with more recent advances in continual learning, including methods that use replay buffers or alternative regularization strategies. This gap limits the paper’s ability to position its contributions within the rapidly evolving landscape of continual learning research.

The authors demonstrate a strong command of the Bayesian continual learning literature, with detailed theoretical derivations and an extensive list of references. However, the literature review could be expanded to include more recent empirical methods that have set higher benchmarks in continual learning.

Essential References Not Discussed

There is a notable absence of discussion regarding recent methods that tackle catastrophic forgetting using stronger empirical benchmarks and more scalable architectures. References to works employing replay-based methods or other modern continual learning strategies would help situate the contribution more effectively.

Other Strengths and Weaknesses

Strengths include the innovative idea of combining TD learning concepts with variational continual learning, a rigorous set of derivations, and thorough supplementary material. Weaknesses include the reliance on simplifying assumptions (such as IID tasks and neglect of normalization constants), the lack of quantified error bounds for the mean-field approximation, missing explicit regularity conditions for the derivations, and an experimental evaluation that is too narrow in scope to convince that the method scales well beyond toy datasets.

Other Comments or Suggestions

The paper would benefit greatly from a more rigorous theoretical discussion on the role and impact of the hyperparameters n and λ, including formal error bounds and convergence analyses that draw on or extend TD learning theory. Additionally, expanding the experimental evaluation to include more challenging benchmarks (e.g., CIFAR100, TinyImageNet) would help validate the method’s practical relevance.

Author Response

Thank you for your review! We appreciate the recognition of our theoretical derivations, extensive hyperparameter settings, and ablation studies. We're also grateful that you found our work innovative in combining TD learning ideas with Variational Continual Learning and demonstrating a strong grasp of Bayesian CL literature. We aim to address your concerns below:

Q1 The work should expand the experimental evaluation to include more challenging benchmarks (e.g., CIFAR100, TinyImageNet); current experimental evidence is limited to PermutedMNIST/SplitMNIST and reports lower accuracy than prior work.

A1 We highlight that our current paper does present an experimental evaluation on CIFAR100 and TinyImageNet. We refer to Tables 2/3 and Appendix L for the results. TD-VCL attained superior performance against other methods, as discussed in Section 5.1. As the reviewer stated, these are challenging benchmarks and should provide good experimental evidence to support our claims.

We also clarify that our experimental evidence goes beyond Permuted/SplitMNIST. In fact, we improve upon these benchmarks, introducing novel, more challenging versions that impose memory and architectural restrictions (namely Permuted/Split/SplitNotMNIST-Hard). As highlighted by reviewer c7SD, this is a potentially valuable contribution to the field. Our work actually reports higher accuracy than prior methods in all considered benchmarks, as shown in Tables 1 and 3.

Q2 The derivations assume that the normalization constant is independent of the variational parameters, which could lead to biases.

A2 By definition, the normalization constant (evidence or marginal likelihood) is independent of the parameter distribution, as the parameters are marginalized out: $Z = p(\mathcal{D}_t) = \int_{\theta} p(\mathcal{D}_t \mid \theta)\, p(\theta)\, d\theta$. This is a crucial aspect of Variational Inference (VI) and of deriving an evidence lower bound for tractable objectives, including variational CL. As also explicitly stated in the VCL paper (Sec. 2.1) [1], "$Z_t$ is the intractable normalizing constant of $p^{*}_{t}$ and is not required to compute the optimum". Assuming a proper choice of variational distribution and optimization procedure, the learned posterior may theoretically achieve zero KL divergence w.r.t. the true posterior, in both the VCL and TD-VCL variants.

One important clarification is that, at timestep $t$, the optimization is w.r.t. the variational distribution $q_t$, which does not influence $q_{t-1}$. This "prior" $q_{t-1}$ (and any previous posterior estimate used) is a fixed distribution for this optimization. There is no backpropagation through time.

Q3 Quantifying the Gaussian mean-field (MF) approximation error.

A3 The Gaussian MF approximation is standard in VCL objectives [1-3] and widely used in VI for its tractability and convenience. Assuming no optimization error, the approximation error from the choice of distribution family is quantified by the KL divergence between the learned variational distribution and the true posterior. Statistically quantifying this divergence in VI remains an active research area [4], often requiring simplifying assumptions about the true posterior. While valuable, this is beyond our scope; we focus on the algorithmic work of deriving a new tractable optimization objective for variational CL and empirically demonstrating its improved posterior approximation in downstream predictive tasks. Finally, prior work shows that for deep networks (our case), the MF approximation is not too restrictive, and the bigger the model, the easier it is to be approximately Bayesian [5].

Q4 L’Hôpital’s rule and necessary regularity conditions.

A4 Our only use of L'Hôpital's rule is in Appendix E. The "key" derivations of our proposed objective are in Appendices A/B/D, which do not use it. In Appendix E, we apply the rule to two functions of the form $f(x)/g(x)$, where the numerator and denominator are differentiable and $g'(x) \neq 0$ on the considered interval $(0, 1)$. The limits in the original form lead to the indeterminate form $0/0$. Lastly, the limit of $f'(x)/g'(x)$ exists. Thus, we may apply the rule to both considered functions.
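As an illustration of the kind of limit involved (the actual functions are in Appendix E; the normalized geometric weight below is only an assumed example, not the paper's):

$$w_k(\lambda) = \frac{(1-\lambda)\,\lambda^{k}}{1-\lambda^{n}} \;\xrightarrow{\lambda \to 1^{-}}\; \frac{0}{0}, \qquad \lim_{\lambda \to 1^{-}} \frac{(1-\lambda)\,k\,\lambda^{k-1} - \lambda^{k}}{-n\,\lambda^{n-1}} = \frac{1}{n}$$

Here both numerator and denominator are differentiable, $g'(\lambda) = -n\lambda^{n-1} \neq 0$ on $(0, 1)$, and the limit of the derivative ratio exists, matching the conditions above; each of the $n$ weights tends to the uniform value $1/n$.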

Q5 Is it possible to extend the convergence properties or derive error bounds from TD learning to TD-VCL?

A5 As stated in Sec. 6, we do not claim that the TD-VCL objective constitutes an RL algorithm or is equivalent to the Bellman equation. Yet, we believe that the connections presented in the work inspire extensions in this direction. We hypothesize that it is possible to formally define a TD-VCL operator, analogous to the Bellman operator, and investigate contraction properties that motivate further work on statistical guarantees for posterior evaluation/improvement. Nonetheless, as stated in the paper, we leave this as future work.

References are in Reviewer zpmQ's response.

Reviewer Comment

  1. Thank you for highlighting that! It is really helpful that you demonstrate on CIFAR100 and TinyImageNet. It's still a bit unclear how these results compare to the latest replay-based or class-incremental baselines outside of VCL methods. The memory and architectural restrictions are interesting, but it would have been further interesting to see results under less restricted memory constraints as well. At what point do other methods "catch up"?
  2. Acknowledged, but it would be good to put a little heads-up about it and make it more clear.
  3. Acknowledged, and thank you for the clarification! I would have loved some error bounds here though, even under a simplified setup.
  4. Sounds good!
  5. I appreciate that you see a potential avenue for future work in defining a TD-VCL operator and investigating contraction properties. From a theoretical perspective, it would still be valuable to offer even a partial or simplified analysis showing why these properties might hold under certain assumptions. That would help readers understand the scope of the analogy more concretely. It would make your paper a lot more theoretically concrete and interesting.
Author Comment

Dear reviewer,

Thank you for acknowledging our rebuttal, updating your score accordingly, and providing a further reply! We hope we have addressed most of your concerns. We would like to provide a quick follow-up on your last comment as an acknowledgment from our side.

Re: Point 2 - We will add this clarification in Section 3 - thank you for your suggestion.

Re: Point 5 - Thank you for the suggestion! For now, we refer to Appendix C where we explain some connections between the TD-Targets in TD-VCL and RL and provide a more theoretical clarity in the analogy. We hope to share further results in this direction in future work!

Re: Point 1 - We understand your point about baselines, and we refer to our response A2 to reviewer zpmQ, where we discuss this.

Regarding the memory constraints and architecture, we refer to Appendix H, where we provide empirical evidence for these design choices in the benchmarks, which provides some of the evidence you request. Specifically, Figure 4 ablates the memory constraint for an Online MLE baseline on PermutedMNIST. With no constraint (T=10, B=60000), this simple baseline achieves an accuracy of 96.3%, which is roughly as good as what has been reported by prior variational CL methods [1, 2, 3]. This shows a level of saturation of these traditional benchmarks, motivating the design of new ones.

In terms of the architectural constraints, we refer to Figures 5 and 6, where we show the results on SplitMNIST (without architectural constraint) and SplitMNIST-Hard (with constraints), respectively. SplitMNIST also shows strong signs of saturation, while SplitMNIST-Hard presents a reasonable challenge and better contrasts prior methods.

Review (Rating: 3)

The current work tackles continual learning, suggesting a new Bayesian CL approach. The paper proposes a rewriting of the standard variational continual learning objective that considers a number of past posterior approximations. The authors hypothesize that explicit regularisation using previous posterior estimations prevents error compounding. Furthermore, the authors transform the objective further by introducing a geometric decay of the regularisation effect from past posteriors, drawing a parallel with lambda-returns in TD learning.

The contributions are two-fold: the formal derivation of a family of training objectives, TD(λ)-VCL, and the empirical validation of the benefits of considering multiple posterior approximations as regularisers.

Questions for the Authors

  1. Could you comment on how reliant this method is on knowing the task boundaries?
  2. Which of the n posteriors are used when sampling an example from dataset t-k from the replay buffer?

Claims and Evidence

The central claim is that rewriting the objective so that it includes KL terms between the learned variational distribution and the $n$ previous approximations improves on standard variational continual learning. The experiments on the proposed benchmarks demonstrate both better average performance across all learned tasks and alleviated catastrophic forgetting.

Methods and Evaluation Criteria

The proposed benchmarks and the associated particularities (replay buffer size and single head restrictions) make sense and are detailed and justified in Annex H. The various ablations also make a convincing support for the central claims of the paper.

Although the benchmarks make sense, adopting previous protocols would have helped in understanding where this method places in comparison to a larger set of families of continual learning algorithms.

Theoretical Claims

The theoretical "claim" consists in the equivalence between the "standard" variational continual learning objective and the derived N-Step TD-VCL and TD(λ)-VCL objectives. I did check the proofs in the annexes and I found those to be correct.

Second, the connection between temporal-difference objectives in reinforcement learning and the proposed variational learning cost is supported by adopting the MDP formalism. I am not sure how useful this parallel is, or if calling the objective a temporal difference helps understanding it, but the formal overlap proposed in Annex C makes sense.

Experimental Design and Analysis

Experimental design seems correct to me.

Supplementary Material

I read the supplementary material. The proofs, ablations, and experiment/benchmark details all seem clear.

Relation to Existing Literature

The paper connects with the standard variational continual learning approach, and two variants which are also used as baselines. The "Related Work" section discusses the broader continual learning space, placing the current work "between regularization-based and replay-based methods".

Essential References Not Discussed

Nothing major missing given that the scope of the paper is to improve variational continual learning algorithms. Given the goal of the paper, the literature review seems fair.

Other Strengths and Weaknesses

The paper is clearly written, the claims are clear and supported by proofs and experiments.

What would make the paper stronger is comparing with non-bayesian continual learning methods to understand how this method compares with other families of algorithms or SOTA. Also, reusing benchmarks from previous papers (to the extent it makes sense given the reliance on a replay buffer) would help make such a comparison.

Most recent works on continual learning try to look at more complex metrics rather than just evaluating catastrophic forgetting (e.g. forward transfer, backward transfer, plasticity metrics, etc.). The paper would be stronger with a more complex analysis in that sense.

Other Comments or Suggestions

  1. Abstract (line 026): "integrate" -> "integrates"
  2. For completeness, define $q_1$ (or $q_0$) in equations 2 & 3.
Author Response

Thank you for your review! We appreciate that you recognized our contributions (in formalism and empirical validation), found our ablations convincing, and found our proposed benchmarks detailed and justified. We aim to comment on and clarify some of the raised points below:

Q1 Adoption of previous protocols to compare with other CL methods

A1 We understand the value of previous protocols for establishing a more direct comparison. Nonetheless, we found the setups commonly used in prior Bayesian CL works not very challenging, since they do not impose memory and architecture restrictions and therefore do not provide a proper setup for evaluating catastrophic forgetting. Still, the remaining configurations are equivalent, including for the harder benchmarks like CIFAR100-10/TinyImageNet-10. Furthermore, we make sure to adopt strong Bayesian CL baselines [1-3], controlling several aspects of the training and tuning procedures to be fair among the methods, as detailed in Appendix F. As our goal is to advance continual learning in the Bayesian framework, we believe our protocol is reasonable and supportive of our claims.

Q2 Comparing with other families of Continual Learning Methods

A2 We agree that providing a more exhaustive set of CL baselines would provide a better perspective on the current CL research landscape. However, CL research spans several directions, adopting different assumptions and desiderata. We opted not to broaden the scope too much. Otherwise, it would be really hard to control the experiments and perform fair comparisons. Alternatively, our work keeps baselines consistent in these terms, which allows us to make direct claims about the impact of the proposed objective. Also, most methods explore orthogonal design choices (e.g., architecture, memory, regularization). Given the flexibility of our objective, it can be directly combined with them, as illustrated in Table 3.

Lastly, as the reviewer stated, our work has a particular goal of advancing Bayesian methods for CL. We highlight that the Bayesian framework follows a principled approach that allows the development of uncertainty-aware models, which is crucial for robust, safe Machine Learning. These are capabilities that most other methods do not provide, even if they present better predictive performance in some scenarios.

Q3 Presenting other metrics for CL.

A3 We agree this would provide a more complex analysis of the algorithms. We opted to follow the standard metrics in the Bayesian CL literature [1-3], which allows us to evaluate downstream performance and catastrophic forgetting directly in order to support our main claims. Nonetheless, incorporating additional, more granular metrics is an interesting direction for future work, and we appreciate the reviewer's recommendation.

Q4 Why is the connection between TD methods and VCL useful?

A4 We argue that this connection allows us to view the variational CL problem setting through the lens of bootstrapping/credit assignment. This opens several avenues to leverage TD methods developed in RL research, framing the posterior search as a structured problem of value estimation. Furthermore, it allows us to potentially extend the theoretical analysis of variational CL methods with tools from RL theory. We believe our work is just a starting point that identifies an interesting intersection of both areas with encouraging experimental results.

Q5 How reliant is the method on knowing the task boundaries?

A5 The problem setting adopted in this work (and in the considered baselines) assumes the tasks are provided with clear boundaries. Still, in a broader case with unknown boundaries, we may have different tasks mixed in the same timestep. For the optimization objective, we believe the method should still perform well as long as likelihood estimation is feasible. Ultimately, it would be a multi-task learning situation. The only potential concern we anticipate is a negative transfer effect among tasks in the same timestep.

Q6 Which of the $n$ posteriors are used when sampling an example from dataset $t-k$ from the replay buffer?

A6 At timestep $t$, all predictions are performed with the current posterior $q_t$. Past posteriors are frozen and only used for regularization.

Q7 Other comments/suggestions

A7 Thank you for highlighting them; we have incorporated the fixes in our draft.

Rebuttal References

[1] Nguyen et al. Variational Continual Learning. ICLR, 2018.

[2] Ahn et al. Uncertainty-based Continual Learning with Adaptive Regularization. NeurIPS, 2019.

[3] Ebrahimi et al. Uncertainty-guided Continual Learning with Bayesian Neural Networks. ICLR, 2020.

[4] Katsevich et al. On the Approximation Accuracy of Gaussian Variational Inference. The Annals of Statistics, 2023.

[5] Farquhar et al. Liberty or Depth: Deep Bayesian Neural Nets Do Not Need Complex Weight Posterior Approximations. NeurIPS, 2020.

Reviewer Comment

Thank you for replying to all the issues raised in the review. I think this work should be accepted, although a few aspects (raised by myself and reviewer 2qeN) still separate the current work from a very strong submission. I will keep my "weak accept" recommendation.

Final Decision

The submission reframes the variational continual learning objective to use all historical approximate posteriors and datasets, and draws connections to experience replay during training and temporal-difference learning in RL (arguably, this connection does not bring new insights to either domain). The main baselines in the experiments are VCL and VCL with coresets. Unfortunately, the paper did not include more recent Bayesian continual learning numbers (e.g., Laplace or function-space regularisation), nor non-Bayesian methods, so the significance of the proposed method is not clear. The AC read through all the reviews and discussions, and the paper, and suggests rejection.

A minor comment: the numbers for VCL in the permuted MNIST benchmark seem much lower than those in previous VCL papers (Nguyen et al, Swaroop et al, and Loo et al).