PaperHub
Overall rating: 5.0 / 10 · Poster · 4 reviewers (min 4, max 7, std 1.2)
Individual ratings: 5, 7, 4, 4
Confidence: 3.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

Non-Stationary Learning of Neural Networks with Automatic Soft Parameter Reset

Submitted: 2024-05-16 · Updated: 2025-01-15
TL;DR

Learning to reset neural network parameters using an Ornstein-Uhlenbeck process to learn on non-stationary data

Abstract

Keywords
Non-stationarity · plasticity loss · online learning · deep learning

Reviews and Discussion

Review
Rating: 5

The paper focuses on non-stationary learning of neural networks and proposes a method that automatically adjusts a parameter influencing the stochastic gradient descent (SGD) update to account for non-stationarity.

Strengths

The problem tackled is relevant and the related work is clearly discussed. The method is well-interpreted and relevant for the community. The paper is also very well presented.

Weaknesses

In my view, the paper has some weaknesses, especially regarding the experimental validation:

  • The experimental validation does not clearly validate the benefits of the method, possibly due to lack of clarity. In Figure 3, it is not clear to me what method this paper proposes. I believe the paper should propose a method and validate it, instead of comparing all its possible variants. Nevertheless, I can see that most (if not all) of such variants are outperforming the chosen baselines. However, only two of the baselines referred to in the related work were chosen. The chosen benchmark problems are also not clearly motivated to me.
  • The reinforcement learning part is not clear to me. Is there an explanation for why, on-policy, the method is outperformed by the simple baseline? Is not on-policy the most non-stationary setting? Why? And why does the proposed method stop earlier than the competitors in the top-right figure?
  • It is not clear to me what the authors intend to show with Figure 2, and the perfect soft-resets regime. It would be nice to have more explanation here, and possibly move this validation to the end of the experimental section, as appearing before the main validation of the method is confusing.

Finally, even though the method is well motivated, it is not theoretically analyzed. We have no theoretical proof that the method will outperform traditional SGD in terms of efficiency, of preventing plasticity loss and catastrophic forgetting, or along any other dimension.

Minor: on page 6, Equation (15), what is the parameter $\tilde{\lambda}$? Was it introduced before?

Questions

I have the following questions:

  • Could the authors make some clarifications with respect to the experimental validation? Specifically, on the choice of baselines and problems, and on which method or methods the authors in fact propose to the community.
  • Can the authors also clarify my concern regarding the reinforcement learning setting and the performance in on-policy settings?
  • Can the authors clarify what is the intended take-away from the analysis in Figure 2?
  • Can the authors clarify the Monte-Carlo samples in Equation (7)? Specifically, what is sampled, for instance in the reinforcement learning setting, and if the sampling requires access to a free simulator of the environment.

Limitations

Yes.

Author Response

We thank Reviewer DfC5 for their feedback. Please find our response below.

Figure 3, hard to read

Thanks to your feedback, we will modify the way we present the results, using the extended page limit for the camera-ready version. First, we will present the Soft Reset method and compare it to the baselines. Second, we will present the other variants of the method and compare them to the main method.

baselines and benchmarks motivation

The benchmarks are motivated by the plasticity loss literature [1-4]. In this setting, neural networks lose the ability to learn (lose plasticity) when exposed to certain forms of non-stationarity. The random-label MNIST and CIFAR10 benchmarks are the most common ones. Moreover, in this setting, resetting is highly beneficial (see [1]).

We used 3 baselines: Shrink&Perturb [5], L2 Init [1], and Hard Resets. Shrink&Perturb [5] is used to prevent plasticity loss by perturbing parameters according to $\theta_{t+1}=\lambda\theta_t+\sigma\epsilon$ where $\epsilon\sim\mathcal{N}(0, I)$. The L2 Init approach adds $||\theta - \theta_0||^2$ to the loss and was shown to outperform many other approaches in preventing plasticity loss. Both baselines are highly related to our drift model and are discussed in Related Work. We hope this clarifies the motivations.
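For concreteness, here is a minimal sketch of the two baseline update rules described above (the function names, hyperparameter values, and NumPy implementation are ours for illustration, not the authors' code):

```python
import numpy as np

def shrink_and_perturb(theta, lam=0.9999, sigma=1e-3, rng=np.random.default_rng()):
    """Shrink&Perturb step: theta <- lam * theta + sigma * eps, with eps ~ N(0, I)."""
    return lam * theta + sigma * rng.standard_normal(theta.shape)

def l2_init_grad(theta, theta0, grad_loss, reg=1e-2):
    """Gradient used by L2 Init: the task-loss gradient plus the gradient of reg * ||theta - theta0||^2."""
    return grad_loss + 2.0 * reg * (theta - theta0)
```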

RL is not clear...

Please see our response to Reviewer Hmj6.

Why, in the on-policy setting, does the simple method perform better than ours, given that it is the more non-stationary setting?

In RL, the non-stationarity arises from a changing input data distribution (since we change the policy) as well as from changing learning targets (when learning value functions). In off-policy RL, we expect a high impact of the changing-learning-target non-stationarity. It was observed [4,6-7] that in this setting networks can exhibit loss of plasticity. The effect becomes more pronounced as we increase the replay ratio (the number of updates per collected data point). Under a high replay ratio, since we do many updates on the replay buffer, we can overfit to it, and when we start collecting new data the local minima of the NN can shift drastically. This is the regime which motivated our method (see Figure 1 and the beginning of Section 3). In this setting, hard resets [4] are effective, but our method proves even more effective.

When we operate in the on-policy setting, the situation described above is less likely to happen. If the data distribution changes quickly, then we are less likely to overfit to it due to the noise in the update. Using the language of Figure 1, the uncertainty of the Bayesian posterior will not be shrinking if the data changes quickly. This, however, depends on the environment. The Hopper environment (Figure 4, bottom) is a relatively simple task to solve, meaning that the agent will progress fast and the corresponding input data distribution will change fast. The Humanoid environment (Figure 4, top) is much more challenging, and a simple agent might struggle initially. This implies that even an on-policy algorithm can see a lot of similar data and start to overfit to it, putting it closer to the off-policy setting. This explains why our method is more effective in the Humanoid environment than in Hopper when the replay ratio is 1.

why stops earlier?

The experiment was not fully finished at the time of submission. See attached pdf for final version of the Figure.

Figure 2 ?

In Figure 2, we demonstrate that the drift model (eq. 4) coupled with the update rule (eq. 16) is a good strategy for resetting parameters when task boundaries are known. For appropriately chosen $\gamma \neq 0$, we can achieve significantly better performance than a hard reset. In Figure 2, left, we use a constant learning rate $\alpha_t$, whereas in Figure 2, right, we use $\alpha_t(\gamma_t)$ from eq. 17, which is more beneficial. We will add more clarification in the text about the nature of this experiment.

Theoretical analysis

Unfortunately, the phenomenon of plasticity loss in neural networks has not been theoretically characterized (there is no established model for it); it is an empirical phenomenon [1-4]. Such a theoretical framework for plasticity loss would first need to be developed in order to analyze our method. Our future plans involve analyzing the method in the context of online convex optimization.

MC samples in eq.7?

The MC samples do not come from a simulator but from the Gaussian distribution (eq. 5) induced by the drift model (eq. 4). They are needed to approximate the integral in eq. 7 and correspond to samples of the NN parameters.
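As an illustration only (the function name, the loss handle, and the exact parameterization of the Gaussian are our assumptions, not the paper's exact eq. 7), the estimate amounts to drawing parameter samples from the drift-induced Gaussian and averaging the data log-likelihood:

```python
import numpy as np

def mc_objective(mu_t, sigma_t, gamma, mu0, sigma0, log_lik, n_samples=8,
                 rng=np.random.default_rng()):
    """Monte-Carlo estimate of E_theta[log p(y_{t+1} | x_{t+1}, theta)], where
    theta ~ N(gamma * mu_t + (1 - gamma) * mu0,
              gamma^2 * sigma_t^2 + (1 - gamma^2) * sigma0^2)."""
    mean = gamma * mu_t + (1.0 - gamma) * mu0
    std = np.sqrt(gamma**2 * sigma_t**2 + (1.0 - gamma**2) * sigma0**2)
    thetas = mean + std * rng.standard_normal((n_samples,) + np.shape(mu_t))
    return np.mean([log_lik(theta) for theta in thetas])
```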

Overall, we hope our answer clarifies the points which you raised during your review. If we have addressed your questions, we would be grateful if you would consider increasing your score. We would be happy to answer any further questions you might have.

References:

[1] Maintaining Plasticity in Continual Learning via Regenerative Regularization, Saurabh Kumar, Henrik Marklund, Benjamin Van Roy, 2023

[2] Understanding plasticity in neural networks, Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, Will Dabney, 2023

[3] Disentangling the Causes of Plasticity Loss in Neural Networks, Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, Will Dabney, 2024

[4] Understanding and Preventing Capacity Loss in Reinforcement Learning, Clare Lyle, Mark Rowland, Will Dabney, 2022

[5] On Warm-Starting Neural Network Training, Jordan T. Ash, Ryan P. Adams, 2020

[6] The Primacy Bias in Deep Reinforcement Learning, Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, Aaron Courville, 2022

[7] The Dormant Neuron Phenomenon in Deep Reinforcement Learning, Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, Utku Evci, 2023

Comment

I thank the authors for the response.

I had some points clarified. However, I am inclined to maintain my score, since the experimental validation is somewhat confusing. Regarding the RL results, I cannot see clear benefits from the approach. Regarding Figure 3, the authors replied that their method is Soft Reset, but I see it being outperformed by several of its variants. Then why do the authors choose as their proposed method Soft Reset and not one of the variants that outperform it?

Thank you.

Comment

Dear Reviewer DfC5, thank you for your response.

Regarding Figure 3, the authors replied that their method is Soft Reset, but I see it being outperformed by several of its variants. Then why do the authors choose as their proposed method Soft Reset and not one of the variants that outperform it?

It is true that it is outperformed by other variants. Soft Reset is the cheapest method among all the variants -- it requires only one update on $\gamma_t$ and one update on the parameters $\theta_{t+1}$. All the other methods -- Soft Reset more compute, Soft Reset proximal, Bayesian Soft Reset proximal -- require significantly more compute. Figure 3 suggests that we can in fact leverage more compute to achieve better empirical performance.

This is why we chose to present Soft Reset as the main method: it is relatively cheap and already achieves good performance compared to the external baselines. However, when more compute is available, the other variants could also be used to further improve performance.

Hope this clarifies the confusion.

Review
Rating: 7

The paper proposes a method to effectively learn neural network parameters in non-stationary environments. The authors propose a modified learning algorithm that adapts to non-stationarity through an Ornstein-Uhlenbeck process with an adaptive drift parameter. The drift parameter is used to track the non-stationarity in the data: specifically, the adaptive drift pulls the NN parameters towards the initial parameters when the data is not well modelled by the current parameters, while reducing the degree of regularization (towards the initial parameters) when the data is relatively stationary. The paper introduces a Bayesian Soft-Reset algorithm in the Bayesian Neural Network (BNN) framework and a General Soft-Reset algorithm in a deterministic, non-Bayesian framework to effectively learn in non-stationary environments. Empirically, the authors demonstrate that their method performs effectively in non-stationary supervised and off-policy reinforcement learning settings.

Strengths

To take into account the non-stationarity in the data, the methodology in the paper modifies the SGD algorithm by adapting the regularization strength and the regularization target; a significant contribution of the work is to do this modification in a principled manner using a drift parameter that is updated online from the stream of data.

Overall, the paper is well written. The need and novelty of the methodology are clearly presented, and the illustration of the use of methodology on realistic instances clearly demonstrates how it works and improves on the existing methodologies for learning in non-stationary data regime.

Weaknesses

Some parts can be improved -

  1. Section 2 - Simplification of equation (6) to (9) requires some explanation [page -5].
  2. Section 3 - Some intuition behind the objective (11) [page - 6] would be good to understand it better (especially the second term in that objective)
  3. Some discussion on how the methodology performs relative to the degree of non-stationarity in the data would be welcome.
  4. Some minor typos in line 109, 154, 190

Questions

How does the methodology perform relative to the degree of non-stationarity in the data? Specifically, to what extent does this methodology remain effective under varying levels of non-stationarity?

Limitations

Authors have adequately addressed the limitations of their work.

Author Response

We thank the reviewer 6GY1 for their positive feedback. Please find our answer below.

Section 2 - Simplification of equation (6) to (9) requires some explanation [page -5]

Equation 9 is obtained from equation 6 as follows. First, we linearise the log-likelihood function $\log p(y_{t+1} \mid x_{t+1}, \theta)$ around $\mu_t$, which gives $\log p(y_{t+1} \mid x_{t+1}, \theta) \approx \log p(y_{t+1} \mid x_{t+1}, \mu_t) + g_{t+1}^T (\theta - \mu_t)$. Then, we notice that $p(y_{t+1} \mid x_{t+1}, \theta) = \exp\{\log p(y_{t+1} \mid x_{t+1}, \theta)\}$, which means that $p(y_{t+1} \mid x_{t+1}, \theta) \approx p(y_{t+1} \mid x_{t+1}, \mu_t) \exp\{-g_{t+1}^T \mu_t\} \exp\{g_{t+1}^T \theta\}$. Then, since $q_{t}^{t+1}(\theta \mid S_t,\Gamma_{t-1},\gamma_t)$ is a Gaussian, we can compute the integral in closed form. After that, we keep only the terms which depend on $\gamma_{t}$, since we are only interested in the optimization of $\gamma_t$.

We will add a full derivation of this expression in the Appendix.

Section 3 - Some intuition behind the objective (11) [page - 6] would be good to understand it better (especially the second term in that objective)

The objective in equation 11 is the negative Evidence Lower Bound (ELBO) on the approximate predictive log-likelihood (eq. 6), if all $\lambda_i = 1$. It can be derived by considering the integral on the right-hand side of eq. 6, dividing and multiplying it by $q(\theta)$, and applying Jensen's inequality. The fact that we introduce $\lambda_i \neq 1$ means that we introduce a temperature parameter (see [1]) on the prior for each dimension $i$. It was shown empirically [1] that using a temperature in the prior leads to better empirical results. We will add a more explicit explanation of how to derive objective 11 and will add the full derivation in the Appendix.
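Schematically, and only as an assumed illustration of the structure described above (not the paper's exact eq. 11), a per-dimension tempered negative ELBO would read

$$\mathcal{F}(q, \gamma_t) = -\,\mathbb{E}_{q(\theta)}\!\left[\log p(y_{t+1} \mid x_{t+1}, \theta)\right] + \sum_{i=1}^{D} \lambda_i\, \mathrm{KL}\!\left(q(\theta_i)\,\|\, q_t^{t+1}(\theta_i \mid \gamma_t)\right),$$

so that setting every $\lambda_i = 1$ recovers the standard negative ELBO, while $\lambda_i \neq 1$ tempers the prior term per dimension.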

How does the methodology perform relative to the degree of non-stationarity in the data? Specifically, to what extent does this methodology remain effective under varying levels of non-stationarity?

Our experiments in Figure 3 and Figure 4 partially answer this question. The difference between the Data Efficient and Memorization regimes in Figure 3 is the number of epochs given to a task: 70 epochs for Data Efficient and 400 for Memorization. There is more non-stationarity in the Data Efficient setting since the task changes more frequently, and our methodology remains more effective than the baselines.

In general, we expect our method to be very helpful in scenarios where relatively long stationary phases are followed by large changes in the optimal parameters. Moreover, given that we can define $\gamma_t$ per parameter or per layer, we can have a combination of these scenarios -- some parameters stay stationary whereas others change significantly. Such a scenario occurs when stationary segments are followed by non-stationary ones. In the case when the non-stationarity is very strong and the data distribution changes at every step, we think it is less likely that our method will bring an additional benefit over SGD, since in this setting SGD will be constantly refreshed by the noise from the data. SGD, however, will struggle when there is a large change after a long stationary phase. Compared to reset-type algorithms, our method is much more adaptive and relearns the parameters only when it "has to" (i.e., when the data changes sufficiently).

We see partial support for this claim in our RL experiments in Figure 4, where, as we become more off-policy, we see much more benefit of our method over the standard learning approach. Moreover, we also see a benefit of our method over the baseline in a more on-policy regime in Figure 4, left, top, on the Humanoid environment. Since this environment is hard to solve, the baseline agent will see a lot of similar data initially and therefore the corresponding RL algorithm could overfit. Constantly refreshing the parameters is beneficial here. On the other hand, if we study Figure 4, left, bottom, which shows performance in the Hopper environment, we can see that Soft Resets is less effective. In this setting, since the problem is easy, the baseline method constantly sees a lot of different data and is less prone to overfitting.

Some minor typos in line 109, 154, 190

Thank you for the catch, we will make the adjustments.

References:

[1] How Good is the Bayes Posterior in Deep Neural Networks Really?, Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin, 2020

Review
Rating: 4
  • The authors study a learning algorithm that can handle the non-stationarity of the data distribution.
  • They propose a parameter drift model based on the Ornstein-Uhlenbeck process, which models a form of “soft parameter reset” adaptive to the data stream. The drift model has an adaptive parameter $\gamma_t$ which a gradient-based optimizer can learn online.
  • They illustrate the update rule of the main parameters of a neural network incorporating the learned drift model. The update rule is first proposed under the Bayesian neural network framework and then later adapted to a non-Bayesian neural network.
  • They numerically corroborate the efficacy of their method on plasticity benchmarks and reinforcement learning experiments.

Strengths

  • S1. They propose a novel way to learn online the amount of resetting the parameters.
  • S2. The method is general enough to be applied broadly to continual learning problems and reinforcement learning problems.

Weaknesses

  • W1. Modeling parameter drift is not well-motivated.
    • The learnable drift model $p(\theta_{t+1} \mid \theta_t, \gamma_t)$ is the main contribution of this work. In my opinion, however, it is unclear why we should care about the drift of “currently learning” neural network parameters.
    • According to the beginning of Section 3, the local minima of the objective function may change over time. This is of course acceptable. However, I cannot understand how the change of the local optima $\theta^{\star}_t$ is relevant to the current parameter $\theta_t$. To be more specific, the two sentences in lines 94-96 are not connected well.
  • W2. The paper is NOT self-contained overall.
    • The authors claim that they chose the Ornstein-Uhlenbeck (OU) process as a drift model. However, it is unclear how a continuous stochastic differential equation (the OU process) can be converted to the discrete Markov chain defined in Equation (4). The paper does not even introduce the actual form of the OU process. Moreover, it is not very clear why the authors chose the particular model defined in Equation (4), because it is not shown (at least in the paper) that this model is the unique option to choose.
    • There are a lot of equations whose derivation is omitted. Let me list them: Equations (9), (10), (11), (14), (15). In particular, I am suspicious of both the validity and usefulness of Equation (10).
    • For these reasons, the paper is not so easy to follow.
  • W3. The soft reset method seems computationally heavy.
    • I am worried about the computational cost. Although it is good to learn the forgetting parameter $\gamma_t$ online, it roughly doubles the computational cost. It might be more than twice because learning $\gamma_t$ requires Monte-Carlo (MC) sampling.
    • In addition, there seem too many hyperparameters to tune.
  • W4. Minor comments on typos/mistakes
    • Do not capitalize the word “neural network”. (e.g., line 35-36)
    • The term “plasticity loss” may be read as a type of loss function for some people (new to this field). I recommend using the term “loss of plasticity”.
    • Line 84: “non-stationary” → “non-stationarity”
    • Line 109: “mportant” → “important”
    • Line 127: “wrt” → “with respect to”. Do not use abbreviations.
    • Line 137: “property” → “properties”
    • Line 154: what is “s” next to a parenthesis?
    • Line 168: “$\theta = (\theta_i, \ldots, \theta_D)$” → “$\theta = (\theta_1, \ldots, \theta_D)$”
    • Line 169: “$\mathbb{R}^D$” is a more standard notation than “$\mathcal{R}^D$”.
    • Line 222: In “$\lambda_i = \hat{\lambda}_i \sigma^2_{t,i}$”, I think $\hat{\lambda}$ must not have a subscript.
    • Line 225: “sinc” → “since”
    • Equation (14): What is “$\tilde{\mathcal{F}}$”? (Partial) gradient of “$\mathcal{F}$”?
    • Line 232: What does it mean by “we assume that $\mu_0 = \theta_0, \sigma_0^2$.”?
    • Line 236: “$\theta_{t,i}(\gamma_{t,i}) = \theta_t$” → “$\theta_{t,i}(\gamma_{t,i}) = \theta_{t,i}$”
    • Line 240: “linearisng” → “linearizing” or “linearising”
    • Line 292: duplicate “See”
    • Line 354: “modelling” → “modeling”
    • Lines 492-493: I guess “data efficient” and “memorization” are swapped.

Questions

Q1. As far as I know, the Hard Reset method typically resamples the model parameter every time it resets the parameter. With this in mind, what if we slightly modify the drift model as $p(\theta \mid \theta_t, \gamma_t) = \mathcal{N}(\theta;\, \gamma_t \theta_t + (1-\gamma_t)\theta'_0;\, (1-\gamma_t^2)\sigma_0^2)$, where we re-sample $\theta'_0 \sim p_0(\theta_0)$ at every time $t$?

Limitations

The paper discusses its limitations in the experiment section. I think it might be better to mention them explicitly in the conclusion section.

Author Response

We thank reviewer 4koq for their response. Please find our detailed answer below.

why drift model...

In Section 3 and in Appendix D, we presented the reasons for using a drift model together with learning the NN parameters. Figure 1 illustrates the high-level intuition in the case of online Bayesian estimation. Using SGD language, assuming that the data is stationary up to time $T$, the SGD estimate $\theta_{T}$ would tend towards a local optimum $\theta^*$. If the data stays stationary at time $T+1$, $\theta_{T+1}$ will move closer to $\theta^*$. However, if at time $T+1$ the data distribution changes, and so does the set of local optima, SGD might struggle to move towards this new set starting from $\theta_T$. The drift model allows the learning algorithm to make larger moves towards this new set of local optima. Such an idea was also used in the context of online convex optimization, see [2-3]. The form in eq. 4 encourages the parameters to return towards the initialization over time by shrinking the mean and increasing the variance (the learning rate in SGD), allowing bigger steps towards new local minima.

We will change the wording of the beginning of Section 3 to better reflect this explanation.

form of OU process

The OU process defines an SDE that can be solved explicitly and written as a time-continuous Gaussian Markov process with transition density $p(x_t \mid x_s)=\mathcal{N}(x_s e^{-(t-s)},\,(1-e^{-2(t-s)})\sigma_0^2 I)$ for any pair of times $t>s$. Based on this, as a drift model for the parameters $\theta_t$ (so $\theta_t$ is the state $x_t$) we use the conditional density $p(\theta_{t+1} \mid \theta_t)=\mathcal{N}(\theta_t\gamma_t,\,(1-\gamma_t^2)\sigma_0^2 I)$, where $\gamma_t=e^{-\delta_t}$ and $\delta_t\geq0$ corresponds to the learnable discretization time step. In other words, by learning $\gamma_t$ online we equivalently learn the amount of continuous “time shift” $\delta_t$ between two consecutive states of the OU process. This essentially models parameter drift since, e.g., if $\gamma_t=1$, then $\delta_t=0$ and there is no “time shift”, implying $\theta_{t+1}=\theta_t$.
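A minimal sketch of a single transition under this drift model, assuming a zero-mean OU process (the function name and NumPy implementation are ours for illustration):

```python
import numpy as np

def ou_drift_step(theta_t, delta_t, sigma0, rng=np.random.default_rng()):
    """Sample theta_{t+1} ~ N(gamma * theta_t, (1 - gamma^2) * sigma0^2 * I),
    where gamma = exp(-delta_t) encodes the learnable continuous time shift."""
    gamma = np.exp(-delta_t)  # gamma = 1 -> no drift; gamma -> 0 -> full reset to the prior
    noise = rng.standard_normal(np.shape(theta_t))
    return gamma * theta_t + np.sqrt(1.0 - gamma**2) * sigma0 * noise
```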

...why eq.4

In Section 3, we argued for choosing eq. 4 because it pushes the learning towards the initialization as well as towards the previous parameters, allows the use of gradient-based methods to estimate the drift parameters, and keeps a positive finite variance (assuming we learn over an infinite amount of time), which avoids the degenerate cases of $0$ or $\infty$ variance. Moreover, it couples the mean and the variance via $\gamma$, making it easier to learn $\gamma$ and less likely to overfit. Other potential choices for the drift model are:

  • $\theta_{t+1}=\gamma\theta_t+(1-\gamma)\mu_0+\beta\epsilon$, with $\epsilon\sim\mathcal{N}(0,I)$

    • A fixed $\beta$ was explored in our experiment in Figure 2, left, where we used a constant learning rate $\alpha_t$ for the parameter update. We found that it performed worse than using the rescaled learning rate $\alpha(\gamma_t)$ from eq. 17, which is derived using our model; see Figure 2, right. For $\mu_0=0$, we recover Shrink&Perturb [4].
    • Learning both $\gamma$ and $\beta$ will likely overfit.
  • The mixture $p(\theta_{t+1} \mid \theta_t,\gamma_t)=\gamma_t\,p(\theta_{t+1} \mid \theta_t)+(1-\gamma_t)\,p_0(\theta_{t+1})$, where $p(\theta_{t+1} \mid \theta_t)=\mathcal{N}(\theta_{t+1};\theta_t,\sigma^2)$ and $p_0(\theta_{t+1})=\mathcal{N}(\theta_{t+1};\mu_0,\sigma_0^2)$, which is a Gaussian version of Spike&Slab [5]. This encourages hard resets instead of soft ones. Moreover, using mixtures is problematic since the KL in eq. 11 cannot be computed exactly and needs to be approximated.

We will add the discussion of drift model choices in the appendix.

derivations...

Please see our reply to Reviewer 6GY1 about eq. 9 and eq. 11. Eq. 10 is obtained from eq. 9 by finding fixed points. In eq. 10 we have a typo: the term $\lambda\gamma_t^0$ in the denominator should be replaced by $\lambda$. Eq. 14 is a gradient-descent rule applied to eq. 11 (eq. 12), where $\tilde{F}$ is the derivative of $F$ with respect to $\mu_t$ and $\sigma_t$. Eq. 15 is the maximum a posteriori (MAP) update for $\theta$ from the posterior $p(y_{t+1} \mid x_{t+1},\theta)\,q_t^{t+1}(\theta \mid \gamma_t)$, where we use a temperature in the prior (see [3]).

We will add all the derivations and explanations in the appendix.

Resampling $\theta'$ every time

While this is a possible modification of the drift model, it could be problematic to use, since resampling $\theta'_0$ gives a higher variance than the OU process. The variance of this model is $\gamma_t^2\sigma^2_t+(1-\gamma_t)^2\sigma^2_0+(1-\gamma^2_t)\sigma^2_0$, which equals $2\sigma^2_0$ for $\gamma_t=0$ and is twice the variance of the initialization. This implies that such a model might inflate the variance over time: e.g., if $\gamma_t=0.5$ for all time steps $t$, then $\sigma_{t+1}^2=\gamma_t^2 \sigma^2_t+2(1-\gamma_t)\sigma^2_0=0.5^2\sigma^2_t+\sigma^2_0$, which grows with the iterations.
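As a quick numerical check of this recursion (purely illustrative, assuming $\gamma_t=0.5$ and $\sigma_0^2=1$):

```python
sigma0_sq = 1.0
gamma = 0.5
var = sigma0_sq  # start from the initialization variance
for t in range(10):
    # variance recursion of the resampling model: gamma^2 * var + 2 * (1 - gamma) * sigma0^2
    var = gamma**2 * var + 2.0 * (1.0 - gamma) * sigma0_sq
    print(t, round(var, 4))  # rises above sigma0^2, approaching 4/3 * sigma0^2
```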

Thus, we believe that our current drift model more accurately implements the mechanism of parameter resets since it always brings us back to the exact initialization distribution.

We hope we have been able to address your concerns and we are happy to answer further questions on the above subjects. If we were able to address your concerns, we would be grateful if you would consider increasing your review score.

References:

[1] Dynamical Models and Tracking Regret in Online Convex Programming, Eric C. Hall, Rebecca M. Willett, 2013

[2] Adaptive Gradient-Based Meta-Learning Methods, Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar, 2019

[3] How Good is the Bayes Posterior in Deep Neural Networks Really?, Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin, 2020

[4] On Warm-Starting Neural Network Training, Jordan T. Ash, Ryan P. Adams, 2020

[5] Spike and slab variable selection: Frequentist and Bayesian strategies, Hemant Ishwaran, J. Sunil Rao, 2005

Comment

I have read the author’s rebuttal, but I still have remaining concerns and questions. Let me leave my comments/questions about the author’s responses one at a time.

  1. I don’t think this answer really responds to my question in W1. Here, I am not asking about the motivation of the drift model (although I did so in W2). My question is: why do we care about the conditional distribution of $\theta_{t+1}$, a consequence of a single step of training from $\theta_t$, even though we want to estimate or utilize the dynamics of the moving local optima $\theta^{\ast}_t \mapsto \theta^{\ast}_{t+1}$?
    • After pondering this issue and staring at Figure 1, I suddenly realized that the drift model is actually modeling the dynamics of $\theta^{\ast}_{t+1}$ given $\theta_t$. So in my understanding, the $\theta_{t+1}$ in “$p(\theta_{t+1} \mid \theta_t, \gamma_t)$” actually means $\theta^{\ast}_{t+1}$, the local optimum at time $t+1$, rather than the parameter we will have by updating the parameter from $\theta_t$ using a single step of the learning algorithm. (Or, at least, we want to find a local optimum $\theta^{\ast}_{t+1}$ close to the initialization and $\theta_t$.) Is my understanding correct? If it is, I think this should have been explained in the paper in more detail; also, the term “parameter drift” is a bit misleading in this sense.
    • By the way, I guess the citation ([2-3]) has a typo, right? It seems that it should include [1].
  2. Thank you for the explanation of the relationship between the OU process, the Gaussian Markov process, and the choice of the drift model. Will the authors add this explanation to their paper? Or do they just ignore it?
  3. Thank you for the discussion on the other options for drift models and for considering putting the discussion to the paper.
  4. Thank you for derivations of the equations in more detail. However, I think the derivations could be more kind than the current explanation.
    • Eq 9: Please state with respect to which parameter the linearization is taken (e.g., by linearizing log… around $\theta=\mu_t$ (cf. Line 198)).
    • Eq. 11 & 12: Are these just the usual objective functions in BNN training?
  5. Thank you for the interesting response based on the drift model and the variance. But what if we change the variance scheduling: what if we suitably change the drift model into $p(\theta \mid \theta_t, \gamma_t) = \mathcal{N}(\theta;\, \gamma_t \theta_t + (1-\gamma_t) \theta'_0(t);\, 2\gamma_t(1-\gamma_t)\sigma_0^2)$, where $\theta'_0(t) \sim p_0(\theta_0)$? I guess this model does not suffer from the same problem of variance inflation: if $\sigma_t = \sigma_0$, then ${\rm Var}(\theta) = \gamma_t^2 \sigma_t^2 + (1-\gamma_t)^2 \sigma_0^2 + 2\gamma_t (1-\gamma_t) \sigma_0^2 = \{\gamma_t^2 + (1-\gamma_t)^2 + 2\gamma_t (1-\gamma_t)\}\sigma_0^2 = \sigma_0^2$...! The motivation behind this is the concern about the dependency on the particular initialization “$\theta_0$”, which seems to be fixed throughout the training: “What if the initialization distribution is the only thing that matters?” Please let me know if there are other problems in this drift model, or I would be very happy if you could test this drift model empirically (but I understand if it is impossible due to the time limit)!

Also, there is a question that is not answered yet. Let me bring it here:

  • Line 232: What does it mean by “we assume that $\mu_0 = \theta_0, \sigma_0^2$.”?

Although I appreciate the time and effort invested in the rebuttal and am happy with the author's general response, overall, I feel like the author’s response is not really satisfying yet. Even though the authors requested to reconsider my assessment, I cannot do so before I get a satisfying further response. If the further response is not satisfactory as well, sadly and unfortunately, I am ready to decrease my score to 3 (but not yet) because, in my view, there is still a huge room for improvement in writing and presentation.

Comment

Dear 4koq, thank you for your feedback and please find our answer to your concerns below.

I don’t think this answer really responds to my question in W1. Here, I am not asking about the motivation of the drift model (although I did so in W2). My question is: why do we care about the conditional distribution of $\theta_{t+1}$, a consequence of a single step of training from $\theta_t$, even though we want to estimate or utilize the dynamics of the moving local optima $\theta^{\ast}_t \mapsto \theta^{\ast}_{t+1}$? After pondering this issue and staring at Figure 1, I suddenly realized that the drift model is actually modeling the dynamics of $\theta^{\ast}_{t+1}$ given $\theta_t$. So in my understanding, the $\theta_{t+1}$ in “$p(\theta_{t+1} \mid \theta_t, \gamma_t)$” actually means $\theta^{\ast}_{t+1}$, the local optimum at time $t+1$, rather than the parameter we will have by updating the parameter from $\theta_t$ using a single step of the learning algorithm. (Or, at least, we want to find a local optimum $\theta^{\ast}_{t+1}$ close to the initialization and $\theta_t$.) Is my understanding correct? If it is, I think this should have been explained in the paper in more detail; also, the term “parameter drift” is a bit misleading in this sense.

We apologize for the confusion, but under the Bayesian perspective depicted in Figure 1, $\theta^*_{t}$ and $\theta^*_{t+1}$ denote fixed/deterministic values and therefore are not random variables, while the corresponding random variables which are assigned probability distributions are $\theta_{t}$ and $\theta_{t+1}$. More specifically, we learn the dynamical model distribution $p(\theta_{t+1} \mid \theta_{t},\gamma_t)$, by learning $\gamma_t$, so that the specific value $\theta^*_{t+1}$ can become more likely or “explainable” under the distribution $p(\theta_{t+1} \mid \theta_{t},\gamma_t)$. In other words, we hope that $p(\theta_{t+1} \mid \theta_{t},\gamma_t)$ will place considerable probability mass around the fixed value $\theta^*_{t+1}$. We are going to update the paper to make the above clear and remove the confusion.

To further explain this in terms of Figure 1, consider the stationary case depicted in Figure 1a. There the posterior concentrates around an optimal value (which is the fixed value $\theta^*$); in other words, $q_t(\theta)$ will gradually converge to a delta mass around the value $\theta^*$. In a non-stationary case, the optimal value can suddenly change over time, e.g., from the value $\theta^*_{t}$ at time $t$ to $\theta^*_{t+1}$. Without a dynamical model (Figure 1b), the new posterior $q_{t+1}(\theta)$ after observing the new data at time $t+1$ has a small radius/variance (blue dashed circle) and it cannot concentrate fast enough towards the new optimum. The use of the dynamical or drift model (Figure 1c) introduces an “intermediate step” that constructs an “intermediate prior distribution” $p_t(\theta) = \int q_t(\theta_t)\, p(\theta \mid \theta_t, \gamma_t)\, d\theta_t$, which can have increased variance and a shifted mean (green dashed circle). Then, once we fully incorporate the new data point at time $t+1$, the updated posterior $q_{t+1}(\theta) \propto p(y_{t+1} \mid \theta)\, p_t(\theta)$ can better concentrate around $\theta_{t+1}^*$. We will update the paper to clarify the above.
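A minimal sketch of this predict step for a diagonal Gaussian posterior (the mean/variance formulas follow the drift model above with a prior mean $\mu_0$; the function name and code are ours for illustration only):

```python
def drift_predict_step(mu_t, var_t, gamma, mu0, sigma0_sq):
    """Intermediate prior p_t(theta): convolve the posterior N(mu_t, var_t)
    with the drift model. The mean is pulled towards mu0 and the variance
    is inflated towards sigma0^2 as gamma decreases."""
    mu_pred = gamma * mu_t + (1.0 - gamma) * mu0
    var_pred = gamma**2 * var_t + (1.0 - gamma**2) * sigma0_sq
    return mu_pred, var_pred
```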

citation [2-3] is a typo

Thank you, it should be [1-2].

Q2

Yes we will add this to the paper (we forgot to explicitly write it).

Q3

Thank you.

Eq.11&12 -- usual BNN objective?

Yes, except for the temperature defined per parameter. In BNNs, the temperature $\lambda$ is either 1 or fixed to a constant for all parameters.

Line 232: What does it mean by “we assume that $\mu_0=\theta_0,\sigma^2_0$”?

It means that the mean of the prior $p_0(\theta)=\mathcal{N}(\theta;\mu_0,\sigma^2_0)$ is fixed to $\theta_0$, i.e., the initialization, and $\sigma^2_0$ is a constant. Normally, NNs are initialized from $p_0(\theta)=\mathcal{N}(\theta;0,\sigma^2_0)$, but we make the prior initialization-dependent. We will rewrite this sentence to highlight this explicitly.

We hope our answer clarifies the questions and concerns you have raised, and we thank you for raising interesting points for discussion!

Comment

Eq.9, linearisation

As we stated above in the rebuttal, the response to Reviewer 6GY1 (due to space constraints) contains the partial derivation. The linearisation is $\log p(y_{t+1} \mid x_{t+1},\theta)\approx\log p(y_{t+1} \mid x_{t+1},\mu_t)-g_{t+1}^T(\theta-\mu_t)$, where $g_{t+1}=\nabla_{\theta}\left[-\log p(y_{t+1} \mid x_{t+1},\theta)\right]\big|_{\theta=\mu_t}$. Then, $p(y_{t+1} \mid x_{t+1},\theta) \approx p(y_{t+1} \mid x_{t+1},\mu_t)\exp\{g_{t+1}^T\mu_t\}\exp\{-g_{t+1}^T\theta\}$. Since $q_t^{t+1}(\theta \mid S_{t}, \Gamma_{t-1}, \gamma_t)$ is Gaussian, we compute the integral in eq. 6 in closed form and keep only the terms depending on $\gamma_t$. We provide the derivation for the scalar case here and the full derivation in the appendix. In eq. 6, we have

$$\log\int p(y_{t+1} \mid x_{t+1},\theta)\,\frac{1}{\sqrt{2\pi\sigma^2(\gamma_t)}}\exp\left\{-\frac{(\theta -\mu_t(\gamma_t))^2}{2\sigma^2(\gamma_t)}\right\}d\theta \approx \log \frac{1}{\sqrt{2\pi\sigma^2(\gamma_t)}}\int p(y_{t+1} \mid x_{t+1},\mu_t)\exp\left\{-g_{t+1}(\theta-\mu_t)\right\} \exp\left\{-\frac{(\theta-\mu_t(\gamma_t))^2}{2\sigma^2(\gamma_t)}\right\}d\theta,$$

which becomes

$$\log p(y_{t+1} \mid x_{t+1},\mu_t)+\log \int \frac{1}{\sqrt{2\pi\sigma^2(\gamma_t)}}\exp\left\{-g_{t+1}(\theta-\mu_t)\right\}\exp\left\{-\frac{(\theta-\mu_t(\gamma_t))^2}{2\sigma^2(\gamma_t)}\right\}d\theta.$$

Consider only the exponent term in the integral:

$$\frac{-1}{2\sigma^2(\gamma_t)}\left(2\sigma^2(\gamma_t)g_{t+1}(\theta-\mu_t)+\theta^2-2\theta \mu_t(\gamma_t)+\mu_t(\gamma_t)^2\right),$$

then

$$\frac{-1}{2\sigma^2(\gamma_t)}\left(\theta^2-2\theta \left[\mu_t(\gamma_t)-\sigma^2(\gamma_t)g_{t+1}\right]+\mu_t(\gamma_t)^2-2\sigma^2(\gamma_t)g_{t+1}\mu_t\right),$$

then

$$\frac{-1}{2\sigma^2(\gamma_t)}\left[ \left(\theta-(\mu_t(\gamma_t)-\sigma^2(\gamma_t)g_{t+1}) \right)^2+2\mu_t(\gamma_t) \sigma^2(\gamma_t)g_{t+1}-\sigma^4(\gamma_t)g_{t+1}^2-2\sigma^2(\gamma_t)g_{t+1}\mu_t\right],$$

then

$$\frac{-1}{2\sigma^2(\gamma_t)} \left(\theta-(\mu_t(\gamma_t)-\sigma^2(\gamma_t)g_{t+1})\right)^2-\mu_t(\gamma_t)g_{t+1}+0.5\,\sigma^2(\gamma_t)g^2_{t+1}+g_{t+1}\mu_t.$$

Now, going back to the integral,

$$\log\int\frac{1}{\sqrt{2\pi\sigma^2(\gamma_t)}}\exp\left\{\frac{-1}{2\sigma^2(\gamma_t)}\left(\theta-(\mu_t(\gamma_t)-\sigma^2(\gamma_t)g_{t+1}) \right)^2-\mu_t(\gamma_t)g_{t+1}+0.5\,\sigma^2(\gamma_t)g^2_{t+1}+g_{t+1}\mu_t\right\}d\theta,$$

which equals

$$-\mu_t(\gamma_t)g_{t+1}+0.5\, \sigma^2(\gamma_t)g^2_{t+1}+g_{t+1}\mu_t.$$

We keep only the terms which depend on $\gamma_t$ and get

$$G(\gamma_t)=-\mu_t(\gamma_t)g_{t+1}+0.5\, \sigma^2(\gamma_t)g^2_{t+1}.$$

We want to maximize $G(\gamma_t)$, or minimize $F(\gamma_t)=-G(\gamma_t)$, which is exactly the loss we defined in eq. 9.

We will add the full derivation in the appendix.

Different drift model

Assuming that $\theta'\sim\mathcal{N}(\theta';\mu_0,\sigma^2_0)$, the model you wrote will have the mean $\gamma_t\theta_t+(1-\gamma_t)\mu_0$ and the variance $\gamma^2_t \sigma^2_t+(1-\gamma^2_t)\sigma^2_0$, which are the same as for our OU model (see eq. 4 and Line 173 after eq. 5); therefore they are mathematically equivalent.

To discuss this in detail, let $\mu_0$ be the prior mean for eq. 4 and $\mu'_0$ the prior mean in your variant, i.e., $\theta'\sim\mathcal{N}(\theta';\mu'_0,\sigma^2_0)$. When we incorporate the Gaussian drift model in the update, we are only concerned with its mean and variance, since these affect the SGD updates (see eq. 16).

We have the following three cases:

  • $\mu_0=\mu_0'=0$: the models have the same mean, equal to $0$. They are mathematically equivalent to the model $\theta_{t+1}=\gamma_t\theta_t+\sqrt{1-\gamma^2_t}\,\theta_0$, where $\theta_0\sim p_0(\theta)=\mathcal{N}(\theta;0,\sigma^2_0)$ -- the models are the same. They have the mean $\gamma_t\mu_t$ and the variance $\gamma^2_t\sigma^2_t+(1-\gamma^2_t)\sigma^2_0$.

  • $\mu_0=\mu'_0=\theta_0\neq0$: the models have the same mean, which is the one specific initialization of the NN parameters. They are mathematically equivalent, with the mean $\gamma_t\mu_t+(1-\gamma_t)\theta_0$ and the same variance as above.

  • $\mu_0=\theta_0$ and $\mu'_0 \neq \theta_0$: these two models have the same variance, $\gamma_t^2 \sigma^2_t + (1-\gamma^2_t)\sigma^2_0$, but have different means: $\gamma_t\theta_t+(1-\gamma_t)\theta_0$ versus $\gamma_t\theta_t+(1-\gamma_t)\mu'_0$.

To implement variant 3, the mean $\mu'_0$ would have to be fixed at the beginning of training, because otherwise we would be under-counting the variance in the corresponding drift model. It is unclear how to interpret this model or how it would affect performance.

Thus, in order to answer your question about the impact of the particular initialization, we believe variant 1 is the right answer. We conducted an experiment using Soft Reset variants 1 and 2 on random-label MNIST (data efficient), where we swept over hyperparameters. We found that these variants had similar performance on this benchmark. We will add this experiment to the Appendix of the paper.

We hope our answer clarifies the questions and concerns you have raised, and we thank you for raising interesting points for discussion!

Comment

I thank the authors for a couple of more detailed replies.

  1. About the original drift model: Thank you for resolving most of my confusion. Now I think I understand the concept of the drift model. Still, I am concerned that, in the authors' explanation, the local optimum $\theta^{\ast}_t$ seems to be a unique deterministic value/vector. However, in the era of overparameterized deep learning and because of its nonconvex nature, $\theta^{\ast}_t$ may not be unique; rather, there might be a set of (non-unique) local optima. Because of this, I thought the drift model is a kind of mechanism for choosing a proper/desired $\theta^{\ast}_t$ among all local optima. Although I admit that $\theta^{\ast}_t$ is not a random object (though it still can be, because of randomness in data sampling) and $\theta_t$ is an estimate of it (thus it is definitely a random object), considering the non-uniqueness of local optima, it seems vague and non-rigorous to me to say "getting closer to $\theta^{\ast}_t$" or "concentrating around $\theta^{\ast}_t$".

  2. About linearization: I am already aware of the derivation explained for reviewer 6GY1. All I wanted was just a kind derivation in your next revised paper. I wanted to take part in improving your paper.

  3. About variations of the drift model: thank you for clearly answering my question. Personally, I was excited to suggest a new point of view on your methodology, but it turns out it was somewhat meaningless (which is sad). Anyway, I am happy to understand the significance of the original formulation of the drift model.

Thank you again for the detailed response. Nonetheless, in my opinion, it would be better to judge a completely rewritten manuscript with another review process. Therefore, I maintain my score.

Comment

We thank the Reviewer 4koq for their response and for sparking interesting discussion! Please see our answer below.

I am concerned that, in the authors' explanation, the local optimum $\theta^{\ast}_t$ seems to be a unique deterministic value/vector. However, in the era of overparameterized deep learning and because of its nonconvex nature, $\theta^{\ast}_t$ may not be unique; rather, there might be a set of (non-unique) local optima.

Please note that when we discuss the local optima $\theta^*_{t}$ and $\theta^*_{t+1}$, we do not assume their uniqueness. Rather, the only assumption we make is that our algorithm may converge to some local optimum $\theta^*_{t}$ (which is a member of a set of local optima), and that after the non-stationarity a new set of local optima will appear, one member of which is $\theta^*_{t+1}$. The reason why we use two specific points $\theta^*_{t}$ and $\theta^*_{t+1}$ in Figure 1 is to simplify the understanding and communicate the intuition behind the non-stationary behaviour and the impact of a drift model.

In the paper, in Section 3.1, we mainly refer to a region of new local minima; please see, for example, Lines 110-112: "If the change is such that the region of new good local minima is far away from $\theta^*$, SGD may take considerably higher amount of time to move its parameters towards this region."

Hope this clarifies the concerns.

About linearization: I am already aware of the derivation explained for reviewer 6GY1. All I wanted was just a kind derivation in your next revised paper. I wanted to take part in improving your paper.

Thanks to the useful discussion with you and the concerns you have raised, we think that including the full derivation of the linearization in the appendix of the paper will greatly reduce the confusion shared by you and other reviewers. We thank you for your suggestions, and we will add more context to the paper.

Review
Rating: 4

This work focuses on the problem of plasticity loss in non-stationary problems. It proposes a new solution which adaptively drifts the parameters toward the initial distribution. The proposed solution is a form of soft resetting and can be seen as a meta version of L2 Init where the degree of drift towards initialization is learned from data.

Strengths

The proposed solution is novel. Current soft resetting methods reset the parameters of the model at a constant rate, irrespective of the degree of non-stationarity. The proposed solution introduces adaptive soft resetting, where the degree of resetting is calculated based on the degree of non-stationarity. This adaptiveness allows the solution to adapt faster and retain more prior knowledge. The experimental results show the effectiveness of the proposed solution.

Weaknesses

Although the proposed solution is novel and effective, the paper has many weaknesses:

  1. Hyperparameter sensitivity. Given that the proposed solutions introduce many new hyperparameters and the authors say in line 207 that one of the solutions is sensitive to the choice of hyperparameter, the paper needs to contain hyperparameter sensitivity curves for all new hyperparameters.
  2. Computational cost. The proposed solutions seem computationally expensive. It would be good to include the wall-clock time of each algorithm beside one of the figures, and Figure 3 might be best.
  3. Limited performance improvement. The primary purpose of high replay-ratio is to improve the sample efficiency of algorithms. However, results in Figure 4 show that the proposed solution does not utilize a high replay-ratio. Its sample efficiency at replay ratio 1 is the same as its sample efficiency at replay ratio 128.

Questions

See Weaknesses

Limitations

Computational cost and hyperparameter sensitivity need to be discussed in the conclusion of the main paper.

Author Response

We thank the reviewer Hmj6 for their feedback. Please find our detailed answer below.

Hyperparameter sensitivity. Given that the proposed solutions introduce many new hyperparameters and the authors say in line 207 that one of the solutions is sensitive to the choice of hyperparameter, the paper needs to contain hyperparameter sensitivity curves for all new hyperparameters.

We have provided the answer about hyperparameters sensitivity above in the common answer for all the reviewers.

Computational cost. The proposed solutions seem computationally expensive. It would be good to include the wall-clock time of each algorithm beside one of the figures, and Figure 3 might be best.

We have provided the answer about computational complexity above in the common answer for all the reviewers.

Limited performance improvement. The primary purpose of high replay-ratio is to improve the sample efficiency of algorithms. However, results in Figure 4 show that the proposed solution does not utilize a high replay-ratio. Its sample efficiency at replay ratio 1 is the same as its sample efficiency at replay ratio 128.

Our choice of reinforcement learning experiment is motivated by the Primacy Bias paper [1], where significant plasticity loss occurs and where the benefits of parameter resets were observed. In this setting, it was shown that as the replay ratio increases, the performance of the algorithm (Soft Actor-Critic) can significantly degrade, and hard-resetting the parameters once in a while can significantly improve its performance. With this in mind, our Figure 4 demonstrates that Soft Reset can in fact be an even more effective strategy as the replay ratio increases: it can learn when to reset parameters and by what amount, compared to Hard Resets, which require setting these choices manually.

We are happy to take further questions on any of the above matters, and we hope that we have been able to address your concerns. If we have done so, we would be grateful if you would consider increasing your review score.

References:

[1] The Primacy Bias in Deep Reinforcement Learning, Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, Aaron Courville, 2022

Author Response

Dear reviewers, thank you for your feedback. In this section we provide answers to recurring points raised by several of you.

Computational Complexity

The following tables will be added to the Appendix together with a reference to this appendix at the end of Section 3.

Notations:

  • $P$ - number of parameters of the neural network (NN)
  • $L$ - number of layers
  • $O(S)$ - cost of a backward pass of SGD
  • $K$ - number of Monte Carlo (MC) samples in eq. 7
  • $J$ - number of iterations in eq. 7
  • $I$ - number of parameter updates for proximal methods in eq. 14 and eq. 18
  • $M$ - number of MC samples for the Bayesian method in eq. 12
| Method | Comp. cost | Memory |
| --- | --- | --- |
| SGD | $O(S)$ | $O(P)$ |
| Soft resets, $\gamma$ per layer | $O(JKS + S)$ | $O(L+(K+1)P)$ |
| Soft resets, $\gamma$ per param. | $O(JKS + S)$ | $O(P+(K+1)P)$ |
| Soft resets, $\gamma$ per layer + proximal ($I$ iters) | $O(JKS + IS)$ | $O(L+(K+1)P)$ |
| Soft resets, $\gamma$ per param. + proximal ($I$ iters) | $O(JKS + IS)$ | $O(P+(K+1)P)$ |
| Bayesian Soft Reset Proximal ($I$ iters), $\gamma$ per layer | $O(JKS + 2MIS)$ | $O(L+(K+2)P)$ |
| Bayesian Soft Reset Proximal ($I$ iters), $\gamma$ per param. | $O(JKS + 2MIS)$ | $O(P+(K+2)P)$ |

The table above gives the theoretical cost of each method. In practice, we use $J=1$ for Soft Resets and $J=10$ for the proximal methods. The table below quantifies the exact cost for the methods from Figure 3.

| Method | Comp. cost | Memory |
| --- | --- | --- |
| SGD | $O(S)$ | $O(P)$ |
| Soft resets | $O(2S)$ | $O(L+2P)$ |
| Soft resets more compute | $O(11S)$ | $O(L+2P)$ |
| Soft resets proximal | $O(10S + 10S)$ | $O(L+2P)$ |
| Bayesian Soft Reset Proximal | $O(10S + 20S)$ | $O(L+3P)$ |
| Bayesian Soft Reset Proximal, $\gamma$ per param. | $O(10S + 20S)$ | $O(P+3P)$ |

From Figure 3, we see that it is beneficial to spend more compute on optimizing $\gamma$ and the NN parameters. However, even the cheapest variant, Soft Resets, leads to good performance.

In the reinforcement learning (RL) experiment, we do one update on $\gamma$ after each new chunk of fresh data, and $G$ updates on the NN parameters with a cost of $O(S)$ each. The table below gives the complexities.

| Method | Complexity | Memory |
| --- | --- | --- |
| SAC | $O(GS)$ | $O(P)$ |
| Soft Reset | $O(S + GS)$ | $O(L + 2P)$ |

Soft Reset is marginally more expensive than SAC in RL but leads to better performance (see Figure 4) in the highly off-policy regime.

Hyperparameter sensitivity

We study the sensitivity of Soft Resets where $\gamma$ is defined per layer.

Fixed parameters:

  • Number of MC samples: $K=1$ and $M=1$.
  • The learning rate for the parameters was tuned beforehand and equals $\alpha=0.1$.

Sensitivity parameters (see Algorithm 1 for precise definitions):

  • $\eta_{\gamma}$ - learning rate for the drift model
  • $f$ - initial prior standard deviation rescaling, i.e., $\sigma^l_0 = f \frac{1}{\sqrt{H}}$, where $H$ is the width of layer $l$
  • $s$ - posterior scaling, i.e., $\sigma^l_t = s\, \sigma^l_0$

We provide the plots for the sensitivity analysis of Soft Reset on MNIST (data efficient) in the attached PDF. On top of that, we conduct a sensitivity analysis of the L2 Init [1] and Shrink&Perturb [2] methods. The X-axis of each plot denotes one of the studied hyperparameters, whereas the Y-axis is the average performance across all the tasks (see the Experiments section for the task definitions). The standard deviation is reported over 3 random seeds. A color indicates a second studied hyperparameter, if available. In the title of each plot, we list the hyperparameters which are fixed.

Takeaways

The most important parameter is the learning rate $\eta_{\gamma}$ of the drift model. For each method, there exists a good value of this parameter, and performance is sensitive to it. This makes sense, since this parameter directly impacts how we learn the drift model.

The performance of Soft Resets is robust with respect to the posterior standard deviation scaling parameter $s$, as long as $s\geq0.5$. For $s<0.5$, the performance degrades. This parameter is defined from $\sigma_{\mathrm{posterior}} = s\, \sigma_{\mathrm{prior}}$ and affects the relative increase in the learning rate (see eq. 13 and eq. 17). This increase is given by $1/(\gamma^2 + (1-\gamma^2)/s^2)$, which could be ill-behaved for small $s$.
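For intuition (numbers purely illustrative, assuming $\gamma=0.9$), the rescaling factor stays near 1 for moderate $s$ but collapses for small $s$, shrinking the effective learning rate drastically:

```python
gamma = 0.9
for s in [1.0, 0.5, 0.1]:
    factor = 1.0 / (gamma**2 + (1.0 - gamma**2) / s**2)
    print(f"s={s}: learning-rate rescaling factor ~ {factor:.3f}")
# s=1.0 -> ~1.000, s=0.5 -> ~0.637, s=0.1 -> ~0.050
```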

We also study the sensitivity of the baseline methods. We find that L2 Init is very sensitive to the parameter $\lambda$, which is the penalty on the $||\theta-\theta_0||^2$ term. In fact, Figure 2, left, shows that there is only one good value of this parameter which works. Shrink&Perturb is very sensitive to the shrink parameter ($\lambda$). Similar to L2 Init, there is only one value which works, $0.9999$, while the values $0.999$ and $0.99999$ lead to bad performance. This method, however, is not very sensitive to the perturb parameter $\sigma$, provided that $\sigma \leq 0.001$.

Compared to the baselines, our method is more robust to the choice of hyperparameters.

We also conducted a sensitivity analysis for the other variants of the method but, due to space constraints, will only include the results in the camera-ready version. The take-aways are similar.

References:

[1] Maintaining Plasticity in Continual Learning via Regenerative Regularization, Saurabh Kumar, Henrik Marklund, Benjamin Van Roy, 2023

[2] On Warm-Starting Neural Network Training, Jordan T. Ash, Ryan P. Adams, 2020

Final Decision

This paper proposes to adapt to non-stationarity in neural network training using an Ornstein-Uhlenbeck process with an adaptive drift parameter. The proposed method estimates the drift parameter online and allows for automatic soft resets of parameters when needed. The approach is evaluated on plasticity benchmarks and off-policy reinforcement learning tasks. The reviewers all agree that the main idea is interesting and novel.

However, the significance of the contributions is not clearly demonstrated due to the lack of clarity in the submission. In its current form, even the most engaged readers have some difficulty grasping the core technical ideas; see, e.g., the discussion thread with Reviewer 4koq. In the rebuttal process, the authors commit to resolving several of these issues (I will not repeat them here), which will be critical to the paper having broad impact. In particular, I strongly recommend adding a candid discussion of the computational costs of the proposed approach.