5.5

/10

Rejected4 位审稿人

最低3最高8标准差1.8

3.3

置信度

正确性2.5

贡献度2.8

表达2.8

ICLR 2025

Permutation Invariant Learning with High-Dimensional Particle Filters

Akhilan Boopathy,Aneesh Muppidi,Peggy Yang,Abhiram Iyer,William Yue,Ila R Fiete

OpenReview PDF

提交: 2024-09-27更新: 2025-02-05

TL;DR

We propose a particle-filter based learning algorithm that is invariant to the permutations of the batches or tasks presented to it.

摘要

关键词

permutation-invariant learningcontinual learningloss of plasticitycatastrophic forgettingparticle filterhigh-dimensional

评审与讨论

审稿意见

评分: 5置信度: 42024-10-31

This paper discusses a particle filter scheme (named as WPF) which tentatively provide a solution of continual tasks learning given high-dimensional parametrized models. WPF method reweighs each particle samples by manipulating the posterior distribution of parameters with gradient of task losses which are computed in a sequential order. Authors justify that WPF methods is permutation-invariant, can prevent notorious forgetting behavior and can prevent "loss of plasticity". There are some experiments conducted over SplitMNIST, SplitCIFAT1100 and ProcGen to show that: i) WPF can be a replacement of vanilla gradient-based optimizer and it can combine with loss regularization techniques (SI, LWF, EWC in this paper) to show performance improvement under both CL and LRL environments ii) WPF effectively reduces the variance under both CL and LRL environments;

优点

The idea to apply particle filter technique is not particularly novel. However, the efficient realization in a high-dimensional parameter space and show its improvement in continual learning setup is a solid contribution. The author conduct extensive experiments to quantitatively justify the claim.

缺点

The major disappointment is the inconsistency between what authors attempt to justify in theory and what practical experiments reflect. Frist of all, the notation setup in Section 3.1 is leading confusion. In Equation (2)(3), a better statement of the posterior distribution is to define $p_t$ under conventional setup: $p_t(x):= p(x_t|L_{1:t})$ , and it will improve the readability. While defining the distribution with sequential observation of loss using $\hat{p}_0[L_1,L_2,\ldots,L_T]$ is valid, the sudden show up of $\hat{p}[L]$ in line 158 is not properly defined. Adding some notation explanations would be appreciated.

Secondly, I personally did not view permutation invariance as a contribution bulletin. Theorem 1 (line 177-182) justify that a distance metric $D$ is bounded by some constant (including $N,\epsilon,T,C$ ), and these constants do not have any constraints to indicate if the distance metric converges. Throughout the paper authors fail to give one example of a practical formula of metric $D$ . Moreover, the author then justify in line 182-183 that Equation (9) implies particle filters are "approximately permutation invariant", without any further qualitative analysis. I am expected to see if there are some experiments to justify/support Theorem 1 , but the only statement in line 510-525 is the performance improvement of reweighted particles which convey readers that "variance reduction implies permutation invariant". Please justify why permutation invariant can be verified by computing variance of repetitive experiments. I understand that WPF is capable of achieving better performance and lower variance under 10 different permutations runs, but this has nothing to do with the statement in Theorem 1.

To strengthen the claim, there are two potential improvements. One way is to show that variance-reduced gradient descent scheme (rather than using a fixed-step gradient descent scheme as per baseline), fail to achieve better overall performance over many runs of permutations. One classical variance-reduced gradient descent scheme is SVRG. Another way is to assign a computable metric (e.g. Wasserstein distance) and quantitatively validate theorem 1.

Besides, even though I assume all proof is correct, the author fails to bring attention in experiments to see which part of the result reflect that theorem 2 is correct, which is an upper bound of loss. All reported numbers are mean performance by weighted average of model parameters, and its variance accordingly. Moreover, two assumptions stated in theorem 2 is not validated in experiments as well.

Lastly, an equivalent statement of author's implementation, after reading algorithm 1, is: "WPF repeat gradient descent $N$ times, where $N$ is the number of particles, and reweigh each particles by an exponential term along the path." Then, I am expected to see authors justify why this is not considered as an Mixture-of-Expert(MoE) scheme, and authors ignored the existence of MoEs at all. It would be better if, at least, discuss or compare MoE schemes under this continual learning setups.

This paper has some presentation flaws of experimental details as well, please refer to questions for details.

问题

Apart from aforementioned concerns, there are some specific questions that I wish to hear from authors.

Major doubts:

For all conducted experiments, are there any numbers regarding dimension of particles can be reported?
Though WPF can be combined seamlessly with other schemes, what is the additional computation cost introduced by flowing multiple particles? From my understanding, the cost increases linearly with number of particles. Any specific runtime comparisons between WPF and baseline methods, as well as how the runtime scales with the number of particles?
In Theorem 3, linear loss function is assumed. However, in classification task setup (e.g. SplitMNIST,SplitCIFAR100), I presume the loss is the CrossEntropy loss, am I right? If sequential losses are CrossEntropy losses, then follow-up questions pops up: how to justify probability density matches in this case? How Theorem 3 relates to the experimental setup with non-linear loss functions?
If in Algorithm $1$ , we set number of particles $N=1$ , is it equivalent to a fixed step size gradient-descent? If so, in experiments, is the benchmark GD implemented with fixed step size?
Following up the previous question, what is the hyperparameter $\sigma$ 's role in controlling the WPF's performance. I didn't find the hyperparameter setup $\sigma$ in Section 4. Please specifiy the number used in experiments and if possible, justify the choice of such numbers.

Minor questions/suggestions:

In line 117, what does $L_t\in \mathbb{R}^d\rightarrow \mathbb{R}$ mean? If $L_t$ is a loss function, then please cosider revise the notation.
Is BC listed in Table 2 abbreviation of "Behavior Cloning"?
Considering adding statement in Figure 3 to indicate the uncertainty band is under one-times standard deviation, two-times standard deviations or three-times standard deviations. Two blue lines is not a good idea either.
Line 461, please consider avoid using colloquial statement like "interestingly".

2024-11-20

Thank you for your detailed feedback and for acknowledging the solid contribution of our work. We address your concerns point by point.

Clarification on Notation in Section 3.1

We apologize for any confusion regarding the notation. We have revised our text to explain our notation more carefully now.

Notation $L_t: \mathbb{R}^d \rightarrow \mathbb{R}$ : This means that the loss function $L_t$ takes a parameter vector $x \in \mathbb{R}^d$ and outputs a scalar loss value in $\mathbb{R}$ .

Definition of $p_t$ : We define $p_t(x)$ as the posterior distribution over model parameters after observing losses up to time $t$ . We are happy to clarify any further particular points of confusion.

Significance of Permutation Invariance and Theorem 1

Practical Relevance of Theorem 1: We are unsure what you are referring to when you ask if the distance metric "converges"; we do not make any claims of convergence with respect to any variable. Instead, Theorem 1 provides a bound on how the order of data affects the particle filter's output. With $C$ close to 1 and small $\epsilon$ , the discrepancy remains small, indicating approximate permutation invariance. In fact, when $\epsilon=0$ , the algorithm is exactly permutation invariant.

Relation to Variance Reduction: We assess permutation invariance by measuring the variance in model performance across different data permutations. Lower variance suggests that the model's output is less sensitive to data ordering, which aligns with permutation invariance.

To further validate the connection between permutation invariance and variance reduction, we are conducting additional experiments with the baseline of SVRG as suggested. We will update with empirical results as soon as they are available.

Validation of Theorem 2 in Experiments

Theorem 2 provides a performance guarantee against both catastrophic forgetting and loss of plasticity. We explicitly demonstrate the avoidance of catastrophic forgetting and loss of plasticity in our experiments in Section 4.1. We are happy to make any modifications to our text to make this connection clear.

As we mention in Section 3.3, the assumptions in Theorem 2 are conditions on the task: the loss function must be sufficiently smooth (equation 10) and easy to optimize with SGD (equation 11). In practice, these are highly reasonable assumptions: the tasks we test on have non-pathological loss landscapes and are amenable to optimization as demonstrated by our experiments in all settings.

Difference from Mixture-of-Experts (MoE)

Our method differs from MoE in two key aspects:

Ensemble vs. Expert Selection: In MoE, only one expert (or a subset) is active for a given input, whereas our method combines all particles' outputs, representing the full posterior distribution.

Bayesian Foundation: Our particle filter is grounded in Bayesian inference, providing theoretical guarantees like permutation invariance and mitigation of forgetting, which are not inherent in standard MoE models.

Given the significant differences between MoE and our approach, we believe standard ensemble methods are a more appropriate baseline than MoE, and have used this in our experimental evaluation.

Dimensionality of Particles

Each particle encompasses the full set of model parameters. For the neural networks used in our experiments, this means each particle has the same dimensionality as the network's parameter vector. Model architecture details are discussed in Appendix D and in our attached code.

Computational Cost and Runtime

While maintaining multiple particles increases memory usage linearly with the number of particles, the updates for each particle are independent and can be parallelized efficiently, resulting in no runtime overhead. On hardware like GPUs, given sufficient memory, this practically results in runtimes comparable to single-model training.

Justification of Theorem 3 in Non-Linear Settings

Theorem 3 illustrates that our method aligns with Bayesian updates in a simplified setting. Although Theorem 3 considers linear loss functions, it demonstrates our particle filter exactly maintains the correct posterior ratios between particles, preserving Bayesian properties. We may expect that this exact equivalence is relaxed when the loss functions become nonlinear, but it still holds approximately to the extent that loss functions are locally linear.

Relation to Gradient Descent

With $N = 1$ , our algorithm reduces to standard gradient descent with a fixed step size. We use SGD with a fixed learning rate of 0.01 for the gradient descent baselines in our experiments.

Minor Points

Loss Function Notation: We clarify that $L_t: \mathbb{R}^d \rightarrow \mathbb{R}$ means the loss function maps parameter vectors to scalar losses.

Abbreviation "BC": Yes, "BC" stands for Behavior Cloning.

Figure Captions and Language: Thank you for these suggestions. We have modified our text to clarify these points.

评论- Comments based on Author's comment

2024-11-20

Thanks for the reply of the concern. I briefly checked the revision and have some more comments/suggestions/questions.

First of all, notation in line 120 didn't change, it's still $L_t\in \mathbb{R}^n\rightarrow \mathbb{R}$ . Please revise it. There is an obvious typo in line 226 as well, "NOw" shall be now. Please carefully check your revision.
I appreciate author's clarification regarding their theorems. Nevertheless, several claims regarding their theorems might need further clarification. To be more specific, I would raise the score if the following questions can be answered appropriately:
- The "converges" question regarding in theorem 1 is answered by the claim that "With $C$ close to 1 and $\epsilon$ small", but I don't see why it is true that you can choose $C$ close to 1 and $\epsilon$ can be small enough, without defining $D$ . This claim needs to be checked by defining a metric $D$ and justify by some more evidence, it can be either theoretical analysis or numerical validation, but not statement like "we believe ......". Why Eq (5) (6) holds in general is also a myth to me.
- I think the similar question arises when checking theorem 2. Namely, by using either CE loss or RL task loss as an example, explain what is the estimated value of $k$ and $\beta$ stated in Eq. (10) (11) that make the condition holds. It will help all readers understand the significance of the bound stated in Eq. (12). Another way is to explain the claim that PL condition + GD scheme automatically satisfy the condition in Eq. (10)(11), by extending the claim as a solid example to enlighten readers that these two conditions are easily met in general learning task.
- The claim "We may expect that this exact equivalence is relaxed when the loss functions become nonlinear, but it still holds approximately to the extent that loss functions are locally linear." needs to be checked under experiment setups in Section 4. Namely, why loss in experiments are locally linear? Another clarification is that why Eq. (26) lead to " approximating Bayes optimal solutions". There is no Bayes optimization problem even being defined.

2024-11-21

Thank you for your timely and thoughtful follow-up and for bringing these important points to our attention. We appreciate your willingness to reconsider your evaluation of our paper. We have carefully addressed each of your concerns below and made corresponding revisions to the manuscript to enhance clarity and rigor.

Typos and Notation Corrections

Thank you for pointing out these oversights.

Line 120: We apologize for the confusion. The notation indicates that the loss function maps from the parameter space $\mathbb{R}^d$ to the real numbers $\mathbb{R}$ , which we have explained in our latest revision. We are happy to update the manuscript if the reviewer suggests another notation for this.

Line 226: Thank you for catching the typo; we have fixed this in our latest revision.

Clarification on Theorem 1 and Equations (5) and (6)

We appreciate the need for a concrete definition of $D$ and justification for the constants $C$ and $\epsilon$ . Here's a detailed clarification:

Defining the Discrepancy Measure $D$ : We may define $D(p, q)$ as the Wasserstein distance (specifically, the 2-Wasserstein distance) between two probability distributions $p$ and $q$ :

$D(p, q) = \left( \inf_{\gamma \in \Gamma(p, q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 d\gamma(x, y) \right)^{1/2}$ where $\Gamma(p, q)$ is the set of all couplings of $p$ and $q$ . The Wasserstein distance is appropriate for continuous distributions and provides a meaningful measure of discrepancy that satisfies non-negativity, symmetry, the triangle inequality, and $D(p, p) = 0$ .

Justifying $C \approx 1$ and Small $\epsilon$ :

Value of $C$ : In Equation (6), $C$ represents how the discrepancy between two distributions changes after an update. For our gradient-based particle filter with small step sizes $\sigma^2$ , the updates are gentle, and the movement of particles is limited. Specifically, for smooth loss functions $L$ with Lipschitz continuous gradients (i.e., $|\nabla L(x) - \nabla L(y)\| \leq L_{\nabla} \|x - y\|$ ), the gradient descent update is a Lipschitz continuous mapping with Lipschitz constant $L_\nabla$ . This supports $C \approx 1 + \sigma^2 L_{\nabla}$ , which approaches $1$ as $\sigma^2$ goes to $0$ .
Value of $\epsilon$ : The term $\epsilon$ captures the approximation error between the particle filter update and the true Bayesian update. In our gradient-based particle filter, the error due to linear approximation of $L$ can be bounded using Taylor's theorem. For twice-differentiable functions, the second-order remainder term involves the Hessian, and with small $\sigma^2$ , this term becomes negligible.

When Equations (5) and (6) Hold:

Equation (5): This is a property of the discrepancy measure. If we choose it to be Wasserstein distance, for example, then it holds automatically.
Equation (6): This inequality reflects the stability of the particle filter update with respect to the initial discrepancy between $\hat{p}$ and $\hat{q}$ . It holds under the assumption that the update operator is Lipschitz continuous in the Wasserstein distance. For our particle filter, this is justified by the small gradient steps and the smoothness of the loss function as we explain above.

Clarification on Theorem 2 and Conditions in Equations (10) and (11)

We appreciate the suggestion to use a solid example to illustrate these assumptions. We will use Wasserstein distance as our discrepancy metric $D$ for this example. Consider a loss function $L$ satisfying the PL condition and Lipschitz continuity of $L$ and its gradient:

$||\nabla L(x)||^2 \geq \mu L(x)$

$|L(x) - L(y)| \leq \ell ||x -y||$

$||\nabla L(x) - \nabla L(y)|| \leq L_{\nabla} ||x - y||$

where $\mu$ , $\ell$ , and $L_\nabla$ are constants, and the minimum value of the loss is $0$ . Recall that the PL condition and Lipschitz continuity of the gradient of $L$ guarantees that gradient descent steps of learning rate $\eta$ reduce the loss by a factor of $1 - (\eta - \frac{L \eta^2}{2}) \mu$ . Assuming our update rule is gradient descent, we then have:

Equation (10):

$\mathbb{E}_{x \sim p}[L(x)]$

$-\mathbb{E}_{x \sim q}[L(x)]$

$\leq \ell D(p, q)$

Equation (11):

$\mathbb{E}_{x \sim \hat p[L]}[L(x)]$

$\leq (1 - (\eta - \frac{L \eta^2}{2}) \mu) \mathbb{E}_{x \sim \hat p}[L(x)]$

On Loss Functions Being Locally Linear

We note that loss functions are locally linear in terms of the model parameters under two conditions:

The loss as a function of the model output is differentiable
The model output as a function of the model parameters is differentiable

In our experiments, we use losses and models that are differentiable everywhere (or almost everywhere in the case of ReLU-activated models). Thus, our loss functions are locally linear.

2024-11-21

On Approximating Bayes Optimal Solutions

Note that the key characteristic of a Bayes optimal particle filter is that the particle density in any region of the space is proportional to the true density $p_T$ in the region. Thus, we would expect the weight of any particle to be proportional to the true density:

$w^{(i)} \propto p_T(x^{(i)}_T)$

This is exactly what equation (26) shows in another form: the ratio of particle weights between any two particles is the ratio of their posterior density.

We hope that these explanations address your concerns and provide the necessary justification for our theoretical claims. If these explanations are satisfactory, we are happy to add them to the manuscript.

Your feedback has been invaluable and we appreciate your consideration of our responses. Please let us know if there are any further questions or if additional clarifications are needed.

2024-11-23

Thank you for the reply. However, these justifications does not fully resolve the concern.

From the setup you provide, $C\approx 1 +\sigma^2 L_{\nabla}$ is a claim one needs to prove as a lemma or theorem. On top of my head it is vague and I have no idea why it is an obvious conclusion, especially with Wasserstein distance setup. If possible, can you provide some evidence in previous works to support your claim? Justification of $\epsilon$ is fine, but it requires a clean statement in manuscript to reflect that this is a valid assumption. This assumption supports reviewer eGfq's concern that, if your learning time is long, say $T\gg 1$ , then $(1+\sigma^2)^{T-2}$ , with fixed sigma, is a diverging constant, it might give you a bound which fixed gradient descent scheme can achieve easily as well (think fixed scheme as dirac-delta density). How to reconcile the concern?

Moreover, I noticed authors might mix up the probability density and a fixed scheme. Explaining why Eq.(10) (11) holds is non-trivial: by stacking up the constant, the author implicitly assumed fixed gradient descent scheme, whereas the energy model $\exp(-L(x))$ , which reweigh particles, disappears when computing expectation. From my point of view, this is an inconsistency of presentation, I wonder if authors can explain further, whether or not $E_{\hat{p}[L]}[L(x)] = \int_x \frac{L(x)e^{-L(x)}}{Z} dx$ , with $Z$ partition function (same as normalization constant), and if this is the correct estimation as I understand, how eq (10)(11) hold in this case.

For locally linear assumption, I presume authors are claiming "locally differentiable". This part is fine but I wonder if there are some ways of validating the theorem 3 even if first order Taylor expansion is an approximation.

评论- Regarding Approximating Bayes Optimal Solutions

2024-11-23

I appreciate the justification from authors, and I would recommend authors state the fact in a more convinced manner by citing previous works. I wonder if Hoeting et. al. has justification of these statements or consider citation of some textbook like [1] which partially support your claim.

[1]: Simo Srkk. 2013. Bayesian Filtering and Smoothing. Cambridge University Press, USA.

2024-11-23

Thank you for your continued engagement with our paper and for your insightful comments. We appreciate your suggestions and have worked diligently to address each of your concerns. Below, we provide detailed responses and have made corresponding revisions to the manuscript to enhance clarity and rigor.

Approximating Bayes Optimal Solutions

Thank you for this suggestion. In the revised manuscript, we have more carefully explained how Theorem 3 implies that in the limit of infinitely many particles, the particle filter exactly matches the posterior distribution and have cited prior work in support of this (Doucet et al.).

Clarification on $C \approx 1 + \sigma^2 L_{\nabla}$ and the Wasserstein Distance

We appreciate the need for a formal justification. We are happy to provide a formal lemma here, and if satisfactory, can add it to our manuscript:

Lemma: Let $D$ be the 2-Wasserstein distance, and suppose the loss function $L$ has a Lipschitz continuous gradient with Lipschitz constant $L_{\nabla}$ . Then, for the gradient descent update $x_{t+1} = x_t - \sigma^2 \nabla L(x_t)$ , the following holds:

$D(\hat{p}[L], \hat{q}[L]) \leq (1 + \sigma^2 L_{\nabla}) D(\hat{p}, \hat{q})$ .

Proof Sketch:

The gradient descent update map $\phi(x) = x - \sigma^2 \nabla L(x)$ is Lipschitz continuous with Lipschitz constant $\text{Lip}(\phi) = 1 + \sigma^2 L_{\nabla}$ .
Using properties of the Wasserstein distance under Lipschitz mappings (Santambroglio, 2015), we have:

$D(\hat{p}[L], \hat{q}[L]) = D(\hat{p} \circ \phi, \hat{q} \circ \phi) \leq \text{Lip}(\phi) D(\hat{p}, \hat{q})D(p^[L],q^[L])$

We have also indicated the error scaling of all approximations in our particle filter derivation, and indicated the error we would expect in the final result as a function of $\sigma^2$ .

Exponential Dependence on $T$

You express concern that with $T \gg 1$ , the term $(1 + \sigma^2 L_{\nabla})^{T-2}$ may diverge, making the bound less practically useful. We acknowledge that exponential growth with respect to $T$ can be problematic. However, with small $\sigma^2$ and bounded $L_{\nabla}$ , $C^{T-2} = (1 + \sigma^2 L_{\nabla})^{T-2} \approx 1 + (T-2) \sigma^2 L_{\nabla})$ , which is only a linear dependence on $T$ . Note that this approximation requires $\sigma^2$ to approach zero faster than $T$ approaches infinity.

If $\sigma^2$ is fixed and $T$ approaches $\infty$ , then we don't believe it is possible to remove the exponential dependence of the bound on $T$ without stronger assumptions.

Clarification on Equations (10) and (11)

If we understand correctly, you are asking us to justify Equations (10) and (11) in the case that our update rule is not gradient descent, but rather our particle filter update. We justify this below.

Note that our particle filter has two update steps: 1) the gradient descent update, 2) the weight updates of each particle. We make the same assumptions about $L$ as before:

$||\nabla L(x)||^2 \geq \mu L(x)$

$|L(x) - L(y)| \leq \ell ||x -y||$

$||\nabla L(x) - \nabla L(y)|| \leq L_{\nabla} ||x - y||$

where $\mu$ , $\ell$ , and $L_\nabla$ are constants, and the minimum value of the loss is $0$ . We denote time $t$ as pre-update and time $t+1$ as post-update.

Equation (10) holds as before:

$\mathbb{E}_{x \sim p}[L(x)]$

$- \mathbb{E}_{x \sim q}[L(x)]$

$\leq \ell D(p, q)$

Equation (11):

We may first write,

$\mathbb{E}_{x \sim \hat{p}[L]}[L(x)]$

$= \sum_i L(x^i_{t+1})$

Since $x^i_{t+1}$ is found by taking a gradient step on $x^i_t$ , we have:

$L(x^i_{t+1}) \leq (1 - (\eta - \frac{L \eta^2}{2}) \mu) L(x^i_t)$

by the same argument from earlier. Summing over $i$ on both sides, we have:

$\mathbb{E}_{x \sim \hat{p}[L]}[L(x)]$

$\leq (1 - (\eta - \frac{L \eta^2}{2}) \mu) \mathbb{E}_{x \sim \hat{p}}[L(x)]$

Validating Theorem 3 with First-Order Taylor Approximation

To clarify, we believe Theorem 3 holds approximately (within some error bounds) when loss functions are differentiable (and thus locally linear).

If we understand correctly, you are asking whether Theorem 3 remains correct when we consider higher order terms: when perturbations to $x$ are large enough that $L$ changes non-linearly.

Under our analytical approach, which relies on local linearity of the loss function, we believe Theorem 3 would not remain true; in fact, the derivation of our particle filter itself relies on local linearity. On the other hand, we note that with sufficiently small step sizes (i.e. $\sigma^2$ near 0), any differentiable loss function can be treated as locally linear. Thus, we expect that the Bayesian properties of our particle filter break down with large step sizes.

We hope that these detailed explanations and revisions address your concerns. Your feedback has significantly improved the clarity and rigor of our paper. We are committed to ensuring that our work is both theoretically sound and practically relevant. Thank you again for your thoughtful consideration.

2024-11-26

Thank you for making these justifications. I raise the score due to the promise that all discussion regarding justification of theorems shall be added into manuscript after revision to improve the quality of presentation of this paper. Nevertheless, I am not fully convinced these theorems help readers understand how important the result in section 4 is, and the justification of these theorems are supposed to be incorporated before submission. The original hidden assumption of fixed gradient descent scheme in justification (rather than considering particle filter algorithm) harm the soundness of this paper, and my suggestion is a major revision in section 3 and reevaluate if these theorems are necessary to reflect the correctness of Algorithm 1.

2024-11-26

Thank you for your willingness to reconsider your evaluation!

We are absolutely committed to making any suggested changes that improve the presentation of our paper and we sincerely thank you for your suggestions.

审稿意见

评分: 6置信度: 32024-11-02

The paper "Permutation Invariant Learning with High-Dimensional Particle Filters" proposes a particle-filtering inspired continual learning system. A bound is dericed to show that the method is approximately permutation invariant, and a connection to the avoidance of catastrophic forgetting is established. Under the assumpation of Gaussian probability masses around the particles, a gradient-based update is derived. Empirically the method is shown to perform very competitive on two continual learning supervised datasets (SplitMNIST and SplitCIFAR100), and in one reinforcement learnign setting (ProcGen).

优点

the method outperforms, or performs very well against the baselines. In addition, it is shown that method is complementary, and when combined with other methods, can improve their performance.
the idea of the paper can be followed well (though due to the complexity of the problem some central derivations are in the appendix)
moving towards more real-world problems, the topic of continual learning is an important one to tackle for the community

缺点

The document should contain a section on Limitations. One thing that should be mentioned is that the algorithm needs N times more memory to keep the model parameters (or compute) for N being the particles, than a simple standard method with N=1. In particular, given that the community moved to very large models, this is a potentially very big limitation. If you made any design choices to overcome this limitation this should be discussed (or metnioned as future work).
the writing could be made more accessible. Overall the authors do a good job in explicating the ideas. However, the paper would profit from some high-level summaries and previews before more complex topics. For example, I'd recommend adding a summary description of the derivation in section 3.4 (Gaussian approximation of the particles and Taylor approximation + simplification of terms) before actually doing it. It should also be discussed that the algorithm is independently operating on all i (?) - the only coupling is in the ensembling with the weights. This is important for parallelization, which should be discussed.
In the abstract, it should be mentioend that the permutation invariance is approximate and not exact. E.g. change the text 'we introduce a novel permutation-invariant ...' to 'we introduce a novel approximately permutation-invariant ...'.
In Table 1 please make the best performing nubmers bold

问题

as the particle filter 'replicates' the model N times - how would a system of N times the size perform (potentially with other regularizations)? There are many architectures to make such a system efficient as well (e.g. Mixture of Expert ...). An experiment analyzing the performance as a function of model size (could be for N=1, but should be compared to the method proposed) should be added to clarify this.

2024-11-20

Thank you for your constructive feedback and for acknowledging the strengths of our paper. We address your comments below.

Limitations Section

We appreciate the suggestion to discuss limitations. We have now added limitations to our conclusion section to produce a unified discussion section. We highlight two limitations:

Maintaining multiple particles increases memory usage linearly with $N$ . We acknowledge this as a limitation, especially with large models. One possible workaround is to train multiple models in series, although this trades the memory cost with a time cost.

Another limitation is that our method alone is often not as effective as in combination with other methods. This limits its use as a stand-alone algorithm, and we instead advocate using it as an ensembling strategy to enhance the performance of an existing method (to address continual learning or loss-of-plasticity, for example).

Accessibility of Writing

Thank you for your suggestions! We have made the specific changes you suggested and are happy to make any further modifications to improve the clarity of our work.

Permutation Invariance Claim

Thank you for this point; we have modified our abstract as you suggested.

Table Formatting

Thank you! We have now bolded the best-performing numbers in Table 1 for clarity.

Comparison with Larger Models

Thank you for this suggestion. We are conducting experiments to compare our method with larger single models and will include the results in the revised manuscript once available.

2024-11-28

Thank you for the feedback on the comments. I appreciate the additions you made to the paper. Reading also the comments of the other reviewers, I decided to maintain my score.

审稿意见

评分: 3置信度: 32024-11-04

In sequential learning, the permutation dependency of prevalent optimization algorithms such as gradient descent might suffer from catastrophic overfitting -- severe degradation in performance on earlier tasks -- or loss of plasticity -- limited adaptive capabilities to the new tasks. Nearly permutation-invariant training strategies can address these challenges. This paper proposes particle filters as an instantiation of nearly permutation-invariant training strategies and presents guarantees under some ideal assumption on the quality of the particle filters. To deal with the high-dimensional nature of machine learning papers, a gradient-based particle filter is proposed and evaluated on continual learning and lifelong reinforcement learning.

优点

The paper identifies an interesting perspective based on permutation invariance to tackle two challenges observed in sequential learning. Experimentally, the results seem to correlate with this idea.

缺点

Idealization of the particle filter updates: it is unclear why (6) and (7) hold for suitable constants. I recommend authors to give some examples so that this is more understandable.
The bounds in (8), (9), (12) are all exponential in $T$ and are potentially vacuous. These exponential terms in $T$ are considered as some constants and not discussed at all. I recommend authors to explain why they think these constants are small. As an example, consider Theorem 2 and (12). The loss is upper bounded by $M$ and I don't see why the exponential term is going to be much smaller than $M$ and it seems like (12) is vacuous.
It is unclear why the approximate solution in (15) satisfies the Bayesian properties of particle filters. This is stated in L329-331 but no justification isn't given apart from Theorem 3 which is for a very restricted setting. Therefore, I am skeptical that this is any different from N models independently trained with gradient descent.

问题

Can you explain the notation $\hat{p}[L]$ ?
Can you explain what do you mean in L147? Does the particle filter verify two competing conditions: estimating the full distribution $p_t$ and estimating the global minimizer of $p_t$ ? If all particles are estimates of the global minimizer, how come they represent the full distribution?
Why do you think it is possible to verify (6) and (7) for particle filters in high-dimensional problems?
Can you comment on exponential nature of your bounds in $T$ . Aren't they strictly worse that linear bounds in $T$ ? Any sublinear regret bounds from online optimization seems to be much tighter for the problem setting at hand.

2024-11-20

Thank you for your insightful review and for highlighting both the strengths and areas for improvement in our paper. We address your concerns point by point.

Clarification on Equations (6) and (7)

Why do Equations (6) and (7) hold for suitable constants?

Equation (6): This equation states that the discrepancy between the updated particle filter distributions $D(\hat{p}[L], \hat{q}[L])$ is bounded by $C$ times the discrepancy between the initial distributions. Intuitively, this means that if two particle filters start off similar, they remain similar after an update. This holds under the assumption that the particle filter update is Lipschitz continuous with respect to the discrepancy measure D. In practice, this is reasonable for particle filters where updates involve smooth operations like weighting and resampling.

Equation (7): This equation bounds the discrepancy between the updated particle filter distribution $\hat{p}[L]$ and the true Bayesian posterior $p(\cdot|L)$ . It reflects the particle filter's approximation of the Bayesian update, where $\epsilon$ captures the error due to approximations (e.g., finite number of particles, linearization). This is typical in particle filter analyses, where each update aims to approximate the true posterior as closely as possible.

Exponential Bounds in T and Potential Vacuity

We acknowledge that the exponential dependence on $T$ may seem concerning. However, the growth rate is governed by the constant $C$ , which represents the stability of the particle filter.

Interpretation of $C$ : In practice, $C$ is typically close to 1. It measures how small variations in particles propagate through updates: in other words, the stability of updates. If $C \approx 1$ , then small fluctuations in the initialization of particles does not significantly affect the outcome after training. We expect this to be a reasonable assumption for many practical algorithms.

Non-Vacuous Bounds: In Theorem 2, the loss is upper bounded by $\beta M$ plus a term involving $C$ and $\epsilon$ . If $\epsilon$ is small and $C$ is close to 1, the additional term remains small compared to $\beta M$ . Thus, the bound is not vacuous and provides a meaningful guarantee on the loss.

Comparison with Linear Bounds: While sublinear regret bounds from online optimization are tighter, our analysis focuses on the behavior of particle filters approximating Bayesian updates, which is a different setting. The exponential term arises due to the recursive nature of the discrepancy bound. Under our assumptions, these are the tightest bounds we can obtain.

Approximate Bayesian Properties of Our Solution

It is unclear why the approximate solution in (15) satisfies the Bayesian properties of particle filters.

Our gradient-based particle filter is designed to approximate the Bayesian update efficiently in high-dimensional spaces. Main Approximation in Equation (15): We use a Gaussian approximation around each particle and linearize the loss function. This is exact in the limit as the Gaussian variance $\sigma^2$ approaches zero, effectively capturing the local behavior of the loss function. Theoretical Justification (Theorem 3): Although Theorem 3 considers linear loss functions, it demonstrates that in this setting, our particle filter exactly maintains the correct posterior ratios between particles, preserving Bayesian properties. We may expect that this exact equivalence is relaxed when the loss functions become nonlinear, but still hold approximately to the extent that loss functions are locally linear.

Empirical Evidence: Our experiments show that our method achieves permutation invariance and mitigates catastrophic forgetting, consistent with Bayesian approaches. Additionally, our method differs from independent gradient descent models because particle weights are updated based on their likelihoods, reflecting the posterior distribution.

Clarification on Notation

Can you explain the notation $\hat{p}[L]$ ?

Yes, as we explain in Section 3.1, $\hat{p}[L]$ denotes the updated particle filter after processing the loss function $L$ . Starting from the particle distribution $\hat{p}$ , the update incorporates the information from $L$ to produce a new distribution $\hat{p}[L]$ .

Objective of the Particle Filter

Our particle filter aims to approximate the Bayesian posterior distribution $p_T(x)$ , which is proportional to $p_0(x) e^{-\sum_{t=1}^T L_t(x)}$ . While the posterior places higher probability density near the global minimizer of the total loss, it represents the full distribution over model parameters. The particles collectively capture this distribution, including uncertainty and multiple modes, rather than just estimating the global minimizer.

2024-11-22

Thank you for your comments and explanations. My position remains the same, and I will maintain my initial score.

I am not convinced by the arguments provided by the authors, but I still recommend that the authors incorporate the discussions they provided on Equation (6), (7), and the comments on exponential and linear dependencies on $T$ .

2024-11-23

Thank you for your feedback and for taking the time to consider our responses. We understand that you remain unconvinced by our arguments, but we appreciate your suggestion to incorporate the discussions on Equations (6) and (7) and the comments on the exponential and linear dependencies on $T$ .

In the revised manuscript, we have:

Discussed why equations (6) and (7) hold for our gradient-based particle filter
After Theorem 1, addressed the potential concerns regarding the bound's exponential growth with respect to $T$ and how, in practice, the constants involved mitigate this growth

We hope that these additions enhance the clarity and rigor of our paper. Your feedback has been valuable in improving the presentation and thoroughness of our work, and we are happy to further elaborate on any arguments we have provided. Thank you again for your time and consideration.

审稿意见

评分: 8置信度: 32024-11-04

The authors introduce a novel particle-filter based method for learning in deep neural networks. This method uses a set of particles and updates each particle using a local first-order approximation to the loss, which theoretically grounds both the updates to the particle itself and its weight with respect to the overall set of particles. This method therefore combines the appealing properties of gradient descent with respect to high-dimensional weight spaces with appealing properties of particle filters, in particular their (approximate) permutation independence. They then demonstrate that this method improves performance on a number of continual and lifelong learning benchmarks, in particular when combined with other continual learning methods.

优点

This is an excellent paper. The authors nicely set up the motivation by explaining prevailing challenges in continual learning, clearly explain why particle filters can address this challenge, and then explain their own method. The benchmark experiments nicely demonstrate the strength of their method. Indeed, I believe that this method could inspire a host of follow-up research, further digging into how to combine particle filters and gradient descent-based optimization. I therefore strongly recommend acceptance.

Below are a few parts of papers I appreciated in particular:

The introduction really clearly sets up the lens of permutation invariance, which provides a great motivation for particle filters.
Section 3 provides a set of mathematical insights that make this intuition rigorous.
The introduced algorithm is theoretically principled and I really appreciated that the authors were able to explain it so succinctly in the main text.
I thought the authors' finding that this method can be combined with a range of other continual learning methods was really insightful and moved beyond a mere benchmark comparison.

缺点

I have two primary concerns I'd like to see the authors address:

1 Further related work

Your method seems related to the use of ensembles for continual learning, e.g. [1-3]. Could you discuss the relation of your paper to this prior work?

2 Connection between sections 3.2-3.3 and 3.4

In sections 3.2 and 3.3 you provided a set of guarantees about particle filters under assumptions (6) and (7). You then introduce your own method, but don't explain whether this method meets these assumptions. Could you discuss whether your method meets these assumptions and if so, what these constants are for your method? If it is unknown whether your method meets these assumption, could you make that clear?

Rypeść, Grzegorz, et al. (2024). "Divide and not forget: Ensemble of selectively trained experts in Continual Learning." The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=sSyytcewxe
Soutif–Cormerais, A., Carta, A. & van de Weijer, J.. (2023). Improving Online Continual Learning Performance and Stability with Temporal Ensembles. Proceedings of The 2nd Conference on Lifelong Learning Agents. https://proceedings.mlr.press/v232/soutif-cormerais23a.html.
Wen, Yeming, Dustin Tran, and Jimmy Ba (2020). "BatchEnsemble: an Alternative Approach to Efficient Ensemble and Lifelong Learning." International Conference on Learning Representations. https://openreview.net/forum?id=Sklf1yrYDr

问题

See weaknesses.pure

Further questions/suggestions:

It seems that you're providing an initial connection between particle filters and gradient descent which seems to provides a lot of potential for future work in extending these methods, e.g. with adaptive gradients. You note the potential for such future work in the last sentence, but I'm curious if you have any directions for future work that you are particularly excited about?
L. 256: Under linear approximation, yes, but in practice, this equality is approximate, right? So when defining $w_{t+1}^{(i)}$ , do you use the approximation or the exact quantity for $L_{t+1}(x_{t+1}^{(i)})$ ?
A minor point: I think formatting table 1 to clearly delineate continual learning baselines versus continual learning baselines + particle filters (e.g. by using two different columns for the baseline alone versus the baseline + PF) would make it easier for the reader to compare them both to each other and to the Particle Methods above.

2024-11-20

Thank you for your positive feedback and encouraging comments. We are glad that you found our method theoretically principled and practically effective. We address your questions below.

Relation to Ensemble Methods in Continual Learning

Thank you for pointing out relevant works on ensembles in continual learning [1-3]. We have added a discussion in the related work section.

While these works employ ensembles to mitigate forgetting, our method differs by providing a Bayesian framework with theoretical guarantees like permutation invariance. Additionally, we focus on approximating the posterior distribution over model parameters rather than just maintaining multiple models.

Connection Between Sections 3.2-3.3 and 3.4

The particle filter proposed in Section 3.4 does satisfy the conditions of Section 3.2-3.3 given three key additional assumptions:

The gradient descent step of the algorithm is sufficiently stable such that equation 6 holds; small changes to the initialization of particles make only small changes to the trained particles
The step size $\sigma^2$ is small enough; this makes the assumption of the local linearity of the loss valid in Section 3.4
The task loss function $L$ satisfies the conditions of Theorem 2; these are conditions specifying how smooth and optimizable the loss function is

The specific constants to make these equivalences are highly dependent on the task loss function; thus, it is difficult to define a general set of constants that would hold these assumptions hold. Nevertheless, we expect that for any particular set of tasks, it is possible to define constants such that our proposed particle filter fits the assumptions in Sections 3.2-3.3.

Future Work Directions

We believe the one exciting direction for future work is applying our approach to machine forgetting. Machine forgetting is a challenging problem in which the goal is to modify a trained model so as to remove the effect of a particular training input (or set of inputs) on the trained model. With typical gradient-based learning, this is challenging because a single training input can completely change the course of a model's optimization trajectory, making it difficult to undo the effect of that input.

Permutation-invariant algorithms on the other hand can easily forget arbitrary past inputs because any training input can be considered the "last" training datapoint by permuting the training points. Forgetting then simply involves undoing the effect of the last learning step.

Equality in Line 256

Thank you for pointing this out. The equality on line 256 is approximate under linear approximation of the loss. In defining $w_{t+1}^{(i)}$ , we use the exact values computed from the loss function on the new point; thus, equation 23 is exact. We have updated our notation in the manuscript to reflect this.

Table Formatting

Thank you for the suggestion. We have reformatted Table 1 to clearly distinguish between continual learning baselines and those combined with our particle filter.

2024-11-26

Thank you for your response. I appreciate the points about future work in machine forgetting.

On the connection between section 3.2-3.3 and 3.4: I appreciate the clarification --- I feel like going through a specific example, so as to illustrate how these constants could be determined and in order to provide empirical illustration could be a helpful way of further illustrating this and help further connecting the sections.

That said, I have really enjoyed reading the paper and continue to think it should be accepted to the conference.

2024-11-26

Thank you for your response!

We agree that going through an example to connect our particle filter to the assumptions in 3.2-3.3 would certainly be useful and we are committed to adding this in our revision. Thank you for your valuable suggestion.

AC 元评审

2024-12-19

The paper tackles the problem of dependence of training on batch ordering in machine learning. In particular, the authors argue that issues such as forgetting and loss of plasticity can be addressed if the training is invariant to batch ordering. The authors note that the true Bayesian posterior can be decomposed as a product of conditional updates over the data. They then develop a particle filter scheme for approximating the true Bayesian posterior. They show that this method achieves good performance in continual learning and lifelong RL settings.

Strengths:

The paper is well-motivated and targets an important problem
The proposed method is simple and makes intuitive sense
Authors derive new theoretical bounds related to permutation invariance and forgetting with particle filtering methods
The empirical results reported are promising

Weaknesses:

The authors state that Bayesian model averaging has not been explored in continual learning. I believe this is not true, the idea that true Bayesian inference would resolve forgetting is very well known and motivated multiple papers, for example see [1, 2, 3, 4]. Similarly, particle evolution methods are not a new idea for approximate Bayesian inference, see e.g. [5, 6]. So the main novelty on the paper is specifically in applying (a possibly novel) particle filtering scheme to Bayesian continual learning.

Overall, BMA has not been extensively explored in the context of continual or permutation-invariant learning (Line 103)

As the reviewers pointed out, the bounds have exponential terms $C^T$ , which are potentially vacuous, depending on unknown constants $C$ .
The presentation should be improved. In particular, I (and several reviewers) was confused by the notation $\hat p(L)$ . I think the authors should explain carefully what a particle filter is and how its update rule works.
In the end, the method reduces to training an ensemble of independent SGD solutions, with a weighting scheme.
- A priori, this seems very unlikely to work. In particular if the batch ordering is the same for all ensemble members, then they all should experience similar levels of forgetting. No weighting on the ensemble members can fix forgetting.
Possibly most importantly, the experiments do not provide sufficiently strong evidence that the method is working well. In order to trust the results, the experiments should be conducted in a setting that was considered by prior work, with the same architecture. Causes for concerns:
- The authors do not specify the architecture they are using.
- On the image datasets, the absolute performance is very low: 48-80% on SplitMNIST and 19-29% on SplitCIFAR100.
- It is not clear to me if the same batch ordering is used across all ensemble components (particles). If that is the case, it is not clear why there is such extreme deviation in performance between ensemble members.
- It is not clear how exactly the ensemble is evaluated. I am assuming it is the accuracy with averaged predictions, but the paper says

with test accuracy evaluated as a weighted average across particles (L404)

Decision recommendation: Based on the arguments above, I recommend rejecting the paper in its current form. I believe that the paper could make a strong contribution if the following issues are addressed: (1) clarity of presentation, (2) adding all details on the experiments, (3) careful literature review of related work in Bayesian deep learning, and most importantly (3) significantly improved experimental evaluation.

[1] Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting Hippolyt Ritter, Aleksandar Botev, David Barber

[2] Continual Learning Using Bayesian Neural Networks HongLin Li, Payam Barnaghi, Shirin Enshaeifar, Frieder Ganz

[3] Bayesian Incremental Learning for Deep Neural Networks Max Kochurov, Timur Garipov, Dmitry Podoprikhin, Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov

[4] A Unifying Bayesian View of Continual Learning Sebastian Farquhar, Yarin Gal

[5] Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu, Dilin Wang

[6] Bayesian Inference with Anchored Ensembles of Neural Networks, and Application to Exploration in Reinforcement Learning Tim Pearce, Nicolas Anastassacos, Mohamed Zaki, Andy Neely

审稿人讨论附加意见

The paper received mixed reviews (2 accepting and 2 rejecting). Multiple reviewers expressed concerns over clarity of presentation and the interpretation of bounds. The authors provided a rebuttal and the reviewers engaged in a discussion. Two of the reviewers remained unconvinced by the rebuttal.

最终决定Reject

2025-01-22

Reject