5.8

/10

Rejected4 位审稿人

最低2最高7标准差2.2

7

7

7

2

3.5

置信度

正确性2.5

贡献度2.5

表达3.3

NeurIPS 2024

SymmetricDiffusers: Learning Discrete Diffusion on Finite Symmetric Groups

Yongxing Zhang,Donglin Yang,Renjie Liao

提交: 2024-05-14更新: 2024-11-06

摘要

Finite symmetric groups $S_n$ are essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over $S_n$ poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over $S_n$ by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded "denoising schedule" to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems.

关键词

Finite Symmetric GroupsDiscrete DiffusionPermutationRiffle Shufflesthe Plackett-Luce DistributionSortingJigsaw Puzzle

评审与讨论

审稿意见

评分: 7置信度: 12024-06-28

I am unable to review this paper as it lies outside my area of expertise.

优点

I am unable to review this paper as it lies outside my area of expertise.

缺点

I am unable to review this paper as it lies outside my area of expertise.

问题

I am unable to review this paper as it lies outside my area of expertise.

局限性

I am unable to review this paper as it lies outside my area of expertise.

审稿意见

评分: 7置信度: 42024-06-30

The authors aim to create a discrete diffusion model that generates permutations. This model can then be used to solve combinatorial problems including jigsaws and travelling salesman problems. To formulate their model they cover a range of forward shuffling strategies and discuss how to parametrize the reverse transition. During sampling, they also use beam search to find high probability samples. They find their method performs competitively on computational experiments.

优点

The idea of using riffle shuffles to create a corruption process over permutations and then parametrizing the time reversal of this process is novel and interesting. I enjoyed reading the paper. I believe further work will build on this as there are many instances in machine learning where needing to learn permutations crops up.

The paper is quite well written and easy to understand. It is not overloaded with mathematical equations and intuition is given for some concepts.

The experimental results seem promising as it performs on par or better (especially in high dimensions) than previous methods for learning permutations. I appreciate the ablation studies into the greedy search vs beam search and types of shuffle (in the appendix).

缺点

I think some further clarification is required for the weaknesses of the random transposition and random insertion style of card shuffling. You later say that you can merge steps of the forward process if each individual step does not induce enough mixing and so the stated weakness that these styles of shuffle have slow mixing seems moot.

You mention that you do not have access to $q(X_t | X_0)$ , and I think it should also be discussed that $q(X_{t-1} | X_t, X_0)$ is also unavailable since this distribution is used in standard diffusion models to re-write the variational bound in a lower variance form, see Appendix A in https://arxiv.org/pdf/2006.11239 .

You dedicate a lot of space to discussing the various forward noising processes with different shuffling methods, which is quite interesting. However, the ablations with these different styles of shuffle are in the appendix and I think it should be in the main since they have been given such prominence earlier on in the discussion of the method, it is strange they are not included in the main experiments.

I find it difficult to follow the description of the inverse transposition parametrization, there is no intuition given for the functions $\phi$ and $\psi$ nor the functional form of $p_{IT}(\sigma)$ . Perhaps this is due to space limitations but since inverse transposition is not in the main experiments (see above point), I think you should either relegate a lot of this to the appendix if you only use the riffle shuffle in practice, or try and shift the wording to properly explain these types of forward and inverse process and have experiments for them in the main.

问题

When you compute the loss function in equation (10), do you do T calls to the neural network because you calculate the full variational bound for every iteration? This would indeed be very memory intensive, isn't there an alternative where although you need to run the full forward process since $q(X_t | X_0)$ is intractable, you could still only compute the loss on some subset of the $p_\theta(X_{t-1} | X_t)$ reverse transitions.

局限性

The authors do a good job of discussing the limitations of various parametrizations of their method.

作者回复

2024-08-06

Thank you for the insightful and constructive comments. We appreciate your positive feedback and address the questions below.

Q1: You later say that you can merge steps of the forward process if each individual step does not induce enough mixing and so the stated weakness that these styles of shuffle have slow mixing seems moot.

A1: We’d like to clarify that we can only merge steps in the reverse process and not in the forward process. While merging steps in the reverse process makes loss computation more efficient, we still need to run the whole forward process for random transpositions and random insertions. And since the mixing time for random transpositions and random insertions are much slower than riffle shuffles, riffle shuffles should be the preferred shuffling method.

Q2: I think it should also be discussed that $q(X_{t-1}|X_t,X_0)$ is also unavailable since this distribution is used in standard diffusion models to re-write the variational bound in a lower variance form...

A2: Yes, the posterior $q(X_{t-1}|X_t,X_0)$ is unavailable for most shuffling methods, so we cannot re-write the ELBO in a lower variance form like in common diffusion models. It is worth noting that even though $q(X_{t-1}|X_t,X_0)$ is available for riffle shuffles, the common parameterization presented in diffusion models is still not helpful. Please refer to the General Replies for a detailed discussion.

Q3: However, the ablations with these different styles of shuffle are in the appendix and I think it should be in the main …

A3: We appreciate this suggestion and will reorganize the paper's structure to include the ablation.

Q4: I find it difficult to follow the description of the inverse transposition parametrization, there is no intuition given for the functions $\phi$ and $\psi$ nor the functional form of $p_{\text{IT}}(\sigma)$ ...

A4: We apologize for the confusion regarding the inverse transposition parameterization. Due to the space limit, we indeed shortened the description. We will reorganize the sections and include the following intuitions.

The support of the IT distribution is the set of all transpositions plus the identity permutation. Let us treat the identity permutation $\mathrm{Id}$ separately, and we use a parameter $\tau$ to assign $p_{\mathrm{IT}}(\mathrm{Id}) = 1 - \mathrm{sigmoid}(\tau)$ . Then we assign probabilities to the transpositions. A transposition is essentially an unordered pair of distinct indices, so we use $n$ parameters $\mathbf{s}=(s_1,\ldots,s_n)$ to represent the logits of each index getting picked. Thus, for indices $i\neq j$ , it is natural to let

p_{\mathrm{IT}}(\sigma) = \mathrm{sigmoid}(\tau)\left( \frac{\exp(s_i)}{\sum_{k=1}^{n}\exp(s_k)}\cdot\frac{\exp{(s_j)}}{\sum_{k\neq i}\exp(s_k)} + \frac{\exp(s_j)}{\sum_{k=1}^{n}\exp(s_k)}\cdot\frac{\exp{(s_i)}}{\sum_{k\neq j}\exp(s_k)} \right)

for a transposition $\sigma=\begin{pmatrix}i&j\end{pmatrix}\neq\mathrm{Id}$ , where $\mathrm{sigmoid}(\tau)$ is the probability of not choosing $\mathrm{Id}$ . The expression in the parentheses is the probability of choosing the unordered pair $i$ and $j$ , which is equal to the probability of choosing $i$ and then $j$ , plus the probability of choosing $j$ and then $i$ .

Q5: Isn't there an alternative where although you need to run the full forward process since $q\left(X_t | X_0\right)$ is intractable, you could still only compute the loss on some subset of the reverse transitions…

A5: Our framework allows computing the loss on some subset of the trajectory to be more memory efficient. For example, we could randomly sample one timestep from a denoising schedule $[t_0,\ldots,t_k]$ and compute the loss as

\mathbb{E}\_{p_{\text{data}}(X_0,\mathcal{X})} \mathbb{E}\_{i}\mathbb{E}\_{q(X_{t_{i-1}}|X_0)} \mathbb{E}\_{q(X_{t_{i}}|X_{t_{i-1}})} \Big[-\log p_{\theta}(X_{t_{i-1}}|X_{t_i})\Big],

omitting constant terms with respect to $\theta$ . In fact, for riffle shuffles, we could sample $X_{t_{i-1}}$ directly for arbitrary timestep $t_{i-1}$ from $X_0$ [1]. For other shuffling methods, we would have to run the entire forward process. It is also worth noting that computing the loss on a subset of the trajectory could potentially introduce more variance during training. So, there is a tradeoff, and the exact design choice should depend on the problem setup.

In our current implementations of the experiments, we make $T$ (parallelized) calls to the neural network to compute the expectation w.r.t. time in the ELBO exactly. This is because, as discussed in our paper, the forward trajectories are usually short for riffle shuffles, so the exact computing of the variational bound is feasible and can help reduce variance.

[1] Dave Bayer and Persi Diaconis. Trailing the dovetail shuffle to its lair. The Annals of Applied Probability, 2(2):294 – 313, 1992.

评论- Response to Authors

2024-08-09

Thank you for the clear rebuttal, most of my questions have been cleared up, I especially like the extra intuition on the inverse transposition method and think it would be good to include in the paper.

Regarding the mixing of the simpler shuffling styles, I still don't fully agree with slow mixing times being the direct reason for preferring riffle shuffles. If you have a slow mixing forward process you can always increase T until your process has fully mixed. I think the reason to prefer riffle in your framework is precisely because you have to compute the full VLB for every iteration. This is not standard in the diffusion framework hence why it is strange to use slow mixing as a disadvantage of these corruption styles. I don't think this is that important and more a phrasing issue. I would have phrased the narrative more: shuffling corruptions require us to evaluate the full VLB because quantities that are usually analytic are not longer analytic -> T cannot be too big -> we need fast mixing forward processes. Some discussion then of why you don't use the subset of VLB idea might also be good to include since the need to evaluate the full VLB is somewhat limiting (it limits you to fast mixing forward processes).

I have raised my score to 7.

评论- Re Response

2024-08-09

Thanks for your appreciation and valuable feedback! We agree with your logic and will rephrase our narratives accordingly. Additionally, a fast-mixing forward process is preferred from the computational perspective since the corresponding reverse process permits fast sampling. More discussion on this point will be included in the later version.

审稿意见

评分: 7置信度: 42024-07-11

This paper proposes a discrete diffusion model to learn distribution over the finite symmetric group $S_n$ . The forward process is built off of random walks on finite groups (in this case, card shuffles), and the paper learns to reverse this diffusion process with standard discrete diffusion arguments.

优点

Overall, I really like the paper's contributions and presentation.

The idea is a neat application of the discrete diffusion ideas to an important area. In particular, the structure of $S_n$ is sufficiently different from standard image/text datasets as to necessitate this paper.
The presentation is very good and the contributions are numerous.
For the experiments listed, the method seems to provide a very strong improvement over baseline methods. In particular, these other methods are based on fundamentally different technology, so this highlights that discrete diffusion can become a very promising direction here.

缺点

There are three primary weaknesses. These should all be addressable to some degree, and I'll take any response into consideration when recalibrating my final score.

The model proposes to directly learn the reverse transition densities $p_\theta(X_{t - 1}, X_t)$ . The issue with doing this for standard diffusion models is that this seems to hurt model training since it increases the variance of training (as, in particular, one must sample the two $X_{t - 1}, X_t$ for training instead of just one $X_t$ ). As such, most works use the (ultimately equivalent) mean/score-parameterizations [1, 2]. I would want to hear a bit more about if this would be applicable in the $S_n$ case (and training with this parameterization might improve the model) or if this is not possible.
(Related to the above). Since most modern discrete diffusion methods are formulated in continuous time, I think the paper would benefit greatly with a discussion about potentially extending the current methods to this realm. In particular, works like [3, 4, 5] have established a working theory for discrete diffusion in continuous time, so it would be beneficial to discuss how the proposed framework might fit into the established theory.
The experiments, while showing good results, do not show that the method is particularly scalable, which seems to be a fundamental problem in prior work that was explicitly mentioned in this paper. In particular, it seems that the maximum value of $n$ in $S_n$ is 100. While some discussion is made here that talks about transformer layers, I think large values of $n$ aren't that big of an issue in transformers due to systems like Flash Attention. So, it should be made more clear if this is a fundamental problem with the existing method, or a larger scale example (even toy) should be presented.

[1] https://arxiv.org/abs/2006.11239

[2] https://arxiv.org/abs/2011.13456

[3] https://arxiv.org/abs/2205.14987

[4] https://arxiv.org/abs/2211.16750

[5] https://arxiv.org/abs/2310.16834

问题

N/A beyond addressing the weaknesses above.

局限性

Yes.

作者回复

2024-08-06

Thank you for the insightful and constructive comments. We appreciate your positive feedback and address the questions below.

Q1: The model proposes to directly learn the reverse transition densities $p_{\theta}(X_{t-1}|X_t)$ . The issue with doing this for standard diffusion models is that this seems to hurt model training since it increases the variance of training (as, in particular, one must sample the two $X_{t-1}$ , $X_t$ for training instead of just one $X_t$ ). As such, most works use the (ultimately equivalent) mean/score-parameterizations [1, 2]. I would want to hear a bit more about if this would be applicable in the $S_n$ case (and training with this parameterization might improve the model) or if this is not possible.

A1: We agree that learning $p_{\theta}(X_{t-1}|X_t)$ directly increases the variance of training. However, the reason why previous works can use mean/score-parameterizations is that they use Gaussian kernels on continuous data, which admits analytical forms for $q(X_{t-1}|X_t,X_0)$ and $q(X_t|X_0)$ .

However, such parameterizations are not applicable to $S_n$ since the transitions are not Gaussian. In particular, the posterior $q(X_{t-1}|X_t,X_0)$ is unavailable for most shuffling methods. We cannot rewrite the ELBO in a lower variance form like in standard diffusion models. It is worth noting that even though $q(X_{t-1}|X_t,X_0)$ is available for riffle shuffles, the common parameterization presented in diffusion models is still not helpful.

Please refer to the General Replies for a detailed discussion.

Q2: In particular, works like [3, 4, 5] have established a working theory for discrete diffusion in continuous time, so it would be beneficial to discuss how the proposed framework might fit into the established theory.

A2: Thanks for bringing up this point and mentioning the interesting work. Since the shuffling methods are inherently discrete-time Markov chains, it seems non-trivial to convert them into continuous-time Markov processes. In other words, the commonly used linear ODE parameterization in [3, 4] does not match the shuffling dynamics considered in our work. On the other hand, the concrete score and the score entropy idea in [5] seem to be compatible, i.e., we could possibly change the parameterization to develop the discrete-time counterpart in our context. We will explore this topic more and add the discussion in the later version.

Q3: The experiments, while showing good results, do not show that the method is particularly scalable… While some discussion is made here that talks about transformer layers, I think large values of $n$ aren't that big of an issue in transformers due to systems like Flash Attention. So, it should be made more clear if this is a fundamental problem with the existing method, or a larger scale example (even toy) should be presented.

A3: We agree that techniques like Flash Attention could improve our model’s scalability. A large value of $n$ is indeed not a big issue since our backbone network is also a Transformer.

During submission, since some metrics (e.g., accuracy) drop significantly as $n$ increases and other competitive methods do not scale up well, we only report the results with $n\le100$ for a better comparison.

During the rebuttal, we further investigated our model for the task of sorting 4-digit MNIST numbers with $n=200$ and show the results below.

Sequence Length	Kendall-Tau	Accuracy (%)	Correct (%)
200	0.317	0	39.4

This result is already better than the performance of sorting network based models on the easier task with $n=100$ (Diffsort/Error-free Diffsort: Kendall-Tau=0.166/0.140, Correct (%)=23.2/20.1). We will include more large-scale experiments in the later version.

评论- Thank you

2024-08-08

Thank you for answering my questions. I am raising my score contingent on the fact that the authors will work in the discussion that they proposed above.

评论- Re Response

2024-08-09

Thank you for your appreciation and valuable feedback! We will include the discussion in the later version.

审稿意见

评分: 2置信度: 52024-07-13

This paper introduces SymmetricDiffusers, a new approach to learning complex distributions. It works by breaking down the problem into simpler steps: learning how to reverse a transformation using deep neural networks. The authors identify a particularly effective method for this reversal step (the riffle shuffle) and provide guidance on choosing the right length for the process based on mathematical properties. Additionally, they propose a more powerful alternative to a common distribution (the generalized Plackett-Luce distribution) and a theoretically sound strategy for improving efficiency (the denoising schedule). Experiments show that SymmetricDiffusers performs extremely well on various tasks, including sorting images, solving puzzles, and optimizing routes.

优点

The proposed method in general is interesting and seems effective in the examples studied in this paper.

Also, the problem studied is interesting.

缺点

My main concern is the proposed method is not the state-of-the-art. A clear approach is to learn invariant features and equivariant group actions using some existing methods like that proposed in Robin Winter, et al. Unsupervised Learning of Group Invariant and Equivariant Representations, NeurIPS 2022. Such an approach.

In Robin Winter et al.'s paper, the authors have studied the symmetric group and my understanding is disentangling invariant features and equivariant groups enables the design of flexible diffusion models since you only need to perform diffusion modeling in the invariant latent space.

I want to see the comparison of the approach proposed in Robin Winter et al's paper. My understanding is that Robin Winter's approach enables building state-of-the-art diffusion models for molecular generation.

问题

My main concern is the proposed method is not the state-of-the-art. A clear approach is to learn invariant features and equivariant group actions using some existing methods like that proposed in Robin Winter, et al. Unsupervised Learning of Group Invariant and Equivariant Representations, NeurIPS 2022. Such an approach.

In Robin Winter et al.'s paper, the authors have studied the symmetric group and my understanding is disentangling invariant features and equivariant groups enables the design of flexible diffusion models since you only need to perform diffusion modeling in the invariant latent space.

I want to see the comparison of the approach proposed in Robin Winter et al's paper. My understanding is that Robin Winter's approach enables building state-of-the-art diffusion models for molecular generation.

局限性

My main concern is the proposed method is not the state-of-the-art. A clear approach is to learn invariant features and equivariant group actions using some existing methods like that proposed in Robin Winter, et al. Unsupervised Learning of Group Invariant and Equivariant Representations, NeurIPS 2022. Such an approach.

In Robin Winter et al.'s paper, the authors have studied the symmetric group and my understanding is disentangling invariant features and equivariant groups enables the design of flexible diffusion models since you only need to perform diffusion modeling in the invariant latent space.

I want to see the comparison of the approach proposed in Robin Winter et al's paper. My understanding is that Robin Winter's approach enables building state-of-the-art diffusion models for molecular generation.

作者回复

2024-08-06

Q1: In Robin Winter et al.'s paper, the authors have studied the symmetric group and my understanding is disentangling invariant features and equivariant groups enables the design of flexible diffusion models since you only need to perform diffusion modeling in the invariant latent space.

A1: Thanks for introducing Winter’s interesting work [1]. Specifically, [1] employs a graph-invariant encoder, $\eta$ , ensuring that regardless of the group action $\rho_X$ applied from group $G$ on a discrete state $x \in X$ , the encoder's output remains constant ( $\eta(\rho_X(g) x) = \eta(x), \forall x \in X, \forall g \in G$ ). This invariant characteristic is particularly beneficial for molecule generation [2], where the goal is to learn a probability distribution over molecules that is $\mathrm{SE}(3)$ -invariant.

However, we’d like to clarify that performing diffusion modeling in the invariant latent space can not solve the problem in our paper. In particular, our goal is to learn a distribution over a finite symmetric group $S_n$ , where it is critical that representations of different elements within the group are distinct. For instance, since distinct permutations result in different images in Jigsaw Puzzles, an invariant encoder can not differentiate between incorrect and correct permutations based on the latent representations. Therefore, a group-invariant encoder does not solve our problem.

On the other hand, one could use the group-equivariant encoder $\psi$ proposed in [1], which predicts group actions in one shot instead of performing diffusion. However, the resulting model would be essentially similar to the one in Gumbel-Sinkhorn [3], which has been shown to be less effective than ours in several tasks.

Therefore, the approach in [1] does not align with the objectives of our study. We will discuss this work in a future version.

[1] Winter, Robin, et al. "Unsupervised learning of group invariant and equivariant representations." Advances in Neural Information Processing Systems 35 (2022): 31942-31956.

[2] Xu, Minkai, et al. "Geometric latent diffusion models for 3d molecule generation." International Conference on Machine Learning. PMLR, 2023.

[3] Mena, Gonzalo, et al. "Learning latent permutations with gumbel-sinkhorn networks." arXiv preprint arXiv:1802.08665 (2018).

Q2: My main concern is the proposed method is not the state-of-the-art… I want to see the comparison of the approach proposed in Robin Winter et al's paper. My understanding is that Robin Winter's approach enables building state-of-the-art diffusion models for molecular generation.

A2: To the best of our knowledge, our method achieves state-of-the-art performances in the tasks of sorting 4-digit MNIST images and jigsaw puzzles. It is important to note that our focus is not on developing diffusion models for molecular generation.

评论- Thank you for your response

2024-08-08

I thank the authors' response. However, I respectfully disagree with the authors' statement that the goal of Winter et al's work is to learn a probability distribution over molecules that is SE(3)-invariant. Winter et al's work is quite generic, and they have discussed the symmetric group. The experiments conducted by Winter et al. are far beyond SE(3)-equivariance.

评论- Reply to the response

2024-08-08

As we have explained, although Winter et al.'s work [1] has discussed the symmetric group, 1) it focuses on representation learning instead of building probability distributions over symmetric groups, and 2) it can not directly solve our problem. We will elaborate on these two points below.

Their experiment for the symmetric group aims to learn representations to reconstruct a set of digits represented by a sequence of $D$ -dimensional digits with length $N$ . We aim to learn distributions over symmetric groups to solve jigsaw puzzles, sorting, and traveling salesman problems, which require assigning different probabilities to different group elements.
As aforementioned, their group-invariant encoder $\eta$ produces the same representation for different group elements. Thus, it can not be used to construct a distribution that assigns different probabilities to different group elements. Furthermore, their group-equivariant encoder $\psi$ predicts group actions in one shot instead of performing diffusion. This encoder alone is essentially the same as the one in Gumbel-Sinkhorn [3], which is less effective than ours in several tasks.

In summary, their experiments completely differ from what we have done, and their work can not solve our problem directly. We will discuss the connection in a future version.

PS: Our original statement is that the goal of molecule generation [2] is to learn a probability distribution over molecules that is SE(3)-invariant. We did not comment on the goal of Winter et al’s work [1].

评论- Response

2024-08-08

I thank the authors' further response but I do not think you have addressed my questions.

Winter et al. studied both symmetric groups and Euclidean groups - and also, provided experiments on tasks involving both groups.

I do not see any particular challenge in using Winter et al. 's approach to learn distribution over symmetric groups. It seems to follow a similar approach as Xu et al. GeoLDM, ICML 2023 but using an S_N-equivariant neural network as the encoder and so on.

For your second point, I highly encourage you to read Winter et al's work carefully. Your statement - As aforementioned, their group-invariant encoder $\eta$ produces the same representation for different group elements. Thus, it can not be used to construct a distribution that assigns different probabilities to different group elements - counters to what Winter et al. did. Notice that you need to use an equivariant encoder under different groups to disentangle invariant and equivariant features under different group actions.

评论- Reviewer UKH2 seems incorrect

2024-08-08

I noticed this thread was developing, and I feel obligated to say something.

Reviewer UKH2's comments seem to highly mischaracterize the contributions of Winter et al. to the point of being largely unfair and invalid. In particular,

[1] does not build a diffusion model anywhere in the paper, counter to claims made in the original comment. In fact, [1] is not even a probabilistic model at all. The work is effectively a equivariant AE (that operates by instead mapping to $X/G \times G$ instead of $X$ in the mathematical framework). Note that, as a strict autoencoder, it is not even probabilistically meaningful, which means that we aren't learning a distribution on $X$ anyways.
[1] does not seem to be SOTA. In particular, the work is qualitative and shows the performance of the embeddings in, for example, visualizing data. The work is not quantitative and does not present any major numerical results.
[1] is primarily focused on building group equivariant models. In particular, I emphasize that, instead of modeling a distribution over the group $G$ , one would model a distributions over a group $X$ that $G$ acts on (ie $X = \mathbb{R}^{n \times 3}$ and $G = SO(3) \times S_n$ would be a typical setup for molecules) if [1] was a proper VAE instead of an AE. This is extremely different from the proposed work, which models a distribution on $G$ .
[2] is completely different from [1], is more relevant to the proposed work, and, even then, would not be applicable here. [2] learns a proper probabilistic model (ie a VAE + diffusion), which is completely different from the goal of [1]. [2] would not be applicable here either since we would be looking to model a distribution over $X$ that is equivariant with respect to $G$ . If $X = S_n$ , then there is no such $G$ besides the trivial group which would give the required flexibility. Such a framework would then have to continuously embed $|S_n| = n!$ number of elements into the latent space and decode there, which is computationally infeasible since even the most performant VQVAEs (ie discrete quantized representation VAEs, trained through some variant of straight through estimator) can have a codebook size of 1024.

[1] Winter, Robin, et al. "Unsupervised learning of group invariant and equivariant representations." Advances in Neural Information Processing Systems 35 (2022): 31942-31956.

[2] Xu, Minkai, et al. "Geometric latent diffusion models for 3d molecule generation." International Conference on Machine Learning. PMLR, 2023.

评论- Thanks for the support from Reviewer 1BDf

2024-08-09

We thank Reviewer 1BDf for further clarifying the contributions of Winter et al. [1] and for pointing out the unfairness of Reviewer UKH2's comments. The listed points strongly support our explanation that (1) the method in [1] cannot solve the problem we address, and (2) the experiments in [1] are fundamentally different from ours. Therefore, it is unjust for Reviewer UKH2 to criticize our work for not being state-of-the-art compared to [1] and to request further comparison.

[1] Winter, Robin, et al. "Unsupervised learning of group invariant and equivariant representations." Advances in Neural Information Processing Systems 35 (2022): 31942-31956.

评论- I disagree with the authors and Reviewer 1BDf

2024-08-09

I strongly disagree with what you said. Do you understand what is the idea of GeoLDM by Xu et al. ICML, 2023? You can simply replace E(n)-equivariant neural network with S(n)-equivariant neural network for both encoder and denoising neural network to achieve this goal. For instance, you can follow the similar architecture used by Winter et al.

Based on the results in GeoLDM and latent diffusion models, I can see that the latent diffusion can significantly outperform the proposed approach in this paper.

The authors completely ignore my comments. I have to lower my rating.

I am very familiar with related literature and based on my understanding, this paper does not qualify to be published.

评论- Re Reviewer UKH2: Vague and subjective comments do not help discussion

2024-08-10

Comment 1: Based on the results in GeoLDM and latent diffusion models, I can see that the latent diffusion can significantly outperform the proposed approach in this paper.

You cannot simply imagine a method that does not exist in the literature and claim that "it can significantly outperform the proposed approach in this paper." This subjective and unfounded assertion is not only unfair but also goes against the spirit of scientific research.

Comment 2: You can simply replace E(n)-equivariant neural network with S(n)-equivariant neural network for both encoder and denoising neural network to achieve this goal.

Reviewer 1BDf has already highlighted the challenges associated with this type of latent diffusion approach in his 4th point. We expand on it below in detail.

If you embed the symmetric groups into continuous latent variables, you must carefully design a diffusion prior and a decoder to ensure the decoded output is a valid group element and the decoder induces a valid distribution over symmetric groups.
If you embed the symmetric groups into discrete latent variables, you then need to build a discrete diffusion model that can handle $N!$ states -- this is precisely what our work achieves!

In short, either scenario requires a non-trivial model design that has not existed in the literature. It is not as simple as you vaguely imagined.

Comment 3: The authors completely ignore my comments. I have to lower my rating.

We did not ignore your comments. Both Reviewer 1BDf and ourselves provided detailed responses to your questions, clarifying our approach and understanding. In contrast, your comments have remained vague and have not addressed our specific points.

评论- I strongly and completely disagree

2024-08-12

You cannot simply imagine a method that does not exist in the literature and claim that "it can significantly outperform the proposed approach in this paper." This subjective and unfounded assertion is not only unfair but also goes against the spirit of scientific research.

I disagree, how did you conclude that my comment is subjective and unfounded? This approach is actually a course project and has been implemented and has gotten strong results. It is quite easy to do based on the work of Xu et al. ICML 2023.

I believe to adhere to the spirit of scientific research, you need to know the literature. If you do not know, you should listen to reviewers and read the related papers rather than do what you did. You seems to have never read the paper I mentioned during the rebuttal period based on your response.

In short, either scenario requires a non-trivial model design that has not existed in the literature. It is not as simple as you vaguely imagined.

Again, I disagree. You should try this approach - and I think you can implement this in a couple of hours during the rebuttal period but you completely ignored my comments.

What did you say at the beginning? Winter et al's approach and GeoLDM cannot be used to solve the problem under your study, right?

I recommend a strong rejection of this paper. I believe once the authors tried the approach I mentioned, the authors would feel the approach in this paper is rather weak and may not even be worth publishing.

评论- I respectfully disagree

2024-08-12

The key idea of [2] - latent diffusion models for molecule generation - builds on the idea in [1], as pointed out in [2]. It seems [1] is not directly related to this paper, but the approaches proposed in [1] are quite generic. You can perform symmetric diffusion similar to GeoLDM in [2] by parameterizing the encoder and densoising diffusion using an S(n) equivariant neural network. This approach is much more viable compared to this paper in terms of performance, efficiency, and flexibility.

评论- Stepping in again.

2024-08-12

This is, unfortunately, the second time in this thread that I feel the need to step in.

To reviewer UKH2:

Please clarify your prior position and argument. Both the authors and I have stated explicitly that the prior work that you reference does not work in the problem setting of the submitted paper. In particular, the prior work you reference can enable one to learn a $\mathbf{S_n}$ -equivariant probabilistic model while the submitted work learns a probabilistic model on $\mathbf{S_n}$ , which are two very different things and do not overlap in any meaningful way. For example, this means the submitted paper is not actually building a "symmetric diffusion". For clarity sake, I am asking you to directly address which part of this assertion you think is incorrect (e.g. if the statement does not accurately portray your references, the submitted work, or the differences/similarities between the two bolded ideas).
Please make more concrete points. In particular, please do not reference course projects or intuitions about methods or anything else like this that we (authors, reviewers, ACs) are unable to directly access. For example, if UKH2 has played with the provided code in this paper and has tried a baseline method that they have mentioned (which could explain the motivation behind your comments), it would better support your point if you can provide us all with the code through an anonymous github link or something similar.
Please review in good faith. Please do not accuse the authors (or me) of not reading your references. Please do not lower your score to spite the authors when they disagree with you. Please do not insult the submitted work in a dismissive way (e.g. "weak" and "not worth publishing" vs saying something like "does not adequately compare against relevant baselines", which also happens to be a more precise characterization of your critique).

评论- Your (reviewer 1BDf) position and scientific aspects get dubious

2024-08-13

I think my point on how to solve the problem by parameterizing the encoder and denoising process is clear, and the implementation is also clear - replacing E(n)-equivariant neural networks in GeoLDM with S(n)-equivariant neural networks used by Winter et al. The implementation is not challenging at all, and it is not my responsibility to provide code for this. As a reviewer, my role is to properly assess the paper based on my knowledge, though I cannot guarantee completely adequate.

I lowered the score because I think the paper has major flaws after discussing it with the authors.

Let us recall how you violate good faith: recall that I first raise my comments on another solution to solve the problem, and look at how insulting and lack of scientific merits in your first comment. Then I made my point clear on how to solve the problem in detail, then looked at your insulting language in your latest comment.

I think your position is dubious and lacks good faith. My points are more than clear, before the authors showed me reasonable evidence that the proposed approach is better than using the latent diffusion, I insist on rejecting this paper.

I am not going to respond to any of your further unscientific comments.

评论

2024-08-14

Dear Reviewers and Authors,

Thank you all for your engagement in the review process for this submission. I appreciate the time and effort that has gone into providing detailed feedback and responses. However, I am concerned that the discussion has become unproductive and moved away from a constructive scientific dialogue.

At this point, I believe it would be best to refocus our attention on the core scientific merits of the paper. To that end:

I kindly ask all parties to refrain from personal accusations or comments about others' motivations or competence. Let's maintain professionalism and focus on the content of the work.
Reviewer UKH2: I appreciate your perspective on potential alternative approaches. However, for the purposes of this review, we need to evaluate the paper based on its own merits and comparisons to existing published work, rather than hypothetical unpublished methods. If you have specific concerns about the paper's methodology or results, please articulate them clearly in relation to the content of the submission.
Authors: Thank you for your detailed responses. Moving forward, please continue to address scientific points raised by the reviewers to the best of your ability.
Reviewer 1BDf: I appreciate your efforts to clarify points of contention. Please continue to focus on the scientific aspects of the submission in any further comments.
To all: If there are any remaining substantive scientific points that need clarification regarding the submission, please state them concisely. Otherwise, I will proceed with making a recommendation based on the reviews and discussion thus far.

Let's please maintain a collegial and constructive tone as we conclude this review process. I will be carefully considering all feedback provided as I formulate my final recommendation.

Thank you for your cooperation and dedication to the peer review process.

Best regards,

Area Chair

作者回复

2024-08-06

General Replies

We thank the reviewers for their insightful and constructive comments. We have addressed all the questions in the individual responses. Here, we'd like to highlight a common question from the reviewers:

Q: Why can’t we use $q(X_{t-1}|X_t,X_0)$ and the KL divergence form of the variational bound? (Q1 for Reviewer 1BDf; Q2 for Reviewer fFZa)

A: In our work, we optimize the following negative ELBO:

\mathbb{E}\_{p_{\text{data}}(X_0, \mathcal{X})q(X_{1:T} \vert X_0, \mathcal{X})} \left[-\log p(X_T \vert \mathcal{X}) - \sum_{t=1}^{T}\log\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_t|X_{t-1})}\right] \tag{1}

Many diffusion models will write the negative ELBO in the following equivalent form of KL divergences to reduce the variance [2, 3]:

\mathbb{E}\_{p_{\text{data}}(X_0,\mathcal{X})q(X_{1:T}|X_0)}\left[D_{\mathrm{KL}}(q(X_t|X_0)\parallel p(X_T|\mathcal{X}))+\sum_{t>1}D_{\mathrm{KL}}(q(X_{t-1}|X_t,X_0)\parallel p_{\theta}(X_{t-1}|X_t))-\log p_{\theta}(X_0|X_1)\right] \tag{2}

However, we cannot use this objective for $S_n$ in most cases. In particular, since

q(X_{t-1}|X_t,X_0)=\frac{q(X_{t}|X_{t-1})q(X_{t-1}|X_0)}{q(X_t|X_0)},

we can derive the analytical form of $q(X_{t-1}|X_t,X_0)$ if we know the form of $q(X_t|X_0)$ .

However, $q(X_t|X_0)$ is unavailable for most shuffling methods used in the forward process except for riffle shuffles.

For riffle shuffles, $q(X_t|X_0)$ is actually available and permits efficient sampling [1]. However, $D_{\mathrm{KL}}(q(X_{t-1}|X_t,X_0)\parallel p_{\theta}(X_{t-1}|X_t))$ does not have an analytical form, unlike in common diffusion models. As a result, we cannot use mean/score parameterization [2,4] commonly employed in the continuous setting.

One can rewrite the term as follows and resort to the Monte Carlo (MC) estimation,

\mathbb{E}\_{q(X_t|X_0)}\Big[ D_{\mathrm{KL}}(q(X_{t-1}|X_t,X_0)\parallel p_{\theta}(X_{t-1}|X_t)) \Big] = \mathbb{E}\_{q(X_t|X_0)} \mathbb{E}\_{q(X_{t-1}|X_0)} \left[\frac{q(X_t|X_{t-1})}{q(X_t|X_0)}\cdot\log\frac{q(X_{t-1}|X_t,X_0)}{p_{\theta}(X_{t-1}|X_t)}\right].

Note that $X_t \sim q(X_t|X_0)$ and $X_{t-1}\sim q(X_{t-1}|X_0)$ are drawn independently. However, there is a high chance that $q(X_t|X_{t-1})=0$ for the $X_t$ and $X_{t-1}$ that are sampled. Consequently, if we only draw a few MC samples, the resulting estimator will likely be zero with zero-valued gradients, impeding the optimization of the training objective. Our preliminary experiments also verified that the above MC version of the objective leads to slightly poorer empirical performance.

Training Loss (w. Forward Riffle Shuffle)	Sequence Length	Kendall-Tau	Accuracy (%)	Correct (%)
Eq. (1) (loss in our paper)	15	0.932	82.6	94.5
Eq. (2)	15	0.926	80.7	94.2

[1] Dave Bayer and Persi Diaconis. Trailing the dovetail shuffle to its lair. The Annals of Applied Probability, 2(2):294 – 313, 1992.

[2] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.

[3] Austin, Jacob, et al. "Structured denoising diffusion models in discrete state-spaces." Advances in Neural Information Processing Systems 34 (2021): 17981-17993.

[4] Song, Yang, et al. "Score-Based Generative Modeling through Stochastic Differential Equations." International Conference on Learning Representations.

评论

2024-08-08

Dear Reviewers,

Now that the rebuttal period has ended, please take a moment to review the authors' responses to your initial reviews. Your engagement is crucial for:

Ensuring a fair evaluation process
Improving the quality of final reviews
Fostering meaningful scientific discourse

If you have any questions or need clarification, please post follow-up comments.

Your continued dedication to this process is greatly appreciated.

Best regards,

AC

最终决定Reject

2024-09-25

This paper presents a discrete diffusion model for learning distributions over finite symmetric groups, with applications to combinatorial optimization. Key innovations include using riffle shuffles for the forward process and a generalized Plackett-Luce distribution for reverse transitions.

Reviewers were divided on the technical novelty, with some appreciating the application to permutations, while others found the advances incremental compared to existing $S_n$ -equivariant generative models. The experiments show competitive performance on tasks like sorting and TSP, but concerns about scalability remain.

For future submissions, I recommend:

Clarifying how this approach differs from and addresses challenges unique to $S_n$ -equivariant models.
Expanding the empirical evaluation to larger problem sizes and additional combinatorial tasks.
Providing a more comprehensive theoretical analysis, especially in relation to continuous-time formulations of discrete diffusion.
Exploring ideas from $S_n$ -equivariant models to enhance the proposed framework.

The work shows promise, but substantive revisions are needed to establish its significance. I encourage the authors to continue developing these ideas.