PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; min 6, max 6, std dev 0.0)
Individual ratings: 6, 6, 6, 6
Confidence: 3.3 | Correctness: 2.8 | Contribution: 2.8 | Presentation: 2.8
NeurIPS 2024

Identifiable Object-Centric Representation Learning via Probabilistic Slot Attention

OpenReview | PDF
Submitted: 2024-05-15 | Updated: 2024-11-06
TL;DR

We propose a method to learn identifiable object-centric representations up to a proposed equivalence relation.

Abstract

Keywords
Object-centric learning, Probabilistic slot-attention, Identifiability, latent mixture models

Reviews and Discussion

Review (Rating: 6)

This paper studies the theoretical identifiability of object-centric representations (slots) in object-centric learning (OCL). Prior works study the same problem, but are limited to OCL models with an additive decoder. This work relaxes the constraint to non-additive decoders, which have proven important for scaling to more complex data in recent OCL works (e.g., Transformer decoders, diffusion models). To tackle this generalized setting, the authors propose Probabilistic Slot Attention (PSA), which applies a Gaussian Mixture Model (GMM) to produce slots from each data sample (e.g., an image). The authors demonstrate the effectiveness of PSA both theoretically and empirically, using results on both low-dimensional data and 2D images.

Strengths

  • This paper tackles an important problem -- what assumptions are required to learn identifiable object slots?
  • The paper is generally clearly written and well presented.
  • I appreciate the efforts in experimenting with common OCL image datasets such as CLEVR and ObjectsRoom, which were missing in previous identifiable OCL works.
  • The experimental results with PSA and PSA-Proj (both using additive decoders) are solid.

Weaknesses

I have only one big concern. However, this concern is closely related to the main contribution of this paper (correct me if I am wrong), and I cannot accept the paper if it is not well addressed.

  • This paper claims to be the first work learning identifiable slots with a non-additive decoder (NoA). This is great, as recent OCL works show impressive results using a Transformer-based or a diffusion-based decoder.
  • However, the paper does not really have experimental results supporting this claim. PSA-NoA seems to underperform PSA-Proj consistently on all datasets & metrics.
  • What's worse, even compared to vanilla SA, PSA-NoA still underperforms in FG-ARI (Table 3), and even on some identifiability-related metrics (Table 2). These results are concerning to me, as we are not sure if the proposed algorithm can really scale to recent OCL models with better decoders.

Minor:

  • Please unify the citation format -- for papers that are published at conferences, please use their conference version, e.g., [5] (ICML), [45] (NeurIPS).
  • In Sec.4, the paper claims that PSA can dynamically determine the number of object slots. While Fig.10 in the Appendix shows a few results, I don't think that's enough to claim PSA "offers an elegant solution to this problem". [1] is a recent work that studies this problem and provides in-depth analysis. The authors can conduct experiments following [1]'s setting. But I understand that the dynamic slot number is not the main contribution of this work.
  • In line 51, the authors claim that the computation cost of non-additive decoders is invariant to the number of slots K. Is this true? In recent Transformer-based and diffusion-based decoders, they use cross-attention to condition the reconstruction on slots, and the computation cost of attention is quadratic to the token size, i.e., K in this case.

[1] Fan, Ke, et al. "Adaptive Slot Attention: Object Discovery with Dynamic Slot Number." CVPR. 2024.

Questions

  • Can the authors apply PSA to a better non-additive decoder? For example, the Transformer decoder proposed in [61]. Otherwise, it is hard to assess the contribution of this work.
  • What is the implementation detail of the non-additive decoder "standard convolutional decoder"? How to decode an image from K 1D slots using a CNN decoder?

Minor:

  • How can we apply the theory proposed in this paper to more complex datasets, e.g., real-world COCO images? This is not required for a theory paper, but I'm curious about the authors' thoughts.

Limitations

The authors have adequately addressed the limitations in Sec.7.

Author Response

We thank the reviewer for their detailed comments and constructive feedback. We appreciate the fact that our paper was considered to be clearly written and well presented, and we are glad that our results are perceived to be solid.

"However, the paper does not really have experimental results supporting this claim..."

We respectfully disagree that the results do not support this claim: PSA-NoA and PSA-Proj both include the probabilistic latent setting proposed in our work. Notably, there is a known trade-off between identifiability and expressivity induced by the choice of decoder structure [45]. Depending on the use case, it may be beneficial to combine both latent and additive decoder structures in practice, particularly if the latter introduces useful inductive biases and/or simplifies the optimization problem. In our experiments, we observed that when using PSA in tandem with an additive decoder it is possible to outperform all other baselines. As for non-additive decoder-based experiments, PSA-NoA must be compared directly against SA-NoA for it to be a fair comparison, and we observed it to perform better than SA-NoA while remaining competitive with the remaining baselines.

"Can the authors apply PSA to a better non-additive decoder? For example, the Transformer decoder proposed in [61]. Otherwise, it is hard to assess the contribution of this work."

We agree that the applicability of our framework on large-scale datasets is a crucial evaluation of our theoretical results, but that does not reduce the contribution of our work, which is primarily theoretical. The main focus of this work is to investigate the theoretical identifiability of slot representations and the conditions that ensure this property, rather than provide state-of-the-art results on large-scale datasets. To verify our theoretical claims, we first conduct detailed experiments on controlled datasets, and then extend our demonstrations to unstructured image data. We stress that the synthetic datasets we used are necessary for properly testing our identifiability hypotheses.

One of the main assumptions necessary to prove slot identifiability in our setting is weak injectivity, which is achieved when we use piece-wise linear decoders. In the case of transformer decoders, this assumption is not guaranteed to hold because of the complexity of the attention mechanism (further theory is required here). With that said, we have conducted additional empirical evaluations on the identifiability of slot representations obtained with more complex transformer decoders, which result in an SMCC of 0.73 ± 0.04 and an R2 of 0.55 ± 0.06 on the CLEVR dataset, which is significantly better than SA and all other baselines. For more details and additional experiments please see the general comment at the top.

"What is the implementation detail of the non-additive decoder "standard convolutional decoder"? How to decode an image from K 1D slots using a CNN decoder?"

Thank you for pointing this out; it escaped our attention. We simply concatenate all K slots together and upscale the resolution using four transposed convolutions until we reach the image resolution, with Leaky-ReLU activations. The architecture is very similar to the one used in the original SA work; we will be sure to add the remaining details to the paper, thanks again.
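
For concreteness, below is a minimal PyTorch sketch of such a non-additive decoder; the channel widths, the 8×8 initial feature map, and the class name are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch.nn as nn

class NonAdditiveCNNDecoder(nn.Module):
    """Sketch: decode an image from K slots by concatenating them (assumed sizes)."""
    def __init__(self, num_slots=7, slot_dim=64, img_channels=3):
        super().__init__()
        # Project the concatenated slots to a small spatial feature map.
        self.fc = nn.Linear(num_slots * slot_dim, 128 * 8 * 8)
        # Four transposed convolutions upscale 8x8 -> 128x128, with Leaky-ReLU activations.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.LeakyReLU(),
            nn.ConvTranspose2d(16, img_channels, 4, stride=2, padding=1),
        )

    def forward(self, slots):            # slots: (B, K, D)
        z = slots.flatten(start_dim=1)   # concatenate all K slots: (B, K*D)
        h = self.fc(z).view(-1, 128, 8, 8)
        return self.deconv(h)            # reconstructed image: (B, C, 128, 128)
```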

"How can we apply the theory proposed in this paper to more complex datasets, e.g., real-world COCO images? This is not required for a theory paper, but I'm curious about the authors' thoughts."

We have included experiments on the PascalVOC 2012 dataset and also tested our model with more complex decoders; please refer to the general comment above for details and additional results.

Comment

I thank the author for the rebuttal. The experiments on the Transformer-based slot decoder and the large-scale experiments on real-world datasets are extensive and strong. My main concerns are addressed. Now I recommend acceptance of the paper and have adjusted my score to Weak Accept.

Review (Rating: 6)

This paper addresses the problem of identifiability of object-centric representations. In contrast to prior works which achieve identifiability via assumptions on the generator, this paper explores identifiability via assumptions on the latent distribution. To do this, the authors introduce a probabilistic variant of the popular Slot Attention algorithm and prove theoretically that this method identifies the ground-truth object representations. The authors verify their theory on toy data as well as test their method on high dimensional image data, showing improved performance over baseline methods.

Strengths

  • The authors address an important problem in representation learning. Namely, understanding when learning object-centric representations is theoretically possible.

  • The paper is well written, well structured, and, in general, easy to understand.

  • The authors position their contribution well within the broader representation learning literature and provide a good review of prior work.

  • Section 4 is well written. I particularly appreciated being able to look at Algorithm 1 while reading the section to guide my intuition.

  • Exploring probabilistic constraints for identifiability in object-centric learning is an important problem to understand, thus I think the author’s contribution is of interest for the representation learning community.

  • The authors achieve identifiability by proposing an adaptation to a widely used method, making their method potentially easy to adopt in practice.

  • The authors conduct a solid empirical study and achieve promising results, in particular when coupling their method with structured decoders.

  • The figures in the manuscript are conceptually helpful and aesthetically well done.

Weaknesses

Paper Positioning/Storyline

One of the main issues I have with this work is that I do not think that the paper’s storyline i.e. how the authors motivate and position their contribution, accurately reflects the actual contributions of this work. As I understand, the current pitch of the paper is the following:

Previous works from [1, 2], prove identifiability of object-centric representations by making assumptions on the decoder. Enforcing the assumptions in [1] is not tractable in practice, however, due to scalability issues with the compositional contrast in [1]. In this work, we remedy this by exploring probabilistic constraints for identifiability which yield identifiability theoretically and empirically but do not suffer from the same empirical scalability issues as prior works.

I think this pitch is problematic for the following reasons:

Firstly, it is important to note that the compositional decoders explored in [1] were proven to be a subset of the additive decoders explored in [2]. Consequently, if the ground-truth decoder is compositional and one uses an additive decoder for the inference model, then assuming the assumptions from [2] are met, the inference model is slot identifiable i.e. it will implicitly minimize compositional contrast.

This does not completely dismiss the scalability issues noted by the authors, since as mentioned in Lines 51-52, additive decoders also suffer from some scalability issues. I think, however, that the author’s current discussion of scalability, in particular wrt [1], misses the key nuance discussed above. I would suggest the authors incorporate this discussion into the paper by altering their writing and positioning of their contribution accordingly.

Secondly, the storyline and writing in the manuscript give the impression that one could dispense with decoder structure all together in favor of probabilistic structure on the latent space. As the authors show empirically in Section 6, however, this is not exactly the case. While probabilistic structure gives identifiability gains relative to baselines, without incorporating decoder structure, identifiability drops non-trivially across all metrics.

Moreover, one of the core motivations for object-centric learning is learning representations which generalize compositionally [2, 3]. Such compositional generalization is only possible through decoder structure where additivity is one such structure (see [3] Section 2.). If one only focuses on enforcing structure on the latent distribution, it is not clear to me how such compositional generalization can be achieved.

With all of this being said, I think a more accurate and superior pitch for the contributions in this work should focus on the advantage of using probabilistic and decoder structure in tandem opposed to suggesting that probabilistic structure is somehow superior. Something like:

In this work, we show how identifiability can be achieved via probabilistic constraints on the latent space. We show how such constraints can be naturally and tractably incorporated into Slot Attention. We verify our theory and method on toy data. We then show on image data that our probabilistic structure leads to improved identifiability over unstructured baselines. Furthermore, when coupled with structured decoders, our method yields performance which outperforms both probabilistic structure and decoder structure in isolation.

For the reasons stated above, I would encourage the authors to alter their writing in the manuscript to adhere closer to a storyline along the lines of the one presented above. As things currently stand, the messaging in the paper feels a bit misleading and over-claiming.


Theory Explanation

I would have appreciated a short paragraph giving some intuition on how exactly the probabilistic constraints imposed by the authors sufficiently restrict the problem such that slot identifiability is possible. Specifically, unsupervised identifiability via probabilistic structure is challenging and, in most cases, impossible [4., 5.]. Thus, I think it would be helpful if the authors could explain which structure in their method is key towards overcoming these well known unidentifiability issues. For example, is it the permutation invariance, the complex aggregate posterior etc.?


Metrics

The authors use two main metrics to validate identifiability: the slot identifiability score (SIS) from [1] and a new metric, slot MCC (SMCC). I found the authors' explanations of the differences between these metrics unclear, which made it difficult to interpret some of the empirical results. My understanding is that SMCC is just a linear/affine version of SIS in terms of the predictor fit between latents. Thus, it is a bit unclear to me why the values for SIS should be so much lower than SMCC in the experiments on toy data in Section 6.


Experiments

The authors' experiments on image data focus primarily on simple decoders as opposed to e.g. Transformers and visually complex datasets. To assess the scalability of the authors' method it would be important to test the method on more complex models/datasets. I view this as a more minor weakness of this work, however, given that the contribution is largely theoretical.


Appendix

In section A. of the Appendix, the authors review some definitions from prior works e.g. in [1]. These definitions, however, are presented in an informal, imprecise way. I would encourage the authors to be precise in this section if they are going to include definitions from prior works.


Bibliography

  1. Provably Learning Object-Centric Representations (https://arxiv.org/abs/2305.14229)

  2. Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation (https://arxiv.org/abs/2307.02598)

  3. Provable Compositional Generalization for Object-Centric Learning (https://arxiv.org/abs/2310.05327)

  4. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations (https://arxiv.org/abs/1811.12359)

  5. Variational Autoencoders and Nonlinear ICA: A Unifying Framework (https://arxiv.org/abs/1907.04809)

Questions

  • Do the authors view their method as being a replacement for structured decoders in object-centric learning or do they view the method as being better used in tandem with structured decoders?

  • What is the key theoretical assumption/constraint which allows for identifiability to be possible?

  • What are the key differences between SMCC and SIS?

  • Do the authors have an explanation for the low SIS score on the toy data experiments?

  • Does “R2” refer to SIS when used or is this a different score?

  • Do the authors have any intuition about how well their method would perform for more complex models/datasets?

Limitations

The authors include a limitation section which discusses some limitations of their theory. I would encourage the authors to discuss some limitations in their experiments, as discussed above in the “Weaknesses” section. I would also include some discussion of the limitations of purely probabilistic structure for object-centric learning as it pertains to compositional generalization, as discussed above.

Author Response

We thank the reviewer for their very detailed comments and constructive feedback which helped improve our paper significantly! We also appreciate the fact that our paper was perceived to be well-written, easy to understand, and of interest to the community.

"One of the main issues I have with this work is that I do not think that the paper’s storyline ..."

We thank the reviewer for their very insightful and constructive suggestion regarding the messaging of the paper. We largely agree with the overall sentiment and will incorporate the suggested changes which will significantly improve the paper’s pitch. To clarify, it is not our intention to compete with or replace additive decoders as a model choice since they undoubtedly provide useful inductive biases for object-centric learning as ours and many other previous works show.

We further remark that before our work there was a lack of explanatory theory for why state-of-the-art results were able to be obtained using non-additive autoregressive Transformers (DINOSAUR [60]) and/or diffusion-based decoders (Slot-Diffusion [r2]). We showed that by viewing slot attention through a probabilistic graphical modelling perspective it is possible to prove slot identifiability for non-additive decoders using proof techniques from identifiable generative modelling. Given that vanilla slot attention can be seen as a simplified version of probabilistic slot attention, akin to the relationship between soft k-means and GMMs, our theoretical results suggest why non-additive decoder structures can work well given the appropriate latent structure and inference procedure are in place. With that said, there is an identifiability and expressivity trade-off [45] induced by the choice of decoder structure, so depending on the use case it may indeed be advantageous to combine both latent and additive decoder structure!

"I would have appreciated a short paragraph giving some intuition on how exactly the probabilistic ..."

We would like to point out that we do discuss the intuition of our identifiability results in depth in the appendix, just before the proofs. As opposed to [4, 5], we use a more expressive GMM latent distribution followed by a weakly injective piece-wise linear decoder which in combination ensures that the slot representations are identifiable.

"The authors use two main metrics to validate identifiability: the slot identifiability score (SIS) from ..."

SIS is a relative R2 (coefficient of determination) based measure, capturing the ratio of variance in a dependent variable explained by an independent variable, relative to baseline models trained on the slot representations of a dynamically updating model. As for SMCC, it measures the linear/affine relationship between permuted slots and features. We provide more detailed explanations in Appendix F.

"The authors experiments on image data focus primarily on simple decoders opposed ..."

Although we agree that assessing the performance of probabilistic slot attention on more complex datasets would be useful, the controlled scenarios and datasets we used are necessary for properly testing our theoretical identifiability hypotheses, which is the objective of this work.

Regarding scalability concerns of probabilistic slot attention (PSA), we stress that PSA retains the O(TNKD) computational complexity of vanilla slot attention and as such enjoys the same scalability properties. We did evaluate our model on large-scale datasets and with more complex decoders; we discuss them in the general comment.

"Do the authors view their method as being a replacement for structured decoders in object-centric learning or do they view the method as being better used in tandem with structured decoders?"

From a theoretical identifiability viewpoint, latent distributional structure is a sufficient requirement as long as the decoder is piece-wise linear. However, additive decoders provide strong, useful inductive biases and may be easier to optimize relative to a probabilistic model with an arbitrary decoder. In our experiments, we observed that using our approach in tandem with an additive decoder tends to outperform other models.

"Do the authors have an explanation for the low SIS score on the toy data experiments?"

SIS was proposed by [1] and it uses the relative R2 scores, where baseline models are trained on the slot representations of a dynamically updating model. Due to this, it tends to exhibit high variability as illustrated in Appendix F (we use the official implementation on GitHub for all our analysis). We believe this instability in the baseline model, which is also pointed out by others on their GitHub repository, could be due to the validation dataset size resulting in lower scores.

"Does “R2” refer to SIS when used or is this a different score?"

R2 is the coefficient of determination, while SIS is a relative R2 score. R2 is proportional to SIS, but due to the instability of SIS, we use R2 for all imaging experiments.
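
As a rough illustration of how such an R2 score could be computed from matched slots (a hypothetical sketch using scikit-learn, not the paper's evaluation code; the helper name and the assumption that slots are already matched to objects are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def slot_r2(gt_factors: np.ndarray, matched_slots: np.ndarray) -> float:
    """Fit a linear map from matched slot representations to ground-truth object
    factors and report the coefficient of determination (R2), averaged over factors."""
    reg = LinearRegression().fit(matched_slots, gt_factors)
    return r2_score(gt_factors, reg.predict(matched_slots))
```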

"Do the authors have any intuition about how well their method would perform for more complex models/datasets?"

The method can be scaled to large-scale datasets using the approaches outlined in [60], with a convolutional or transformer-based decoder, since the computational complexity is the same as vanilla slot attention (O(TNKD)). Please refer to the general comment for results on complex datasets using more powerful decoders.

[r2] Wu, Z., Hu, J., Lu, W., Gilitschenski, I. and Garg, A., 2023. Slotdiffusion: Object-centric generative modeling with diffusion models. Advances in Neural Information Processing Systems, 36, pp.50932-50958.

Comment

I thank the authors for their reply.


"One of the main issues I have with this work is that I do not think that the paper’s storyline ..."

I appreciate the authors acknowledgement of my concerns and look forward to reading an updated version incorporating this feedback.

Regarding the connection between slot attention and Transformer’s success in object-centric learning:

If this is the messaging of the paper that the authors wish to convey, then I would encourage them to rewrite the introduction accordingly. This messaging was not clear to me from reading the paper. Moreover, I am not completely convinced by this explanation given that decoder structure does indeed play an important role empirically. Thus, I find it more likely that the success of Transformers is due to a combination of probabilistic and decoder structure (i.e. the inductive biases of the Transformer). In other words, I do not think the authors results on their own provide a complete explanation of Transformer's success in object-centric learning but may provide some evidence to this end.


"I would have appreciated a short paragraph giving some intuition on how exactly the probabilistic …"

Thank you for the paragraph reference. Based on this, I would like to make a related point on the clarity of the theory. Namely, I do not think that the authors sufficiently highlight the piece-wise linear structure needed on the decoder for their theoretical result.

Up until page 7, the paper gives the idea that no decoder structure is needed for the theory. For example, the authors state in Section 2 that their theoretical contribution falls into the category of “(iii) imposing structure in the latent space via distributional assumptions.” This is not exactly correct, as the piece-wise linear structure is indeed decoder structure. I understand that the authors may view this structure as more easily implemented than e.g. additivity, and thus possibly less noteworthy. However, this piecewise linear structure is a key aspect of the theoretical contribution of this paper, and moreover, is important for contextualizing the authors theoretical contribution relative to prior identifiability results. Thus, I think this structure should be mentioned a bit more transparently in the introduction, and discussed with more precision in the related work. Otherwise, the messaging of the paper once again feels misleading.


"The authors use two main metrics to validate identifiability: the slot identifiability score (SIS) from ..."

I appreciate the authors reply on this point. I am curious, however, what the procedure was to resolve the matching problem when computing SMCC. Specifically, slots are locally permutation invariant opposed to globally due to the local permutation invariance of the decoder. Thus, how did the authors resolve this local permutation invariance when computing SMCC?

Comment

We thank the reviewer for engaging with us and for providing feedback.

“If this is the messaging of the paper that the authors wish to convey, then I would encourage them to rewrite the introduction accordingly…”

We are glad to hear that and look forward to incorporating the suggested changes, thanks again.

To clarify, our discussion of slot attention and Transformers is in response to the reviewers' requests and does not represent a change in the core message of the paper. We also acknowledge that further theory is required when using non-additive Transformer-based decoders as technically the weak injectivity property our proofs rely upon is not known/guaranteed to hold for Transformers due to the complexity of the attention mechanism (c.f. response to Reviewer HxYt). We believe that extending our theoretical results by relaxing the weak injectivity decoder assumption offers a promising direction for future research.

“I find it more likely that the success of Transformers is due to a combination of probabilistic and decoder structure (i.e. the inductive biases of the Transformer).”

We generally agree as there is a known trade-off between identifiability and expressivity induced by the choice of decoder structure [45]. As such it may be beneficial to combine both latent and decoder structures, particularly if the latter introduces useful inductive biases and/or simplifies the optimization problem. In our experiments, we observe that the combination of both typically yields better results.

“I do not think that the authors sufficiently highlight the piece-wise linear structure needed on the decoder for their theoretical result ”

We understand the reviewer's concern but respectfully disagree with their conclusion. Our theoretical results show that the decoder additivity constraint is not required if the decoder is piecewise linear and the latent space is GMM distributed. Although any decoder possesses some structure in terms of its architecture, an MLP decoder with LeakyReLU activations (satisfying weak injectivity) does not impose structure in the same sense as an additive MLP decoder, as the latter is a stronger restriction on the functional class and departs from the standard MLPs commonly used outside of object-centric learning. We will emphasize the weak injectivity assumption and the implications of piecewise decoders earlier in the introduction.

“How did the authors resolve this local permutation invariance when computing SMCC?”

As detailed in Appendix F, we used Hungarian matching to resolve this when computing SMCC.

Additionally, we believe our new experiments on PascalVOC address the reviewer’s concerns regarding the scalability of probabilistic slot attention.

Comment

Thank you for the reply!


“I do not think that the authors sufficiently highlight the piece-wise linear structure needed on the decoder for their theoretical result ”

Which conclusion is being disagreed with here? Decoder structure is assumed in the theory. I presume we agree on this? I agree, as stated, that this is weaker structure than additivity, however, it is stronger than the decoder being a diffeomorphism, which is generally all that is assumed in identifiability results which “(iii) impose structure in the latent space via distributional assumptions.”. Therefore, I do not think it makes sense for the authors to group their theoretical contribution in this category. I hope this is more clear now.


“How did the authors resolve this local permutation invariance when computing SMCC?”

Apologies if my question was unclear. My concern is not on scalability. I will rephrase the question the following way: Do the authors agree that the permutation ambiguity between ground-truth slots and inferred slots is "local" i.e. can change for each data point? If so, then do they agree that a different Hungarian matching problem needs to be solved for every datapoint opposed to a "global" matching as is typically done in disentanglement? If so, then how was this "local" matching problem solved?

Comment

We apologise for the confusion.

“Which conclusion is being disagreed with here? Decoder structure is assumed in the theory. I presume we agree on this? I agree, as stated, that this is weaker structure than additivity, however, it is stronger than the decoder being a diffeomorphism...”

We agree that piecewise decoder structure is assumed, but we stress that it is a weaker assumption than both additive and diffeomorphic decoders and materializes as e.g. standard MLPs with LeakyReLU activations. Diffeomorphic decoders assume bijectivity of the mixing function, whereas the piecewise decoders we use need only be weakly injective for our proofs.

To improve clarity, we will adjust the relevant sentence in the paper (Line 76) to read: “In this work, we prove an identifiability result via strategy (iii) but within an object-centric learning context, where the latent variables are a set of object slots [50], and piecewise linear mixing functions are employed.” We will also make it clearer earlier on in the introduction that piecewise decoders are necessary for our theoretical results.

“how was this "local" matching problem solved?”

Yes as detailed in Appendix F, we apply Hungarian matching for every data point across the estimated slots.

To clarify, our previous statement about scalability was not related to this question but a general reminder.

Comment

Thank you for the prompt reply!


“How did the authors resolve this local permutation invariance when computing SMCC?”

My apologies if I am missing something here, however, I have read appendix F a few times now and in the past, and am still a bit confused.

When you conduct Hungarian matching "locally" at every datapoint, you must be matching slots based on some criteria. What is this criteria?

Comment

We apologise for the confusion; we use the Euclidean distance to match slot representations, resulting in a K × K cost matrix C, which we then use to solve a linear sum assignment problem as is done in Hungarian matching. For more details, we will be providing an implementation at a later stage.
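
A minimal sketch of this matching step, assuming SciPy's linear_sum_assignment as the Hungarian solver (the helper name and array shapes are illustrative, not taken from the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(gt_slots: np.ndarray, pred_slots: np.ndarray) -> np.ndarray:
    """Per data point: align K predicted slots to K ground-truth slots by
    minimising total pairwise Euclidean distance (Hungarian matching)."""
    # Cost matrix C[i, j] = ||gt_slots[i] - pred_slots[j]||_2, shape (K, K).
    cost = np.linalg.norm(gt_slots[:, None, :] - pred_slots[None, :, :], axis=-1)
    _, col_ind = linear_sum_assignment(cost)
    return pred_slots[col_ind]  # predicted slots reordered to align with ground truth
```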

Comment

Thank you for the reply!

I would encourage the authors to include these details in the appendix opposed to just a code implementation. As far as I know, matching based on Euclidean distance is non-standard. In future iterations of this work, the authors can consider including experiments to test the effectiveness of this matching protocol compared to other methods such as matching based on slot-wise mask, or matching based on R2 score (determined in an online fashion).

Comment

Thanks for the great suggestion; an in-depth study of the effects of the distance function in the matching algorithm is definitely valuable future work. We will expand Appendix F with more details about the metric and its implementation.

Comment

Great! I thank the authors for their engagement! I will keep my score, thus recommending acceptance. Due to the issues regarding the messaging of the paper, however, I do not feel comfortable increasing my score any higher.

Comment

Thank you for your thoughtful consideration and for recommending our paper for acceptance. We very much appreciate your engagement and valuable feedback.

Concerning the perceived issues with the paper’s messaging, we believe these can be easily addressed given that we have reached an agreement in our discussion. Below is a summary which we will use to inform the necessary minor edits to the final version.

  1. If the ground-truth decoder is compositional, and we use an additive decoder [2] then we have slot identifiability given the “sufficient non-linearity” assumption is met.

  2. This means that compositional contrast [1] is implicitly minimized, bypassing the major scalability concerns of minimizing compositional contrast as part of the loss function to guarantee slot identifiability (as observed empirically).

  3. Further, we note the following two important facts:

    • Additive decoders scale linearly in the number of slots K, so some less significant scalability issues remain relative to state-of-the-art non-additive decoders (e.g. using Transformers).

    • The additive decoders studied by [2] are not expressive enough to represent the “masked decoders” typically used in object-centric representation learning, which stems from the normalization of the alpha masks. This means some care must be taken in extrapolating the results in [2] to the models we use in practice.

In this work, we show how slot identifiability can be achieved via probabilistic constraints on the latent space and piecewise decoders. These piecewise decoders manifest as e.g. standard MLPs with LeakyReLU activations and are generally less restrictive than additive decoders. When coupling probabilistic and additive decoder structures, we observe further performance improvements relative to either one in isolation.

Thanks again,

The Authors

Review (Rating: 6)

Solving the problem of identifiability is necessary to find consistent and interpretable factors of variation. There are two approaches to do so: a) place restrictions on the decoders, and b) impose distributional constraints on the latent space. This work takes the second approach and aims to impose a GMM on the latent space. The paper does this by proposing a modification to the vanilla slot attention framework that they call probabilistic slot attention. Under this framework the paper shows theoretical and empirical identifiability of slots.

Strengths

  • The proposed probabilistic slot attention framework is an intuitive extension of the vanilla slot attention with the updates in each iteration resembling the familiar EM algorithm.
  • The proposed framework offers a possible solution to the problem of dynamically determining the required number of slots.
  • This is the first work to experiment with imposing a distribution on the latent space, whereas prior works focus either on the generator or the decoder.
  • Recovering the latent space up to an affine transformation is shown in the synthetic modelling scenario, with theoretical guarantees provided.
  • The paper studies an important premise – the theoretical understanding of object-centric representations. The paper is written well, with clear motivations for the proposed contributions.

Weaknesses

  • The paper states that the framework allows for tractably sampling from the aggregate posterior distribution and using it for scene composition tasks; however, this is not empirically qualified anywhere.
  • Experiments only validate the theory on simple synthetic datasets. Testing on more diverse and realistic data would better demonstrate applicability, though the evaluation would also be more challenging. Generally speaking, I would be concerned about the scalability of such an approach leveraging GMMs. I understand, however, that the objective of this work is to theoretically study the identifiability of object-centric representations under less-constrained settings compared to previous work. Perhaps a short discussion on how such an approach can be scaled would be nice to see.
  • Table 1 lists β-disentanglement and weak injectivity as core assumptions. While the former is common to all other related methods, weak injectivity is newly introduced. The implications of this assumption are hence important, but are missing from the paper.
  • In fig. 3, it is not clear what experiment the latents (x-axis) correspond to.
  • It might be helpful to visit slot-identifiability in the related work section, considering its literature is the most closely related to this work.
  • Adequate details about the encoder and decoder in the synthetic modelling scenario have not been provided.

Minor/editorial

  • Typo in line 769: "Obsereved" -> "observed"; typo in line 234: "emphirically" -> "empirically".
  • Some references are repeated, e.g., 36-37. Please check carefully across the full list.

Questions

  • Could you elaborate on assumption 6 and why it is not necessary in the current work?
  • In Algorithm 1, line 6, where the attention A_nk is calculated, the mean is calculated as W_q μ(t)_k but the variance is not calculated as W_q σ_k²?
  • Should we not return π(t) at the end of the algorithm?
  • The work explicitly mentions that having additive decoders is not a necessity of the current work, but experimental results are shown only with additive decoders. Perhaps it may be enlightening to see experiments akin to [45] for non-additive decoders (transformer-based auto-regressive decoders). There are no results to go along with the choice of a convolutional decoder as the non-additive variant mentioned in L277. Did I miss something here?
  • R2 score is reported in Sec 6 without definition. I assume this is the correlation? This is defined as MCC in the paper.

Limitations

The statistical distributional assumption on the latent space precludes identification of any causal dependencies between objects, which could be made explicit in the "Limitations and Future work" section.

Author Response

We thank the reviewer for their thoughtful, detailed comments and constructive feedback. We greatly appreciate the positive outlook and the fact that our work was found to be well-written, well-motivated and novel.

"The paper states that the framework allows for tractably sampling from the aggregate posterior ..."

We indeed show that it is possible (Lemma 1) to use the aggregate posterior for compositional tasks, but we felt that it does not meaningfully add to the paper's main theoretical identifiability contributions. We have now included some preliminary results for demonstration purposes (see pdf). Performing a more comprehensive compositional analysis of the aggregate posterior is certainly valuable and warrants dedicated investigation in future work.
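
To make this concrete, here is a hypothetical sketch of compositional sampling from the aggregate posterior, assuming per-datapoint slot means, (diagonal) standard deviations, and mixing weights have been collected over a dataset; the function name and tensor shapes are our assumptions, not the paper's implementation.

```python
import torch

def sample_compositional_slots(mus, sigmas, pis, num_slots):
    """Treat the aggregate posterior as one large Gaussian mixture over all
    (datapoint, slot) components and draw each new slot independently, so a
    sampled slot set can mix components originating from different images.
    mus, sigmas: (N, K, D) per-datapoint slot means / std-devs; pis: (N, K) mixing weights."""
    N, K, D = mus.shape
    flat_mu = mus.reshape(N * K, D)
    flat_sigma = sigmas.reshape(N * K, D)
    flat_pi = (pis / N).reshape(N * K)                    # aggregate mixture weights (sum to 1)
    comp = torch.multinomial(flat_pi, num_slots, replacement=True)
    slots = flat_mu[comp] + flat_sigma[comp] * torch.randn(num_slots, D)
    return slots                                          # (num_slots, D), ready for the decoder
```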

"Experiments only validate the theory on simple synthetic datasets. Testing on more diverse and realistic ..."

As rightly pointed out by the reviewer, this work aims to study theoretical identifiability of slot representations and the conditions that ensure this property, rather than provide state-of-the-art results on large-scale datasets. To verify our theoretical claims, we first conduct detailed experiments on controlled datasets and then extend our demonstrations to unstructured image data. We stress that the synthetic datasets we used are necessary for properly testing our identifiability hypotheses.

Regarding scalability concerns of probabilistic slot attention (PSA), we emphasise that PSA retains the O(TNKD) computational complexity of vanilla slot attention, where T denotes the number of attention iterations, N the number of input vectors, K the number of slots and D the slot/input dimension. The additional operations we introduce for calculating slot mixing coefficients and slot variances (under a diagonal slot covariance structure) have complexities of O(NK) and O(NKD) respectively, which do not alter the dominant term. Furthermore, when used in conjunction with additive decoder-based models, PSA can reduce computational complexity by pruning inactive slots via automatic relevance determination (ARD) as outlined in Section 4.

Finally, we have now demonstrated the applicability of PSA to transformer-based decoders - please refer to the general comment above for details.

"Table 1 lists -disentanglement and weak injectivity as core assumptions. While the former ..."

We have included a discussion on the weak injectivity assumption in the remark just below it - we’ll also include a similar discussion in the main text. In summary, weak injectivity ensures that a mixing function f_d: (i) is injective in a small neighbourhood around a specific point x_0 ∈ X – meaning each point in this neighbourhood maps back to exactly one point in the latent space Z; and (ii) while f_d may not be globally injective, the set of points in X that map back to an infinite number of points in Z (non-injective points) is almost non-existent in terms of the Lebesgue measure on the image of Z under f_d. This assumption is generally satisfied when using Leaky-ReLU networks with randomly initialized weights (Appendix C).

"In fig. 3, it is not clear what experiment the latents (x-axis) correspond to."

Our apologies for the confusion. Figure 3 is a simple illustrative example of an aggregate Gaussian mixture density, it is there to provide the reader with a conceptual intuition and does not correspond to an experimental setting.

"Adequate details about the encoder and decoder in the synthetic modelling scenario have not been provided."

We thank the reviewer for pointing this out as it escaped our attention. We have now added the architectural details for all our models in the appendix - all the code will be made available also.

"In Algorithm 1, line 6, where attention A is calculated, mean is calculated as WqμW_q\mu but variance is not calculated as Wqσ2W_q \sigma^2?"

This suggestion is a valid design choice if either the weights W_q are constrained or we use an activation function to ensure all entries in W_q σ² remain positive. However, since σ(t)², at attention iteration t, already has an indirect dependency on W_q through μ(t−1), we omit this style of projection of the variances for simplicity.
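
For illustration, a hypothetical sketch of the resulting responsibility computation with the slot means projected by W_q while the variances are left unprojected (a simplified reading of Algorithm 1 under a diagonal-covariance assumption; the function name and shapes are ours):

```python
import torch

def psa_attention(inputs, mu, sigma2, pi, W_q, W_k):
    """Gaussian mixture responsibilities A[n, k] over slots for each input feature.
    inputs: (N, D) encoder features; mu, sigma2: (K, D) slot means/variances; pi: (K,)."""
    q = mu @ W_q.T                          # project slot means only: (K, D)
    k = inputs @ W_k.T                      # project input features: (N, D)
    # log N(k_n; q_k, diag(sigma2_k)) up to constants that cancel in the softmax.
    log_prob = -0.5 * (((k[:, None, :] - q[None, :, :]) ** 2) / sigma2[None]
                       + torch.log(sigma2[None])).sum(-1)          # (N, K)
    return torch.softmax(log_prob + torch.log(pi)[None], dim=-1)   # responsibilities A_{nk}
```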

"The work explicitly mentions that having additive decoders is not a necessity of the current work, but any ..."

This is unfortunately incorrect as our "NoA" model variants do in fact correspond to a non-additive convolutional decoder as stated in Section 6. Switching these with more powerful autoregressive transformer-based decoders would possibly improve the results but would constitute an unfair comparison with our baselines. For new experiments please refer to the general comment.

"R2 score is reported in Sec 6 without definition. I assume this is the correlation? This is defined as MCC in the paper."

The R2 score is the coefficient of determination; it is proportional to correlation, which can be empirically observed in our experiments. The SIS score, as introduced in [5], is a relative measure of R2.

"Should we not return \pi at the end of the algorithm?"

Yes thanks for pointing this out, we have now corrected it.

"Could you elaborate on assumption 6 and why it is not necessary in the current work?"

Thanks, we will add a remark explaining this in the paper. Object sufficiency is crucial when learning grounded object representations [43]. Here we do not focus on grounding so strict object sufficiency is technically not required.


Comment

Dear Reviewer 27BN,

As the author-reviewer discussion period is soon coming to a close, we kindly ask the reviewer to take the opportunity to engage with us. We sincerely appreciate the time and effort the reviewer has already contributed to the review of our work and hope our thoughtful rebuttal addresses your concerns.

Best wishes, The Authors

Comment

I thank the authors for the responses, and the new results in the common response. Please find my responses below:

  • In the new results in the common response, it appears that the proposed method uses the training strategies of SPOT. The baselines could also benefit from this, after all. Shouldn't this be the fair comparison?
  • Since PSA Transformer is included, a natural comparison is that of SA Transformer. Is there a reason why this was not included?
  • I appreciate the qualitative results on PASCAL VOC.
  • (Minor) In the attached pdf, compositional generation is shown using PSA, it would have been nice to see qualitative comparison with other SA-based efforts that allow compositional generation. I understand this is nitpicky, considering the limited space. But this would have been useful for completeness.
  • While I understand the computational complexity discussion, I would have liked to see wall-clock times of training with PSA as opposed to vanilla SA, at least in approx values.

Having said the above, I do see the strengths of the paper mentioned in my original review, and stay with WA as my decision at this time.

Comment

We thank the reviewer for engaging with us and for the feedback.

"In the new results in the common response, it appears that the proposed method uses the training strategies of SPOT. The baselines could also benefit from this, after all. Shouldn't this be the fair comparison?"

We do have a fair baseline comparison as all the methods we trained used the same strategies; please refer to rows SA MLP (w/ DINO) and SA MLP (w/ DINO)‡.

"Since PSA Transformer is included, a natural comparison is that of SA Transformer. Is there a reason why this was not included?"

As explained in the general comment, we were previously not able to complete the PSA Transformer training run (≈15K steps) due to time constraints, so it would not have been fair to compare directly with a fully trained SA Transformer (250K steps [60]). The main point was to show that PSA is scalable and that using a more powerful Transformer decoder outperforms the MLP variants. Please find below the updated results for SA and PSA Transformers on the PascalVOC dataset (SA Transformer results are based on our reimplementation of the DINOSAUR strategy).

Model                           mBO_i    mBO_c
DINOSAUR Transformer [60]       0.44     0.512
Ours:
SA Transformer (w/ DINO)        0.427    0.503
SA Transformer (w/ DINO)‡       0.440    0.512
PSA Transformer (w/ DINO)       0.436    0.515
PSA Transformer (w/ DINO)‡      0.447    0.521

"I appreciate the qualitative results on PASCAL VOC."

We are glad to hear that!

"(Minor) In the attached pdf, compositional generation is shown using PSA, it would have been nice to see qualitative comparison with other SA-based efforts that allow compositional generation. I understand this is nitpick, considering the limited space. But this would have been useful for completeness."

We emphasize that compositional generation is not the focus of the paper but is a byproduct of our theoretical framework which would be interesting to explore further in future work.

"While I understand the computational complexity discussion, I would have liked to see wall-clock times of training with PSA as opposed to vanilla SA, at least in approx values."

No problem, please find below the training iteration speeds for both models on PascalVOC with a single RTX 3090:

  • SA runs at 2.31 iterations per second
  • PSA runs at 2.23 iterations per second

This results in PSA being approximately 3 seconds slower per training epoch, which is quite negligible.

Review (Rating: 6)

This paper proposes a probabilistic slot attention method that can learn identifiable object-centric representations. Compared with former work on identifiable object-centric representation methods, the proposed model can scale slot-based methods to high-dimensional images. Both theoretical analysis and experiments verify the effectiveness of the proposed model.

Strengths

  1. This paper proposes a novel probabilistic slot attention model, which can learn identifiable object-centric representations.

  2. The proposed model can scale to high-dimensional image datasets.

  3. The proposed model appears to be solid.

  4. The paper is well written and reads well.

Weaknesses

  1. The experiments are only conducted on two toy datasets.

Questions

  1. How much has the computational complexity increased?

Limitations

N/a

Author Response

We thank the reviewer for their effort and overall positive outlook on our paper. We are encouraged to read that our work is perceived as solid, novel, and well-written. Please see our responses to the questions raised below.

"the experiments are only take on two toy datasets."

We kindly remind the reviewer that the objective of our work is to study the theoretical identifiability of slot representations and the conditions that ensure this property, rather than to pursue state-of-the-art empirical results. Understanding when object-centric representations can theoretically be identified is crucial for scaling slot-based methods to high-dimensional images with correctness guarantees. To verify our theoretical results, we first conduct detailed experiments on controlled datasets and then extend our demonstrations to unstructured image data. We stress that the synthetic datasets we used are necessary for properly testing our identifiability hypotheses.

Before our work, there was a lack of explanatory theory for why state-of-the-art results were able to be obtained using non-additive autoregressive Transformers (DINOSAUR [60]) and/or diffusion-based decoders (Slot-Diffusion [r2]). We showed that by viewing slot attention through a probabilistic graphical modelling perspective it is possible to prove slot identifiability for non-additive decoders using proof techniques from the identifiable generative modelling literature. Given that vanilla slot attention can be seen as a simplified version of probabilistic slot attention, akin to the relationship between soft k-means and GMMs, our theoretical results suggest why non-additive decoder structures can work well given the appropriate latent structure and inference procedure are in place.

Nonetheless, we have now evaluated our method on real-world large-scale datasets and using more powerful decoders to demonstrate that our method also scales well - please find the details of our experiments in the general comment at the top.

"How much computational complexity has increased?"

We emphasise that probabilistic slot attention (PSA) retains the O(TNKD) computational complexity of vanilla slot attention, where T denotes the number of attention iterations, N the number of input vectors, K the number of slots and D the slot/input dimension. The additional operations we introduce for calculating slot mixing coefficients and slot variances (under a diagonal slot covariance structure) have complexities of O(NK) and O(NKD) respectively, which do not alter the dominant term. Furthermore, when used in conjunction with additive decoder-based models, PSA can reduce computational complexity by pruning inactive slots via automatic relevance determination (ARD) as outlined in Section 4.

Comment

Dear Reviewer 11Sk,

As the author-reviewer discussion period is soon coming to a close, we kindly ask the reviewer to take the opportunity to engage with us. We sincerely appreciate the time and effort the reviewer has already contributed to the review of our work and hope our thoughtful rebuttal addresses your concerns.

Best wishes, The Authors

Comment

Experiments conducted by the authors to verify the validity of the model on a larger dataset have resolved my doubts and I maintain my score.

Comment

Thank you for engaging with us, and we are very glad to hear our new experiments addressed all your concerns.

We appreciate your response and the reconsideration of the score. Could you kindly elaborate on why you remain sceptical, giving only a 'weak accept' rather than a higher score? Are there any issues that we may not have addressed in our rebuttals? This would be very helpful.

Many thanks for taking the time, much appreciated.

Author Response (General Comment)

We extend our thanks to all the reviewers for their time and constructive feedback which has undoubtedly helped improve the paper. We are pleased that the work was perceived to be well-written, well-presented, and novel, with solid results and of interest to the community.

In the following, we highlight the main clarifications of our work raised by multiple reviewers and provide additional large-scale experimental results addressing all requests (see Table below and the attached pdf).

General Clarifications:

As correctly noted by all reviewers, the primary focus of our work is theoretical. To the best of our knowledge, before our work, there was a lack of explanatory theory for why state-of-the-art results were able to be obtained using non-additive autoregressive Transformers (DINOSAUR [60]) and/or diffusion-based decoders (Slot-Diffusion [r2]). We showed that by viewing slot attention through a probabilistic graphical modelling perspective it is possible to prove slot identifiability for non-additive decoders using proof techniques from identifiable generative modelling. Given that vanilla slot attention can be seen as a simplified version of probabilistic slot attention, akin to the relationship between soft k-means and GMMs, our theoretical results suggest why non-additive decoder structures can work well given the appropriate latent structure and inference procedure are in place.

However, there is a trade-off between identifiability and expressivity induced by the choice of decoder structure [45]. Depending on the use case, it may be beneficial to combine both latent and additive decoder structures in practice, particularly if the latter introduces useful inductive biases and/or simplifies the optimization problem.

We stress that probabilistic slot attention (PSA) retains the O(TNKD) computational complexity of vanilla slot attention (SA). The additional operations we introduce for calculating slot mixing coefficients and slot variances (under a diagonal slot covariance structure) have complexities of O(NK) and O(NKD) respectively, which do not alter the dominant term. Furthermore, when used in conjunction with additive decoder-based models, PSA can reduce computational complexity by pruning inactive slots via automatic relevance determination (ARD) as outlined in Section 4.
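
As one plausible reading of this pruning step (a hypothetical sketch; the helper name and threshold are assumptions, not details from the paper), inactive slots could simply be dropped based on their mixing weights before decoding:

```python
import torch

def prune_inactive_slots(slots, mixing_weights, threshold=1e-2):
    """ARD-style pruning sketch: keep only slots whose mixing weight exceeds a
    small threshold, shrinking K for the downstream (additive) decoder.
    slots: (K, D); mixing_weights: (K,)."""
    keep = mixing_weights > threshold
    return slots[keep], mixing_weights[keep]
```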

Large-scale Experiments:

We empirically tested slot identifiability using more complex non-additive transformer decoders, following the SLATE [61] implementation and simply replacing the slot attention (SA) module with probabilistic slot attention (PSA). On the CLEVR dataset, we observed an SMCC of 0.73 ± 0.04 and an R2 of 0.55 ± 0.06, which are significantly better than all other models listed in Table 2 in the paper.

To demonstrate that PSA can scale to large-scale real-world data, we ran additional experiments on the Pascal VOC2012 dataset, following the exact "DINOSAUR" strategies and setups described in [60, r4] for fairness, then simply swapping out SA with PSA. Note that SA MLP (w/ DINO) denotes our replication of DINOSAUR MLP from [60] as a baseline. The table below shows the obtained results (all baselines are standard results taken from [60, r3]):

Models                          mBO_i            mBO_c
Block Masks                     0.247 ± 0.000    0.259 ± 0.000
SA                              0.222 ± 0.008    0.237 ± 0.008
SLATE                           0.310 ± 0.004    0.324 ± 0.004
Rotating Features               0.282 ± 0.006    0.320 ± 0.006
DINO k-means                    0.363 ± 0.000    0.405 ± 0.000
DINO CAE                        0.329 ± 0.009    0.374 ± 0.010
DINOSAUR MLP                    0.395 ± 0.000    0.409 ± 0.000
Ours:
SA MLP (w/ DINO)                0.384 ± 0.000    0.397 ± 0.000
SA MLP (w/ DINO)‡               0.400 ± 0.000    0.415 ± 0.000
PSA MLP (w/ DINO)               0.389 ± 0.009    0.422 ± 0.009
PSA MLP (w/ DINO)‡              0.405 ± 0.010    0.436 ± 0.011
PSA Transformer (w/ DINO)★      0.435 ± 0.01     0.499 ± 0.01

‡ Using slot attention masks rather than decoder alpha masks for evaluation.

★ Trained for ≈15K steps only due to time constraints (250K are needed).

The results show that PSA is competitive with SA at scale.

Finally, we have also included basic illustrations of compositional samples from the aggregate posterior on both CLEVR and Objects-Room datasets to verify our theory. Please note that these models are quite small and were not optimized for sample quality since they were used primarily to measure slot identifiability across runs in our main experiments.

[r2] Wu, Z., Hu, J., Lu, W., Gilitschenski, I. and Garg, A., 2023. Slotdiffusion: Object-centric generative modeling with diffusion models. Advances in Neural Information Processing Systems, 36, pp.50932-50958.

[r3] Löwe, S., Lippe, P., Locatello, F. and Welling, M., 2024. Rotating features for object discovery. Advances in Neural Information Processing Systems, 36.

[r4] Kakogeorgiou, I., Gidaris, S., Karantzalos, K. and Komodakis, N., 2024. SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22776-22786).

Final Decision

The authors introduce a novel approach to solving the problem of identifiability in object-centric representations by proposing a probabilistic variant of Slot Attention that leverages a Gaussian Mixture Model (GMM) on the latent space. This method effectively shifts the focus from assumptions about the generator to the latent distribution, providing a robust solution for identifying ground-truth object representations. The authors support their approach with solid theoretical proofs and validate them through both toy data and high-dimensional image experiments.

The main issue raised by the reviewers is the lack of additional large-scale experiments. While such experiments would further strengthen the paper, I believe the theoretical contributions on the slot identifiability and the current experimental results already represent a substantial and valuable contribution, warranting acceptance.