Identifiable Object Representations under Spatial Ambiguities
We introduce a probabilistic model that resolves spatial ambiguities and provides theoretical guarantees for identifiability without additional viewpoint annotations.
Abstract
Reviews and Discussion
The paper presents a multi-view probabilistic approach aimed at learning modular object-centric representations that are essential for human-like reasoning. This paper introduces View-Invariant Slot Attention (VISA), which addresses spatial ambiguities caused by occlusions and view ambiguities. This method aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, this approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations.
Questions for Authors
No
Claims and Evidence
This paper highlights that while OCLOC focuses on achieving object consistency unconditional on views, this approach explicitly learns view-invariant object representations. The paper provides theoretical guarantees for identifiability under partial or full occlusion without additional view information, which advances beyond previous work in single-view OCL. The use of spatial Gaussian mixture models over the latent distribution across viewpoints to encourage identifiability without auxiliary data is well justified. The experimental results across multiple datasets provide convincing evidence for the theoretical claims.
Methods and Evaluation Criteria
The evaluation criteria focus on three key claims: identifiability, invariance, and equivariance. The authors use appropriate metrics such as the slot mean correlation coefficient (SMCC) and invariant SMCC (INV-SMCC) to quantify their results. The comparison with various baselines, including standard additive autoencoder setups, slot attention (SA), probabilistic slot attention (PSA), MulMON, and OCLOC, provides a thorough assessment of the performance.
Theoretical Claims
was not reviewed in depth
Experimental Design and Analysis
The paper includes extensive experimental validation on standard benchmarks (CLEVR-MV, CLEVR-AUG, GQN) and complex datasets (MVMOVI-C and MVMOVI-D), demonstrating the robustness of this method.
Supplementary Material
was not reviewed in depth
Relation to Existing Literature
I am not familiar with the literature in this area
Missing Essential References
I am not familiar with the literature in this area
Other Strengths and Weaknesses
No
Other Comments or Suggestions
No
We thank the reviewer for their feedback and are glad that the reviewer found our experiments to be extensive and our method to be robust.
The primary focus of the work is that it provides theoretical guarantees for identifiability in multi-view scenarios and requires no viewpoint annotations, building upon the single-view formalisms of Kori et al. (2024), Brady et al. (2023), and Lachapelle et al. (2023). To the best of our knowledge, this is the first work addressing the explicit formalisation of the assumptions and theory required for achieving this. We also provide empirical evidence with synthetic datasets, where the transformation in distribution clearly demonstrates our claims.
In order to make the paper more self-contained, and based on other reviewers' feedback, we plan to include a complexity argument and metric details, along with model architectures, as below.
VISA complexity: VISA adds the cost of the inverse and forward viewpoint transformations, while retaining the same per-view complexity as slot attention and probabilistic slot attention. Additionally, the representation matching function contributes a further term that does not alter the dominant per-view term in the general case. Similar to PSA, when VISA is combined with an additive decoder, the complexity of the decoder can be lowered due to the property of automatic relevance determination (ARD), eliminating the need to decode inactive slots.
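For intuition only, here is an illustrative decomposition under our own assumptions (not the exact terms from the paper): per-view slot attention costing $\mathcal{O}(TKND)$ for $T$ iterations, $K$ slots, and $N$ tokens of dimension $D$; an affine viewpoint transformation costing $\mathcal{O}(ND^2)$ per view; and a Hungarian-style matching over $K$ slots across $V$ views costing $\mathcal{O}(VK^3)$:

$$\mathcal{O}\Big(\underbrace{VTKND}_{\text{per-view slot attention}} \;+\; \underbrace{VND^2}_{\text{viewpoint transforms}} \;+\; \underbrace{VK^3}_{\text{slot matching}}\Big),$$

where, under these assumptions, the matching term does not alter the dominant slot-attention term whenever $K^2 \ll TND$.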
SMCC details: we borrow the definition of SMCC from Kori et al. (2024). For two sets of slots $\mathcal{Z} = \{z_k\}_{k=1}^{K}$ and $\hat{\mathcal{Z}} = \{\hat{z}_k\}_{k=1}^{K}$ extracted from the same scenes, the SMCC is obtained by matching the slot representations and their order: slots in $\hat{\mathcal{Z}}$ are reordered with respect to $\mathcal{Z}$ according to the slot assignment, followed by a learned affine mapping between the aligned $\mathcal{Z}$ and $\hat{\mathcal{Z}}$, after which the mean correlation coefficient is computed. By design the SMCC metric is bounded between [0, 1], with higher values being better. We will add these details in the appendix.
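To make the metric concrete, below is a minimal sketch of how such a slot-matching correlation score could be computed; the Hungarian assignment on pairwise correlations, the least-squares affine fit, and the helper name `smcc_like` are our illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def smcc_like(Z, Z_hat):
    """Illustrative slot-matching correlation score between two slot sets.

    Z, Z_hat: arrays of shape (S, K, D) -- S scenes, K slots, D dims.
    Returns a value in [0, 1]; higher means the two representations
    agree up to slot permutation and an affine map.
    """
    S, K, D = Z.shape
    # 1) Match slot order per scene via a cost based on correlation.
    aligned = np.empty_like(Z_hat)
    for s in range(S):
        cost = -np.corrcoef(Z[s], Z_hat[s])[:K, K:]   # (K, K) negative correlations
        rows, cols = linear_sum_assignment(cost)      # Hungarian assignment
        aligned[s] = Z_hat[s][cols]
    # 2) Fit an affine map from aligned Z_hat to Z (least squares).
    X = np.concatenate([aligned.reshape(-1, D), np.ones((S * K, 1))], axis=1)
    Y = Z.reshape(-1, D)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_pred = X @ W
    # 3) Mean absolute correlation across feature dimensions.
    corrs = [abs(np.corrcoef(Y[:, d], Y_pred[:, d])[0, 1]) for d in range(D)]
    return float(np.mean(corrs))
```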
Decoder architecture: As detailed in the paper, we use two different classes of decoder architectures: (i) additive and (ii) non-additive. Within the additive class we use both spatial broadcasting and MLP decoders; for the non-additive class we use transformer decoders. In terms of architecture, we follow SA (Locatello et al., 2020) for the spatial broadcasting decoders and DINOSAUR (Seitzer et al., 2023) for both the MLP and transformer decoders. We will describe the architectures in detail in the appendix, as in the response to reviewer zW6Z.
Thanks for the rebuttal. After reading the other reviews and the rebuttal, I recommend weak acceptance of this paper. I encourage the authors to revise the paper to incorporate the rebuttal, either in the main text or in the supplementary materials.
The paper aims to learn identifiable object representations even under spatial ambiguities, i.e., occlusions and view ambiguities. The authors propose View-Invariant Slot Attention (VISA), a probabilistic slot attention variant, to learn such representations. Theoretical analysis is provided to prove identifiability under the given assumptions. Empirical results on synthetic datasets are shown to verify the model.
update after rebuttal
Please see the rebuttal comment below.
Questions for Authors
- The patterns in the result are not clearly explained. For example, could you elaborate more on 'vary by an affine transformation' in the caption of Figure 6? How do you evaluate and compare the results?
- Can you elaborate more on the result that the computed SMCC is 0.72 (on page 7, line 379-right column). How would you interpret this value?
- In real-world settings, when applying your model, we need to ensure viewpoint sufficiency as an assumption. Do you have some experiment results on real-world datasets? If not, based on your data generation process, do you have some insights on how to ensure this assumption if other researchers are going to apply your model? The reviewer raises this as a limitation, independent of the technical details within this assumption.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
The proposed method is reasonable. The synthetic datasets are newly proposed but reasonable. The evaluation metrics are referenced but not detailed.
Theoretical Claims
The theoretical claims in the main paper are checked. To the best of my knowledge, the claims are reasonable and correct.
Experimental Design and Analysis
The data generation process (Figure 7) and the experimental results in the main paper are checked. The patterns in the results are not fully explained. For example, 'vary by an affine transformation' in the caption of Figure 6 could be further explained/illustrated.
Supplementary Material
The reviewer reviewed all the parts, but cannot guarantee that all proofs in part F are correct.
Relation to Existing Literature
This work extends the field of object-centric representation learning. It explicitly learns view-invariant object representations, rather than learning object-consistent representations unconditional on views as in prior work.
Missing Essential References
The reviewer is not aware of any missing essential references.
Other Strengths and Weaknesses
Strength:
- The paper is well-written, with a clear structure and thorough proof.
- The intuition and example sections are helpful for understanding the proof.
Weakness:
- As mentioned in the weakness section on page 8, the viewpoint sufficiency assumption is strong. The experiments are conducted on well-designed datasets. It is questionable whether the proposed model could be applied to real-world data.
- The evaluation metrics are not well introduced. It is a little hard to understand what the numbers represent. For example, the computed SMCC is 0.72 (on page 7, line 379, right column), but it is hard for readers to judge how good this number is.
Other Comments or Suggestions
Some typos:
- In Definition 3.1 (line 160, right column), there is a c before the colon, but it does not appear after it.
- The caption for figure 5 (line 341): feature feature distribution.
We thank the reviewer for their detailed feedback and are glad that the reviewer finds our paper well written, clear, and with reasonable claims and proofs. We are also glad that intuitions aided in the understanding of theorems and our claims.
As mentioned in the weakness section on page 8, the viewpoint sufficiency assumption is strong. It is questionable whether the proposed model could be applied to real-world data.
We do agree that experiments on large real-world datasets would be helpful; however, to the best of our knowledge there aren’t any real-world datasets with many viewpoints. This was the main motivation for proposing the synthetic dataset, which consists of 72,000 scenes with 5 viewpoints each (72,000 x 5 images); in terms of variation it covers 930 distinct objects and 458 complex backgrounds.
Additionally, the primary focus of the work is that it provides theoretical guarantees for identifiability in multi-view scenarios, and requires no viewpoint annotations, which builds upon the formalisms in a single view scenario in Kori et al. (2024); Brady et al. (2023); Lachapelle et al. (2023). To the best of our knowledge, this is the first work addressing explicit formalisations of assumptions and theory required for achieving this.
Having said that, if the reviewer can point to any large real-world dataset we would be happy to consider it for the final version of the paper.
The evaluation metrics are not well introduced. It is a little hard to understand what the numbers represent.
Thanks for pointing this out. We borrow the definition of SMCC from Kori et al. (2024). For two sets of slots $\mathcal{Z} = \{z_k\}_{k=1}^{K}$ and $\hat{\mathcal{Z}} = \{\hat{z}_k\}_{k=1}^{K}$ extracted from the same scenes, the SMCC is obtained by matching the slot representations and their order: slots in $\hat{\mathcal{Z}}$ are reordered with respect to $\mathcal{Z}$ according to the slot assignment, followed by a learned affine mapping between the aligned $\mathcal{Z}$ and $\hat{\mathcal{Z}}$, after which the mean correlation coefficient is computed. By design the SMCC metric is bounded between [0, 1], with higher values being better. We will add these details in the appendix.
For example, could you elaborate more on 'vary by an affine transformation' in the caption of Figure 6? How do you evaluate and compare the results?
In the context of Figure 6, 'vary by an affine transformation' means that the distributions across view sets differ only in scale and translation. We do agree that analysing high-dimensional latents is complicated; Figures 5-6 are our attempt at visualising the feature-wise aggregated distributions to capture this trend. We have discussed the behaviour in the paper, but will expand the discussion for clarity.
Additionally, we have also included a 2D-variant experiment in Appendix G.1, where the distributions can be seen to be equivalent up to an affine transformation, validating our claims. Depending on space availability, we will bring the 2D experiments to the main paper.
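To spell out what 'equivalent up to an affine transformation' means for the latent mixture (a worked illustration of ours, not taken from the paper): if the slot latents of one view set are related to another by an affine map $\hat{z} = A z + b$, then each Gaussian component transforms as

$$ z \sim \mathcal{N}(\mu_k, \Sigma_k) \;\;\Longrightarrow\;\; \hat{z} = A z + b \sim \mathcal{N}\big(A\mu_k + b,\; A\Sigma_k A^{\top}\big), $$

so the two mixtures share the same number of components and mixing weights and differ only in their means and covariances; when $A$ is diagonal this reduces to a per-feature scale and translation, matching the scale-and-translation behaviour described above.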
do you have some insights on how to ensure this assumption if other researchers are going to apply your model? The reviewer addresses this as a limitation, independent of the technical details within this assumption.
Theoretically, verifying the viewpoint sufficiency assumption is challenging. In some cases, as few as two viewpoints may be enough to satisfy this assumption, while in more complex scenes, adding additional views can improve visibility. However, beyond a certain number of viewpoints, we expect diminishing returns in performance, as the marginal gain from each additional view decreases. When we have control over the data generation process, we can adopt an adaptive viewpoint selection strategy. This could involve dynamically selecting views based on occlusion-aware heuristics, or ensuring a larger set of viewpoints that are equidistant from the scene of interest while varying the angle and azimuth. This approach helps mitigate occlusion issues and ensures more robust object visibility; we will include this discussion in the main paper.
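As a hypothetical illustration of such a strategy (the functions and parameters below are our own, not part of the paper), one could sample candidate cameras equidistant from the scene centre while varying azimuth and elevation, and check that every object is seen in at least one view:

```python
import numpy as np

def sample_viewpoints(radius, n_views, elev_range=(15.0, 60.0), seed=0):
    """Sample cameras equidistant from the origin with varied azimuth/elevation."""
    rng = np.random.default_rng(seed)
    azimuths = np.linspace(0.0, 360.0, n_views, endpoint=False)
    elevations = rng.uniform(*elev_range, size=n_views)
    cams = []
    for az, el in zip(np.deg2rad(azimuths), np.deg2rad(elevations)):
        cams.append(radius * np.array([np.cos(el) * np.cos(az),
                                       np.cos(el) * np.sin(az),
                                       np.sin(el)]))
    return np.stack(cams)  # (n_views, 3) camera positions looking at the origin

def views_are_sufficient(visibility):
    """visibility: bool array (n_views, n_objects); True if object visible in a view."""
    return bool(visibility.any(axis=0).all())  # every object seen in >= 1 view
```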
Thanks for the detailed rebuttal. The authors have addressed most of my concerns.
I maintain my point that the limited range of applicable scenarios is a major limitation. As the authors pointed out, ''verifying the viewpoint sufficiency assumption is challenging'' and there are no real-world datasets that could be easily adapted to verify the approach. However, I acknowledge the technical contribution of providing theoretical guarantees for identifiability under the assumption.
I have also read the reviews from other reviewers and authors' rebuttal. I would like to maintain my original score - weak accept, as the final rating.
Thank you very much for your response, we are glad that most of your concerns are addressed.
We do agree that verifying view sufficiency is challenging; however, we respectfully disagree with the assessment of limited applicability to real-world datasets. To demonstrate applicability in scenarios where view sufficiency is not met, we illustrated the model's performance on the MVMOVI-D dataset, which is generated by dynamically sampling camera positions that vary across scenes; please refer to the data generation process in Appendix D.2 and the corresponding discussion in Appendix G.4.
In terms of real-world data, we considered the MVImgNet dataset: we randomly selected 1, 10, 15, and 20 viewpoints to extract multiple images of each rendered scene and performed VISA inference; please find the results in terms of the mean best overlap (mBO) metric in the table below.
| Methods | NViews = 1 | NViews = 10 | NViews = 15 | NViews = 20 |
|---|---|---|---|---|
| SA-MLP | 0.29 | – | – | – |
| PSA-MLP | 0.30 | – | – | – |
| SA-Transformer | 0.36 | – | – | – |
| PSA-Transformer | 0.34 | – | – | – |
| VISA-MLP | 0.29 | 0.34 | 0.52 | 0.53 |
| VISA-Transformer | 0.36 | 0.58 | 0.62 | 0.62 |
Note that even though the considered real-world data is a single-object dataset, the results here are zero-shot: we took the model trained on the MVMOVI-D dataset and ran direct inference on this dataset. The numbers could easily be improved by training on this specific dataset; we ran this test simply to demonstrate the applicability of the proposed method rather than to improve state-of-the-art results.
As seen from the results VISA still performs better as more views are considered. While view sufficiency cannot be directly verified, it can be indirectly assessed through downstream performance in the context of the task at hand. In this case, we conclude that selecting 15 random viewpoints is sufficient to achieve the desired performance.
These experiments do show the applicability of the proposed method in real-world settings, contrary to the reviewer's initial perception; please consider this when making the final decision.
This paper focuses on object-centric learning and proposes View-Invariant Slot Attention (VISA). It extends probabilistic slot attention (PSA) to multi-view scenarios. It introduces a content descriptor, learns identifiable object-centric representations from multi-view observations, and accounts for the occlusion and view ambiguities that emerge in multi-object settings. Theoretical analysis is provided. Empirical experiments on several datasets demonstrate VISA's good performance compared with other object-centric learning baselines.
Questions for Authors
- The K in equation 6 appears to be a hyperparameter of choice that controls the number of components in GMM. Is the performance sensitive to K?
- How is the Transformer used for VISA and PSA? Is there any modification? What are the formats of the inputs and outputs?
- If some object is completely occluded, how do you identify the object and its visibility without knowing the 3D prior of the environment?
Claims and Evidence
The reviewer finds it difficult to collect enough empirical evidence to support the claims of the paper. The experimental results do not straightforwardly demonstrate identifiability, viewpoint invariance, or the resolution of spatial ambiguity. It is difficult to parse the curves in Figures 5 and 6. There is no visualization of any scene to illustrate these results. (W1)
Methods and Evaluation Criteria
The reviewer finds it difficult to assess the novelty of the method.
It seems to be strongly connected to PSA while the writing of Section 4 fails to directly point out the connection and extension. It is difficult to associate the equations in Section 4 with equations in Section 3. For example, while the formalization in the beginning of section 4 is clear, the reviewer fails to see the similarity and difference between the "Viewpoint specific slots" and a single-view PSA. Since the PSA is introduced in Section 3 as preliminary, Section 4 should refer to it to help explain the new method (VISA) and where it differs from PSA. (W2)
Meanwhile, the details of the model implementation are heavily missing. The writing vaguely suggests the use of an MLP and a Transformer without any details, such as the format of the inputs and outputs or whether there are any modifications. (W3)
Theoretical Claims
I did not rigorously check the correctness of any proofs.
Experimental Design and Analysis
The experiments are conducted on several benchmarks, including public (CLEVR, GQN, GSO) and newly generated ones (mv-MoVIC, mv-MoVID). VISA's results appear to be superior. (S1)
Table 2 aims to show that the proposed VISA generalizes to novel views; however, the performance gap is minimal for all the baselines as well. Perhaps more diverse viewpoints should be provided, but it is hard to tell, as the test environments are not illustrated in the paper. (W4)
Supplementary Material
I mainly reviewed the figures and tables. I did not review the proof.
Relation to Existing Literature
The related work is sufficient. There is an additional section of related work in the appendix.
Missing Essential References
None.
Other Strengths and Weaknesses
Strengths:
- See S1 from the above discussions.
- S2: the paper tackles multi-view object-centric representation learning, which the reviewer believes to be an important and novel topic.
Weaknesses:
- See W,1 W2, W3 and W4 from the above discussions.
Other Comments or Suggestions
- typo in line 253-254, there are two and no .
- typo line 340-341 "feature feature" distribution
- For visualization of the mvMovi-C in figure 13, there is no visualization of any cluttered environments where objects are occluded.
We thank the reviewer for their detailed feedback and are glad to see that the reviewer acknowledges our superior performance and believes the paper addresses a novel and important topic.
(W1) It is difficult to parse the curves in Figure 5 and 6. …
We do agree that analysing high-dimensional latents is complicated: Figures 5 and 6 are an attempt to visualise the feature-wise aggregated distributions to capture the trend. While we have described the behaviour of this distribution in the discussion, we will expand it for clarity in the final version. Additionally, we have also included a 2D-variant experiment in Appendix G.1, where the distributions can be seen to be equivalent up to an affine transformation, validating our claims. We would also like to point to the qualitative results included in Appendix Figures 10-14. Depending on space availability, we will bring the 2D experiments to the main paper.
(W2)It seems to be strongly connected to PSA while the writing of Section 4 …
As mentioned in the paragraph 'Viewpoint specific slots' (L193-L214), the viewpoint-specific slots are extracted with the EM algorithm proposed in Kori et al. (2024) (i.e., the same as PSA), but applied to a transformed encoder output, where the transformation is a viewpoint-specific inverse transformation. This is done to project the slot representations of all views into a common vector space.
As pointed out by the reviewer and in the viewpoint-specific slots paragraph, the algorithm here is the same as PSA, with the only difference being the transformation of the inputs. We will cross-reference Section 3 in that paragraph to make this difference more explicit.
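To make the relationship to PSA concrete, here is a heavily simplified sketch (our own pseudocode, not the authors' implementation; `psa_em`, `inverse_view_transform`, and `aggregate_slots` are hypothetical helpers) of how viewpoint-specific slots could be obtained by running the PSA EM routine on inversely transformed encoder features and then aggregated across views:

```python
def visa_style_inference(encoder_feats_per_view, psa_em, inverse_view_transform,
                         aggregate_slots):
    """Sketch: per-view PSA on inversely transformed features, then aggregation.

    encoder_feats_per_view: list of (N, D) feature arrays, one per viewpoint.
    psa_em: runs probabilistic-slot-attention-style EM, returning (K, D) slots.
    inverse_view_transform: maps view-specific features into a shared space.
    aggregate_slots: combines matched per-view slots into view-invariant ones.
    """
    per_view_slots = []
    for v, feats in enumerate(encoder_feats_per_view):
        shared_feats = inverse_view_transform(feats, view=v)  # undo view-specific effects
        per_view_slots.append(psa_em(shared_feats))           # same EM routine as PSA
    # View-invariant content slots come from aggregating matched per-view slots.
    return aggregate_slots(per_view_slots)
```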
Additionally, the primary focus of the work is that it provides theoretical guarantees for identifiability in multi-view scenarios, and requires no viewpoint annotations, which builds upon the formalisms in a single view scenario in Kori et al. (2024); Brady et al. (2023); Lachapelle et al. (2023). To the best of our knowledge, this is the first work addressing explicit formalisations of assumptions and theory required for achieving this.
(W3) Meanwhile, the details of the model implementation are heavily missing…
Thanks for pointing this out, we will include them:
Decoder architecture: As detailed in the paper, we use two different classes of decoder architectures: (i) additive and (ii) non-additive; within the additive class we use both spatial broadcasting and MLP decoders, and for the non-additive class we use transformer decoders. Specifically, we follow SA (Locatello et al., 2020) for the spatial broadcasting decoders and DINOSAUR (Seitzer et al., 2023) for both the MLP and transformer decoders; for details about the architectures please refer to the response to reviewer zW6Z.
The K in equation 6 appears to be a hyperparameter of choice that …?
That's a valid point; however, the dependency on K is inherent to SA and PSA, and here we build on these works to address spatial ambiguities by considering multiple viewpoints. Similar to the ARD study in Kori et al. (2024), during inference we observed that when K is set higher than the required number, the model ignores the additional slots by driving their mixing coefficients to 0; a lower K, however, affects performance, similar to the ablations in Locatello et al. (2020).
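As an illustration of this ARD behaviour (a hypothetical snippet, not the paper's code; the threshold value is arbitrary), inactive slots can be identified and dropped from decoding based on their mixing coefficients:

```python
import numpy as np

def prune_inactive_slots(slots, mixing_coeffs, threshold=1e-3):
    """Keep only slots whose GMM mixing coefficient is non-negligible.

    slots: (K, D) slot means; mixing_coeffs: (K,) values summing to 1.
    """
    active = mixing_coeffs > threshold
    return slots[active], mixing_coeffs[active]
```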
If some object is completely occluded, how do you identify the object and its visibility without knowing the 3D prior of the environment?
That’s a great point. One of our key assumptions is viewpoint sufficiency, meaning that each object in the environment is visible in at least one of the considered viewpoints. If an object is completely occluded across all viewpoints, identifying it falls beyond the scope of this work. We will make this explicit in the paper.
Thanks for the rebuttal. I believe the authors have addressed most of my concerns. I am willing to raise my rating to weak accept.
The paper introduces View-Invariant Slot Attention (VISA), a probabilistic object-centric learning model designed to achieve identifiable object representations from multi-view images without explicit viewpoint annotations. VISA overcomes limitations of single-view methods by resolving spatial ambiguities like occlusions and viewpoint variations. The authors provide theoretical guarantees of identifiability using latent spatial Gaussian Mixture Models (GMMs) and empirically validate the approach on synthetic and newly proposed datasets (MVMOVI-C and MVMOVI-D).
update after rebuttal
The authors addressed my concerns in the rebuttal. I will keep my rating.
Questions for Authors
Nil.
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence and experiments.
The authors state their model "demonstrates scalability on two new complex datasets (MV-MOVI-C and MV-MOVI-D)". Although Table 2 provides evidence of good performance on these datasets, more extensive details on computational complexity, model training time, parameter counts, and extensive ablations on larger real-world datasets would strengthen scalability claims.
Methods and Evaluation Criteria
The methods and evaluation criteria proposed in this paper are mostly suited to the problem addressed.
Theoretical Claims
I briefly checked the correctness of the theoretical proofs presented in the paper.
Experimental Design and Analysis
The experimental designs on synthetic datasets are sound, though real-world datasets are lacking.
Supplementary Material
I reviewed the experiments part of the supplementary materials.
Relation to Existing Literature
The paper positions itself clearly in relation to the broader literature on object-centric representation learning (OCL), nonlinear independent component analysis (ICA), and representation identifiability.
Missing Essential References
Nil.
Other Strengths and Weaknesses
Although empirical evidence from synthetic data and benchmarks strongly suggests correctness, explicit verification for complex real-world cases remains a limitation.
Other Comments or Suggestions
L72-73, the set symbol may conflict with the object symbol.
We thank the reviewer for their detailed feedback and are glad that the reviewer found our claims to be supported by clear and convincing evidence, with correct proofs and sound experiments.
more extensive details on computational complexity, model training time, parameter counts
VISA complexity: VISA adds the cost of the inverse and forward viewpoint transformations, while retaining the same per-view complexity as slot attention and probabilistic slot attention. Additionally, the representation matching function contributes a further term that does not alter the dominant per-view term in the general case. Similar to PSA, when VISA is combined with an additive decoder, the complexity of the decoder can be lowered due to the property of automatic relevance determination (ARD), eliminating the need to decode inactive slots.
Decoder architecture: As detailed in the paper, we use two different classes of decoder architectures: (i) additive and (ii) non-additive; within the additive class we use both spatial broadcasting and MLP decoders, and for the non-additive class we use transformer decoders. Concretely, we follow SA (Locatello et al., 2020) for the spatial broadcasting decoders and DINOSAUR (Seitzer et al., 2023) for both the MLP and transformer decoders. In detail we use:
- spatial broadcasting decoders:
Input/Output: The generated slots are each broadcast onto a 2D grid and augmented with position embeddings. Similar to slot attention, each such grid is decoded using a shared CNN to produce an output of size W × H × 4, where W and H are the width and height of the image, respectively. The output channels encode the RGB color channels and an (unnormalized) alpha mask. Further, we normalize the alpha masks with a Softmax and take convex combinations to obtain the reconstruction.
Shared CNN architecture: 3 x [Conv (kernel = 5x5, stride=2), LeakyReLU(0.02)] + Conv (kernel = 3x3, stride=1), LeakyReLU(0.02)
- MLP decoders:
Input/Output: similar to the spatial broadcasting decoder, each slot representation is broadcast onto N tokens and augmented with position embeddings. Each individual slot representation is then transformed with a shared MLP decoder to generate a representation with the feature dimension along with an additional alpha mask, which is further normalised with a Softmax and used to form convex combinations to obtain the reconstruction (a minimal illustrative sketch is given after this list).
Shared MLP architecture: [Linear (d, d, bias = False), LayerNorm(d)] + 3 x [Linear (d, d_{hidden}), LeakyReLU(0.02)] + Linear (d_{hidden}, d_{feature}+1)
- Transformer decoders:
Input/Output: the transformer decoder takes the linearly transformed encoder output and the extracted slots as input, and returns the slot-conditioned features, with feature dimension d_{feature}, as output.
Transformer architecture: made up of 4 transformer blocks, where each block consists of self-attention over the input tokens, cross-attention with the set of slots, and a residual two-layer MLP with hidden size d_{hidden}. Before the transformer blocks, both the initial input and the slots are linearly transformed, followed by a layer norm.
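Below is a minimal, self-contained sketch of the MLP-decoder path described above, written in PyTorch under our own assumptions about dimensions (N tokens, feature size d_feature) and the listed layer spec; it is illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BroadcastMLPDecoder(nn.Module):
    """Broadcast each slot over N tokens, decode features + alpha, then mix."""

    def __init__(self, d, d_hidden, d_feature, n_tokens):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.randn(1, 1, n_tokens, d) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(d, d, bias=False), nn.LayerNorm(d),
            nn.Linear(d, d_hidden), nn.LeakyReLU(0.02),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.02),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.02),
            nn.Linear(d_hidden, d_feature + 1),  # features + unnormalized alpha
        )

    def forward(self, slots):                      # slots: (B, K, d)
        B, K, d = slots.shape
        x = slots.unsqueeze(2) + self.pos_emb      # (B, K, N, d): broadcast + positions
        out = self.mlp(x)                          # (B, K, N, d_feature + 1)
        feats, alpha = out.split(out.shape[-1] - 1, dim=-1)
        weights = torch.softmax(alpha, dim=1)      # normalize alpha across slots
        return (weights * feats).sum(dim=1)        # (B, N, d_feature) convex mix
```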
Model training time: As detailed in Appendix G.7, our training usually takes between eight hours and a couple of days, depending on the model and the dataset. We run all our experiments on a cluster with NVIDIA L40 48GB GPU cards; we will add a pointer in the main text.
extensive ablations on larger real-world datasets would strengthen scalability claims.
We do agree that experiments on large real-world datasets would be helpful; however, to the best of our knowledge there aren’t any real-world datasets with many viewpoints. This was the main motivation for proposing the synthetic dataset, which consists of 72,000 scenes with 5 viewpoints each (72,000 x 5 images); in terms of variation it covers 930 distinct objects and 458 complex backgrounds.
Additionally, the primary focus of the work is to provide theoretical guarantees for identifiability in multi-view scenarios, while requiring no viewpoint annotations, which builds upon the formalisms in a single view scenario in Kori et al. (2024); Brady et al. (2023); Lachapelle et al. (2023). To the best of our knowledge, this is the first work addressing explicit formalisations of assumptions and theory required for achieving this.
Having said that, if the reviewer can point to any large real-world dataset we would be happy to consider it for the final version of the paper.
All the reviewers ultimately agree that the paper is acceptable; the proposed method addresses an important problem, namely object identification across different views in the presence of ambiguities. The paper theoretically investigates guarantees for the identifiability of objects, and various experimental investigations have been carried out on both standard and newly proposed custom datasets. The authors provided details in the rebuttal for all the concerns and comments, made various clarifications and explanations, and provided additional experimental results.