4.8

/10

withdrawn4 位审稿人

最低3最高6标准差1.1

4.5

置信度

正确性2.3

贡献度2.3

表达2.8

ICLR 2025

Seeing the part and knowing the whole: Object-Centric Learning with Inter-Feature Prediction

Junhong Zou,Xiangyu Zhu,Zhen Lei

OpenReview PDF

提交: 2024-09-23更新: 2024-11-15

摘要

关键词

Object-Centric LearningSelf-Supervised LearningComputer Vision

评审与讨论

审稿意见

评分: 3置信度: 52024-10-29

The paper presents a predictor that can forecast the image features at a specific position based on another feature. Additionally, it introduces an object-centric learning method that encourages the assignment of image features, which can effectively predict each other using the pre-trained predictor, to the same slot, and vice versa. Experiments conducted on multiple datasets demonstrate the effectiveness of the proposed method.

优点

The proposed method is innovative and straightforward to implement.
Experiments demonstrate that this method outperforms recent object-centric learning approaches across three datasets.

缺点

The proposed predictor does not align with the title "Seeing the part and knowing the whole." In reality, the predictor only observes one part and recognizes another. Additionally, the proposed OCL method does not demonstrate the ability to complete the entire object based on the occluded part. Figure 1 in the paper is also misleading. I would appreciate it if the authors could clarify whether their method can actually complete whole objects from parts or not.
The learnable segmentation M is not depicted in Figure 2. I suggest the authors update Figure 2 to include the learnable segmentation M in line 173 around the alpha mask.
The presentation of the paper could be improved. The figures are not inserted in PDF format, and some expressions are informal (e.g., the term ‘clamp(a, b)’ in line 251).
Compared methods, such as LSD and DINOSAUR, perform better on complex datasets like CLEVRTEX, MOVi-E, and COCO. However, the datasets selected in this paper are relatively simple, raising concerns about the scalability of this approach to more complex data. I suggest authors include their methods on more complex datasets (CLEVRTEX, MOVi-E, and COCO), or disccus their limitation on applying the proposed method to more complex datasets.
The rationale for using L1 loss instead of L2 loss as the reconstruction loss is unclear, I would appreciate it if authors could provide their rationale for choosing L1 loss and include an ablation study comparing different loss functions (e.g., L1 vs L2) in their experiments.

问题

According to the attachment, the decoder used for the MOVi-C dataset is transformer-based. How does this type of decoder generate the alpha masks needed to compute the prior loss?
Is it possible to apply this method to more complex datasets, as well as images with higher resolutions? Achieving good results on more challenging datasets could enhance the soundness of the proposed method.
The visualization in Figure 3 raises some questions, particularly regarding the results of BOQSA on the PTR dataset, which seem to outperform the visualizations in the original paper. Were any techniques implemented to improve the performance?

审稿意见

评分: 5置信度: 52024-11-02

This paper proposes a new regularization for constructing the holistic object slot in OCL, which is achieved by utilizing the inter-predictability among the features from different parts of the same object.

优点

The writing is good and easy to follow.

缺点

There are some weakness and questions that required to be answered clearly:

For the gestalt ability of our human, are the different parts naturally predicted according to the appearance/structure? Or, human predict the missing parts given the prior that we already have a semantic understanding? Any proof?
For the inter-predictability among the features, how to deal with the semantic/object occurance in the real world? For example, given the high occurance of the keyboard and mouse, their feature could be with high inter-predictability. How to limit the inter-predictability on the component/part level?
For the prediction of similarity, relative postion should be used. The utilization of absolute position is wrong, which cannot reveal the structural information among the parts. Moreover, it will leads to lots of noise for semantic understanding, since object semantics are postion invariant.
The training of the similarity prediction seems require supervision, which is unfair for other unsupervised methods.
Given the training loss for similarity prediction only focuses on increasing the cosine similarity, any possible that a trival solution will appear, which predict high similarity for any pair of features.

问题

Please refer to the weaknesses.

审稿意见

评分: 6置信度: 42024-11-03

The paper introduces an interesting idea to object centric learning (ocl) called Predictive Prior, inspired by human perception abilities. Traditional ocl models use an auto-encoding paradigm to create object representations by assigning image features to discrete object "slots" and reconstructing images from these slots. However, these models struggle with complex object appearances due to reliance on color or spatial regularities.

The Predictive Prior approach leverages the principle that features belonging to the same object can predict each other. It trains a prediction network to assess the mutual predictability between features across different spatial locations within an image. This prediction-based relationship is then used to guide object-slot assignments.

Experiments on datasets such as MOVi-C, Super-CLEVR, and PTR show that the Predictive Prior-based model outperforms previous OCL methods in object discovery, compositional generation, and visual question answering (VQA).

优点

The work shows strong results across various datasets and baselines.
The idea is interesting and is well analyzed.
Clean writing and figures

缺点

Missing Ablations, There are multiple loss functions used, however they haven't been ablated
Certain SoTA baselines on Dino for unsupervised segmentation are missing such as : CuTLER (https://arxiv.org/pdf/2301.11320https://arxiv.org/pdf/2301.11320)

问题

Can the authors compare or discuss results against methods such as CuTLER that use dino features and get good results.
How much role does pre-trained Dino play in the improvement across baselines. What if the authors trained from scratch using the new objective.
Can the authors ablate reconstruction vs their proposed objective, how does switching of one of them affect final accuracy?

审稿意见

评分: 5置信度: 42024-11-03

Humans instinctively decompose scenes into objects, enabling strong visual understanding. Object-Centric Learning (OCL) seeks to encode scene information into object vectors called ‘slots.’ Traditional OCL models use an auto-encoding approach, reconstructing images from these slots but often fail with complex objects, as reconstruction alone doesn’t ensure accurate object grouping. To improve this, this paper introduces a Predictive Prior inspired by human gestalt perception, where features of the same object can predict each other. This prior is implemented as an external loss, guiding the model to group predictable features into the same slot and separate those that aren't. The paper shows decent results on SuperCLEVER, MoVI-C etc.

优点

Overall intuition of the paper makes sense and representing part and whole of a scene is a pretty important problem as discussed in [1] and in the literature multiple times.
Results on SuperCLEVER, MoVI-C are decent and show the efficacy of the current method pretty well.

缺点

Results are missing on real world datasets like COCO & OpenImages. The current results are on CLEVEr and MoVI-C which are not very representative of real world results.
Comparison with diffusion based approaches like SlotDiffuzr[1], SysBinder[2] is missing. Adding diffusion based method will be pretty critical to the paper.
Computational cost in adding the PREDICTIVE PRIOR? It would be good to see the computational cost added by newer modules introduced in the paper.
Discussion on [3] should definitely be added in the paper. References:
SlotDiffuzr: SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
NEURAL SYSTEMATIC BINDER
How to represent part-whole hierarchies in a neural network.

问题

Overall the paper is good, but the results are missing on large scale real-world datasets which is definitely a issue in the current version.

撤稿通知

2024-11-15

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.