PaperHub
Overall rating: 6.3 / 10 · Poster · 4 reviewers
Ratings: 5, 6, 6, 8 (min 5, max 8, std 1.1)
Confidence: 3.3 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

Multi-Scale Fusion for Object Representation

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-16
TL;DR

By processing the same image at different sizes, we obtain representations at multiple scales, and the high-quality object super-pixels at one scale can augment the low-quality super-pixels of the same object at other scales.

Abstract

Keywords
Object-Centric Learning (OCL) · Variational Autoencoder (VAE) · Multi-Scale · Unsupervised Object Segmentation

Reviews and Discussion

Official Review
Rating: 5

This paper is motivated by Object-Centric Learning (OCL): it introduces a multi-scale fusion scheme into the vanilla VAE and improves its representation of objects at different scales. The motivation is convincing, but the technical novelty is limited, and multi-scale fusion is a widely used technique in many models and computer vision tasks. Though this paper proposes some specific design modules, like multi-scale feature augmentation, they are straightforward and the improvements are limited according to the ablation study. Though the technical novelty is limited, this paper investigates an interesting view, OCL, on VAE, which should benefit the community.

Strengths

The MSF is straightforward and easy to understand. According to the experimental results, it works well and improves a strong baseline model, DINO. OCL is an interesting and promising direction, and this paper outperforms previous OCL methods. The method can be generalized to various computer vision tasks and models to improve their representation of objects.

Weaknesses

According to Sec. 4.1, the improvements from MSF are limited, just 1-2 points in many metrics. The overall presentation is not satisfactory and difficult to understand for readers not familiar with OCL.

Questions

Can the proposed method be applied to state-of-the-art instance segmentation models, like MaskDINO, to improve their results (~50 mask AP) on MS-COCO for instance segmentation?

Comment

Thank you for your feedback.

Weakness 1

According to Sec. 4.1, the improvements from MSF are limited, just 1-2 points in many metrics.

We would like to clarify possible misunderstandings.

Techniques OGDR and MSF in Tab. 1 are added to SLATE separately, i.e., the table contains results for SLATE+OGDR and SLATE+MSF, not SLATE+OGDR+MSF together. Therefore, MSF improves SLATE by 5-7 points in both ARI and ARIfg, that is, ARI 24.18 → 30.95 and ARIfg 24.54 → 30.47.

Besides, the methods we cover are all strong baselines, and any improvement can be challenging. We exclude earlier methods like IODINE, MONet, SA and SAVi due to their lower accuracy. In OCL literature, ARI and ARIfg (2nd and 3rd columns in those tables) are widely adopted metrics, revealing more significant differences among methods.
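For readers less familiar with these metrics, below is a minimal sketch of how ARI and ARIfg are typically computed from a ground-truth and a predicted segmentation map (using scikit-learn; the foreground-filtering convention shown here is an assumption for illustration, not our exact evaluation code).

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_and_arifg(gt_mask, pred_mask, background_id=0):
    """gt_mask, pred_mask: (H, W) integer label maps for one image."""
    gt, pred = gt_mask.ravel(), pred_mask.ravel()
    ari = adjusted_rand_score(gt, pred)      # all pixels, background included
    fg = gt != background_id                 # keep ground-truth foreground only
    ari_fg = adjusted_rand_score(gt[fg], pred[fg])
    return ari, ari_fg

# Toy example: one object (label 1) on background (label 0); the prediction
# groups pixels identically but with different ids, so both scores are 1.0.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 0], [2, 2]])
print(ari_and_arifg(gt, pred))
```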

Weakness 2

The overall presentation is not satisfactory and difficult to understand for readers not familiar with OCL.

We have improved the presentation in our paper's new version.

And we would be happy to polish it further if you could provide specific examples or clarify which parts are unclear.

Question 1

Can the proposed method be applied to state-of-the-art instance segmentation models, like MaskDINO, to improve their results (~50 mask AP) on MS-COCO for instance segmentation?

Our MSF only modifies VAE's codebook quantization part, making it applicable to any model using codebook quantization. Unfortunately, MaskDINO does not have such a module. This requires further investigation in the future.
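For clarity, below is a minimal sketch of the codebook (nearest-neighbor) quantization step that MSF modifies; any model exposing such a step could, in principle, be combined with MSF. This is illustrative PyTorch code with hypothetical names, not our implementation.

```python
import torch

def codebook_quantize(z, codebook):
    """z: (N, D) continuous features; codebook: (K, D) learned code vectors.
    Returns the nearest code for each feature (straight-through gradient
    tricks and commitment losses are omitted for brevity)."""
    dist = torch.cdist(z, codebook)   # (N, K) pairwise L2 distances
    idx = dist.argmin(dim=1)          # index of the nearest code per feature
    return codebook[idx], idx

z = torch.randn(16, 64)               # e.g. 16 spatial positions, 64-dim features
codebook = torch.randn(512, 64)       # K = 512 codes
z_q, idx = codebook_quantize(z, codebook)
```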

Comment

Dear Reviewer 8Piy, we are following up on our response to your valuable feedback.

Your comments have been greatly appreciated and have helped refine the work.

If you had a chance to review the rebuttal, we would be grateful for any additional thoughts or clarifications, especially if there are unresolved concerns that we can address to improve the paper further.

Thank you again for your time and insights; we look forward to hearing from you soon.

Official Review
Rating: 6

This paper addresses the lack of multi-scale object representation in VAE-guided OCL problems by constructing a multi-scale VAE model. This enables the VAE-generated results to represent multi-scale objects, thereby improving performance across various tasks.

Strengths

  1. The paper is clearly written and easy to follow.

Weaknesses

  1. I think the paper's novelty is very limited. Although I am not familiar with object-centric learning, I believe that constructing multi-scale VAE features has already been applied in many generative tasks [1, 2]. Therefore, the core innovation of this paper, in my opinion, does not warrant a standalone publication. Can the authors explain the difference between MSF and other methods?

  2. Since I am not familiar with this field, the Area Chair (AC) and the authors may disregard my opinion if it appears to be biased.

[1] Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, arXiv:2404.02905. [2] SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, NeurIPS 2023.

Questions

I am not familiar with this field, so please refer to the weaknesses.

Comment

Thank you for your feedback.

Weakness 1

Although I am not familiar with object-centric learning, I believe that constructing multi-scale VAE features has already been applied in many generative tasks [1, 2]. Can the authors explain the difference between MSF and other methods?

Please note that we have added the related content to Sect. "A.1 Extended Related Work" of our paper's new version.

Short answer

  • VAR [1] vs MSF: VAR auto-regresses from smaller to larger scales of VAE representation, but with no information fusion among scales. In contrast, our MSF realizes fusion among different scales on VAE for the first time.

  • SPAE [2] vs MSF: SPAE relies on the multi-modal foundation model CLIP, while our MSF does not. SPAE averages multiple scales into one element-wise, simply mixing different scales together, i.e., no fusion among scales. Our MSF augments each scale's representation with the other scales' high-quality representations, i.e., fusion among scales.

Long answer

[MSF] interpolates the input and produces multiple scales of VAE representation, which augment one another. Every scale integrates high-quality representations from the other scales, yielding better visual explanation (Fig. 3). Specifically,

(1) The input image $T$ is spatially interpolated into a pyramid $\{T_n\}$, from high to low resolutions; after VAE encoding, we have multiple scales of VAE intermediate representations $\{Z_n\}$;

(2) We project each $Z_n$ into two copies, one for scale-invariant quantization using a shared codebook, yielding $\{X_n^{si}\}$, and the other for scale-variant quantization using different codebooks, yielding $\{X_n^{sv}\}$;

(3) We conduct our unique and novel inter-scale and intra-scale fusion upon $\{X_n^{si}\}$ and $\{X_n^{sv}\}$, yielding the augmented $\{X_n\}$, where every scale is augmented with the other scales' high-quality representations, i.e., fusion among the scales;

(4) The final $\{X_n\}$ is used to guide OCL training.
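For concreteness, here is a rough PyTorch sketch of steps (1)-(4). The projection in step (2) is folded away, and "high quality" is caricatured as low reconstruction error of $Z_n$; the helper names, shapes, and the exact selection rule are simplifications for exposition only, not our actual implementation.

```python
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    # z: (B, D, h, w); codebook: (K, D). Nearest-neighbor lookup per position.
    b, d, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, d)
    idx = torch.cdist(flat, codebook).argmin(dim=1)
    return codebook[idx].reshape(b, h, w, d).permute(0, 3, 1, 2)

def msf_sketch(image, encoder, codebook_si, codebooks_sv, scales=(1.0, 0.5, 0.25)):
    # (1) Image pyramid {T_n} and per-scale VAE features {Z_n}.
    feats = [encoder(F.interpolate(image, scale_factor=s, mode="bilinear"))
             for s in scales]
    fused = []
    for n, z in enumerate(feats):
        # (2) Scale-invariant (shared codebook) and scale-variant quantization.
        x_si = quantize(z, codebook_si)
        x_sv = quantize(z, codebooks_sv[n])
        # (3) Bring the other scales to this resolution and keep, per position,
        #     the candidate that reconstructs Z_n best (a stand-in for the
        #     codebook-matching rule used in the paper).
        cands = [x_si, x_sv] + [
            F.interpolate(quantize(o, codebook_si), size=z.shape[-2:], mode="bilinear")
            for m, o in enumerate(feats) if m != n]
        errs = torch.stack([(c - z).pow(2).mean(dim=1) for c in cands])  # (C, B, h, w)
        pick = F.one_hot(errs.argmin(dim=0), len(cands))                 # (B, h, w, C)
        pick = pick.permute(3, 0, 1, 2).unsqueeze(2).to(z.dtype)         # (C, B, 1, h, w)
        x_n = (torch.stack(cands) * pick).sum(dim=0)                     # (B, D, h, w)
        fused.append(x_n)  # (4) {X_n} then guides OCL training.
    return fused
```

The point this sketch tries to convey is that fusion selects among discrete candidates rather than averaging them, so a scale never inherits another scale's low-quality codes.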

[VAR] interpolates the single VAE intermediate representation and produces multiple scales of VAE representations, where there is no augmentation in between. Specifically,

(1) The input image $im$ is VAE encoded once, yielding one single intermediate representation $f$;

(2) The authors spatially interpolate $f$ into multiple scales and residually quantize them with a shared codebook, yielding a sequence of discrete representations $R=[r_1, r_2, \ldots, r_K]$, from low to high resolutions;

(3) There is no fusion among the scales;

(4) Finally the authors conduct auto-regression upon $R$ for image generation, i.e., $p(r_1, r_2, \ldots, r_K) = \prod_k p(r_k \mid r_1, r_2, \ldots, r_{k-1})$.
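As a side-by-side illustration, here is a compact sketch of this next-scale residual quantization (our reading of [1], reusing the `quantize` helper and imports from the sketch above; names and details are simplified):

```python
def var_style_tokens(f, codebook, sizes=((2, 2), (4, 4), (8, 8))):
    """f: (B, D, H, W) single VAE feature map. Returns tokens r_1..r_K from low
    to high resolution; each scale quantizes what the previous scales missed,
    and there is no fusion between the scales."""
    residual, tokens = f, []
    for hw in sizes:
        z = F.interpolate(residual, size=hw, mode="area")    # downsample residual
        r = quantize(z, codebook)                             # shared codebook
        tokens.append(r)
        residual = residual - F.interpolate(r, size=f.shape[-2:], mode="bilinear")
    return tokens  # later modeled autoregressively: p(r_k | r_1, ..., r_{k-1})
```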

[SPAE] interpolates one VAE intermediate representation and yields multiple scales of VAE representations, but employs naive element-wise-average to integrate those scales into one. Specifically,

(1) The input $I$ is VAE encoded once, yielding one single intermediate representation $Z$;

(2) The authors conduct spatial interpolation on it, yielding $\{Z_l\}$, and use the pretrained multi-modal foundation model CLIP to residually quantize them under some prompt, producing a sequence of lexical tokens $\{\hat{Z}_l\}$, from low to high resolutions;

(3) The authors compute a streaming average of the first $l$ scales into one representation. There is no fusion among the scales; instead, all scales are simply mixed together;

(4) Finally the mixed representation is VAE decoded to produce the reconstruction or image generation.
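In symbols, the streaming average of the first $l$ scales can be written as (our notation for illustration; $\mathrm{up}(\cdot)$ resizes each token map to a common resolution):

$$\bar{Z}_l = \frac{1}{l}\sum_{k=1}^{l} \mathrm{up}\big(\hat{Z}_k\big),$$

so every position mixes all scales equally, in contrast to the selective, codebook-matching-based fusion in MSF.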

Summary

Whether with a VAE or not, it is not difficult to build multiple scales of representations; outside of VAE, there are also many techniques to fuse multiple scales. However, our MSF is the first to both build multiple scales upon VAE and realize fusion among those scales, using unique designs based on codebook matching.

Reference

[1] Tian et al. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS 2024.

[2] Yu et al. SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs. NeurIPS 2023.

Official Review
Rating: 6

This paper addresses Object-Centric Learning (OCL), which aims to capture comprehensive object information by reconstructing inputs using intermediate representations from a Variational Autoencoder (VAE). The focus is on multi-scale training, acknowledging that objects may appear at various scales in images or videos due to changes in imaging distance or intrinsic size differences. The authors propose Multi-Scale Fusion (MSF) applied to VAE representations to guide OCL for both transformer-based approaches and state-of-the-art diffusion models. The method demonstrates promising performance on datasets such as COCO, ClevrTex, and VOC.

Strengths

  1. The overall intuition and direction of the paper are good and deal with important problems in Object-Centric Learning.
  2. The multi-scale training makes a lot of sense, since dealing with objects is an important problem that generally has not been fully solved in the field.
  3. The paper provides good analysis and shows the object separability of VAE guidance in Fig. 3. The quantitative results in Fig. 2 also look good.

Weaknesses

  1. Results on real-world datasets, especially COCO in Table 1, are really incremental. It is not clear how much of the advantage comes from adding MSF.
  2. An analysis of varying object sizes would show scale understanding better than the overall IoU improvement.
  3. Is there any intuition as to why the value of n is 3? Can we do a similar experiment on OpenImages and see if this holds true across datasets? For OpenImages, if it's easier, you can try using the smaller subset of OpenImages curated in this paper [1].
  4. Discussion on the area of unsupervised semantic segmentation and object detection using SSL methods should be added. Papers like [1, 2, 3] should be discussed. References: [1] Hyperbolic Contrastive Learning for Visual Representations beyond Objects, CVPR 2023; [2] VideoCutLER; [3] MOST: Multiple Object Localization with Self-supervised Transformers for Object Discovery.

Questions

Overall the paper is good, but the results, especially on COCO, seem a bit incremental.

Comment

Thank you for your feedback.

Weakness 1

Results on real world datasets especially COCO in Table 1 are incremental. Not really clear how much advantage MSF adds.

We would like to clarify possible misunderstandings.

Techniques OGDR and MSF in Tab. 1 are added to SLATE separately, i.e., the table contains results for SLATE+OGDR and SLATE+MSF, not SLATE+OGDR+MSF together. Therefore, MSF improves SLATE by 5-7 points in both ARI and ARIfg, that is, ARI 24.18 → 30.95 and ARIfg 24.54 → 30.47.

Besides, the methods we cover are all strong baselines, and any improvement can be challenging. We exclude earlier methods like IODINE, MONet, SA and SAVi due to their lower accuracy. In OCL literature, ARI and ARIfg (2nd and 3rd columns in those tables) are widely adopted metrics, revealing more significant differences among methods.

Weakness 2

Analysis of varying object sizes can show results on scale understanding better than overall IoU improvement.

We follow COCO's small/medium/large size splits to evaluate our MSF's performance, as shown in the table below. MSF shows more improvement on small and medium objects. This demonstrates that the fusion among multiple scales effectively improves different-sized object representations.

Table 1. How MSF performs on different-sized objects. Dataset is COCO instance segmentation.

| | mIoU_S | mIoU_M | mIoU_L |
|---|---|---|---|
| SLATE | 8.57 | 26.65 | 34.57 |
| SLATE+MSF | 12.63 | 28.14 | 34.83 |

Please note that the related content has been added to Sect. "A.2 Extended Experiments" in our paper's new version.
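For reference, the size buckets follow the standard COCO convention (small: area < 32² pixels; medium: 32²-96²; large: > 96²). Below is an illustrative sketch of how per-bucket IoU can be accumulated; it is not our exact evaluation script.

```python
import numpy as np

def size_bucket(area):
    if area < 32 ** 2:
        return "S"
    return "M" if area < 96 ** 2 else "L"

def add_object_iou(gt_mask, pred_mask, per_bucket):
    """gt_mask, pred_mask: boolean (H, W) masks of one matched object."""
    union = np.logical_or(gt_mask, pred_mask).sum()
    iou = np.logical_and(gt_mask, pred_mask).sum() / union if union else 0.0
    per_bucket.setdefault(size_bucket(gt_mask.sum()), []).append(iou)

# mIoU_S / mIoU_M / mIoU_L are then the means of per_bucket["S"], ["M"], ["L"].
```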

Weakness 3

Is there any intuition why the value of n is 3? Can we do a similar experiment on OpenImages subset [1] and see if this holds true across datasets?

We use input resolutions 128 and 256 to evaluate on the OpenImages subset. Under resolution 128, we use n=3 and 4; under resolution 256, we use n=3, 4 and 5. Results are shown below. We use ARI+ARIfg as the metric because ARI mostly reflects how well the background is segmented while ARIfg only measures foreground objects. According to the results, n=3 is the best choice under resolution 128 and n=4 is the best choice under resolution 256; the best n value is thus stable across datasets.

Table 2. Effects of input resolution and number of scales n. Model is SLATE+MSF; dataset is the OpenImages subset.

| resolution | 128 | 128 | 256 | 256 | 256 |
|---|---|---|---|---|---|
| n | 3 | 4 | 3 | 4 | 5 |
| ARI+ARIfg | 62.07 | 60.93 | 64.82 | 67.58 | 62.75 |

Please note that the related content has been added to Sect. "A.2 Extended Experiments" of our paper's new version.

Weakness 4

Discussion on the area of unsupervised semantic segmentation and object detection using SSL methods should be added. Papers like [1,2,3] should be discussed.

Firstly, the relationships among SSL segmentation (SSLseg), OCL, and World Models (WMs) are as follows. (1) SSLseg focuses on extracting segmentation masks; (2) OCL represents each object as a feature vector, with segmentation masks as byproducts, partly overlapping with SSLseg; (3) WMs, built upon OCL, address downstream tasks like visual reasoning, planning and decision-making.

In brief, OCL and SSLseg are designed for different purposes. OCL can directly support WMs on visual tasks like reasoning, planning and decision-making; in contrast, SSLseg needs an intermediate step like OCL to support WMs on those advanced vision tasks.

Please see Fig. 4 in our paper's new version, which provides an intuitive picture of these relationships.

Since SSLseg and OCL are designed for different purposes, direct comparisons are not commonly discussed in the OCL literature; we follow the evaluation protocols established by [4, 5, 6].

Please note that we have added the related content to our paper's new version.

Reference

[1] Ge et al. Hyperbolic Contrastive Learning for Visual Representations beyond Objects. CVPR2023.

[2] Wang et al. VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation. CVPR 2024.

[3] Rambhatla et al. MOST: Multiple Object Localization with Self-Supervised Transformers for Object Discovery. ICCV 2023.

[4] Locatello et al. Object-Centric Learning with Slot Attention. NeurIPS 2020.

[5] Singh et al. Illiterate DALL-E Learns to Compose. ICLR 2022.

[6] Kipf et al. Conditional Object-Centric Learning from Video. ICLR 2022.

Official Review
Rating: 8

The paper proposes multi-scale fusion (MSF) for object-centric learning. Specifically, MSF extracts VAE representations corresponding to the same input, downsampled to varying resolutions. The output representations are then upsampled and downsampled as necessary and then combined as a way to augment each set of features. The augmented features can then be fed as input to the general OCL decoder (along with the slot attention outputs and the matching entries from the discrete codebook). This intends to allow for better, more separable, semantically distinct representations of objects at varying scales. The quality is demonstrated with varying metrics corresponding to unsupervised segmentation on the masks that are generated as a byproduct of this OCL pipeline.

Strengths

[S1] The method seems to consistently outperform competitors for the OCL benchmarks, and is even somewhat competitive with the foundation model baseline (DINOSAUR).

[S2] The approach is well-motivated.

[S3] Figure 3 effectively demonstrates, qualitatively, how the MSF accomplishes better OCL, that is, features that are better connected to the objects themselves.

Weaknesses

[W1] The impact of the work seems limited. The ideas of multi-scale representation and fusion themselves are not novel (see, for example, FPN in the object detection literature), but the implementation and application to this task are. However, the implementation neglects potentially more impactful redesigns of the primary encoder-decoder pipeline or the slot attention mechanism itself.

[W2] Additionally, there are no results for any downstream tasks. Thus, while interesting, it is hard to project the impact of the work beyond this niche.

Minor: The notation and its interaction with Figure 3 are unpleasant and difficult to follow. In particular, the differences between the scale-variant, scale-invariant, and pre-inter-scale-fusion representations are quite slight and seem somewhat arbitrary.

Minor: Algorithm 1 would fit in the main paper, and help add clarity in the flow.

Questions

What downstream tasks could this method help with? It doesn't seem to be a SAM alternative, so what is the potential practical impact?

What would change were this applied to higher-resolution images? MSF would seem to have some promise for small objects, but those are essentially all filtered out by the pre-processing.

Comment

Thank you for your positive feedback.

Weakness 1.1

The ideas of multi-scale are not novel (e.g., FPN), but implementation and application to OCL are.

We agree that multi-scale is a key challenge in computer vision.

Existing multi-scale methods (i) either employ image pyramids without information fusion among pyramid levels, so inter-scale information is wasted; or (ii) rely on feature pyramids (FPN and its numerous variants), where features of different layers are fused by channel-concat or element-wise-sum, mixing low- and high-quality representations together.

Although our main focus is on the OCL setting, our MSF is the first to enable multi-scale fusion on VAE representations. By leveraging codebook matching, we selectively fuse high-quality information among scales, rather than mixing them together. Our MSF is superior to channel-concat or element-wise-sum, as shown below.

Table 1. Effects of different fusion techniques among multiple scales. Model is SLATE; dataset is ClevrTex.

| | channel-concat | element-wise-sum | MSF |
|---|---|---|---|
| ARI+ARIfg | 92.63 | 89.18 | 100.71 |

Please note that the related content has been added to Sect. "A.1 Extended Related Work" of our paper's new version.
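For completeness, the two baselines in Table 1 can be sketched as follows (illustrative PyTorch with hypothetical module names); both mix all scales at every position, whereas MSF selects among them via codebook matching.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Channel-concat baseline: resize all scales, concatenate, project back to D."""
    def __init__(self, dim, num_scales):
        super().__init__()
        self.proj = nn.Conv2d(dim * num_scales, dim, kernel_size=1)

    def forward(self, feats):                 # feats: list of (B, D, h_n, w_n)
        size = feats[0].shape[-2:]
        up = [F.interpolate(f, size=size, mode="bilinear") for f in feats]
        return self.proj(torch.cat(up, dim=1))

def sum_fusion(feats):
    """Element-wise-sum baseline: resize the scales and add them up."""
    size = feats[0].shape[-2:]
    return torch.stack([F.interpolate(f, size=size, mode="bilinear")
                        for f in feats]).sum(dim=0)
```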

Weakness 1.2

The implementation neglects potentially more impactfully redesigns.

We are open to including other impactful OCL designs if specific examples are provided. Regarding earlier works like IODINE, MONet, SA, and SAVi, we omitted them due to their lower performance.

Weakness 2

No results for downstream tasks.

Downstream tasks of OCL (+MSF) include (1) scene understanding, e.g., VisualGenome; (2) visual reasoning, e.g., Clevrer VQA; (3) visual planning, e.g., PHYRE and Physion; and (4) visual decision-making for reinforcement learning agents, e.g., video game playing and robotic manipulations.

OCL literature usually does not cover these tasks. However, we provide results for visual planning in PHYRE with a SlotFormer [2] World Model (WM) below.

Table 2. Visual planning on PHYRE [1]. OCL extracts slots; WM infers upon slots. AUCCESS [2] is the metric, the larger the better.

| OCL | SLATE | SLATE+MSF |
|---|---|---|
| WM | SlotFormer | SlotFormer |
| AUCCESS | 84.73 | 89.95 |

Our MSF also has potential in visual generation, because we only modify the VAE, a key module of visual generation models like Stable Diffusion for images and Sora for videos. However, this is a different research field, beyond the scope of this paper.

Weakness 3

Minor: The notation and its interaction with Figure 3 is difficult to follow. The differences between scale-variant, scale-invariant, and representation before inter-scale fusion are quite slight.

We have improved Fig. 3 presentation, especially the figure caption, in our paper’s new version. Please check the new pdf file for details.

We are happy to improve it further; could you kindly provide specific examples?

Weakness 4

Minor: Algorithm 1 would fit in the main paper, and help add clarity in the flow.

Thank you for your kind advice. We have updated our paper’s new version accordingly.

Question 1

What downstream tasks could this method help with? It doesn't seem to be a SAM alternative, so what is the potential practical impact?

Segmentation models like SAM only extract masks, but cannot represent objects as corresponding feature vectors, i.e., slots. In contrast, OCL can do both.

With slots from OCL, an object-centric world model can be built for downstream tasks like visual understanding, reasoning, planning and decision-making [2]. Our MSF improves OCL, thus helps with all of these tasks.

Question 2

What would change were this applied to higher-resolution images? MSF would seem to have some promise for small objects.

We follow COCO's small/medium/large size splits to evaluate our MSF's performance. Results at resolutions 128x128 and 256x256 are shown below. Our MSF does show better performance on small objects, and switching to a higher resolution does not change this conclusion.

Table 3. How MSF performs on different-sized objects. Dataset is COCO instance segmentation.

resolution 128

| | mIoU_S | mIoU_M | mIoU_L |
|---|---|---|---|
| SLATE | 8.57 | 26.65 | 34.57 |
| SLATE+MSF | 12.63 | 28.14 | 34.83 |

resolution 256

| | mIoU_S | mIoU_M | mIoU_L |
|---|---|---|---|
| SLATE | 13.64 | 28.76 | 35.37 |
| SLATE+MSF | 16.04 | 29.98 | 35.88 |

Please note that the related content has been added to Sect. "A.2 Extended Experiments" of our paper's new version.

Reference

[1] Bakhtin et al. PHYRE: A New Benchmark for Physical Reasoning. arXiv:1908.05656.

[2] Wu et al. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. ICLR 2023.

Comment

The response to weakness 1.1 seems to be a rephrasing of my original statement. The ablation in the new table is helpful, but only in a minor sense, since it doesn't affect the original criticism. Regarding 1.2, this is a fair point, and my original concern was overly speculative. Consider it withdrawn.

For weakness 2, I appreciate the comparison to SlotFormer for planning. While I would prefer a stronger focus on such comparisons (as claims about downstream tasks are empty without the concrete evidence, very often "the devil is in the details"), this does help address my concern somewhat.

For weakness 3, the notation for scale-invariant representations, scale-variant representations, and representations before inter-scale fusion appear unchanged, and this was the root of my initial complaint. To be clear, I'm talking about the notation, not the visual content of the figure.

For question 1, see weakness 2.

For question 2, the answer and expanded results are helpful.

Comment

We thank all the reviewers for their insightful feedback and for responding to our rebuttal.

We have now colored all changes in blue in the paper. In the main content, the updates mainly save space and improve the presentation, while the appendix now contains the extended related work and experiments suggested by the reviewers.

For those reviewers who have not yet had a chance to respond to our rebuttal, please let us know before the paper revision deadline if you would like us to make further improvements to the paper.

AC Meta-Review

This paper explores Object-Centric Learning (OCL), which seeks to capture comprehensive object information by leveraging intermediate representations from a Variational Autoencoder (VAE) to reconstruct inputs. The approach emphasizes multi-scale training, recognizing that objects in images or videos may appear at different scales due to variations in imaging distance or intrinsic size disparities.

The average score of this paper is 6.25 (8, 6, 6, 5), which is above the borderline. After carefully checking the response during rebuttal period and reading the paper, I decide to give the acceptance recommendation.

Note that the authors are required to update/revise/polish their paper according to reviewers' suggestions in the final version.

Additional Comments on Reviewer Discussion

The average score of this paper is 6.25 (8, 6, 6, 5), which is above the borderline. Reviewer 8Piy points out that the obtained gains are small (1-2 points) but does not provide any further substantive review. This review should be ignored according to the policy of ICLR 2025.

After carefully checking the response during rebuttal period and reading the paper, I decide to give the acceptance recommendation.

Final Decision

Accept (Poster)