PaperHub
ICLR 2024 · Rejected
Rating: 5.3/10 average from 4 reviewers (scores 5, 6, 5, 5; min 5, max 6, std 0.4)
Confidence: 3.5 average

Explicitly Disentangled Representations in Object-Centric Learning

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-02-11


Keywords
object-centric representation learning, unsupervised learning, disentanglement, computer vision

Reviews and Discussion

Official Review
Rating: 5

The authors develop a network architecture to explicitly extract shape and texture information from images, thus extending an approach that extracts position, orientation, and scale. The architecture is specific to the problem at hand, removing texture information to help extract shape and using shape-derived masks to extract texture information. The benefit of the disentangled representation is demonstrated in both scene understanding tasks and image reconstruction tasks.

Strengths

  • Great results on image reconstruction tasks, good results on scene understanding tasks, good results on qualitative measures.
  • Improved latent representations are amenable to better interpretation.

Weaknesses

  • The architecture is explicitly designed for shape and texture and cannot be extended to other latent dimensions, thus limiting the significance of the results. This represents a significant effort which is ad hoc and hence should be reserved only for crucial latent dimensions; it is not clear that texture is one.
  • Texture decoding uses shape information, which reduces the applicability of the method outside the context of the current project.
  • Shape decoding uses a texture-removing filter, which may by itself be responsible for some of the benefits of the approach.

Questions

  • Why was rotation invariance removed from the experiments?
  • Are the new shape and texture dimensions disentangled from position and scale information? To what extent were position and scale varied in the experiments?
Comment

Thank you for reviewing our work and providing helpful feedback.

  • Concerning the first weakness point, we think that explicitly designing the architecture for shape and texture disentanglement does not prevent it from being extended to other latent dimensions. For example, by employing the Invariant Slot Attention mechanism instead of the base Slot Attention mechanism (as we did in our work), it is possible to extend the latent dimensions to scale and position components. It is also feasible to do the same for rotation features by including the rotation factors in the ISA mechanism. Moreover, we think that texture is indeed one of the main properties describing an object, along with others such as position, size, shape, and orientation. Furthermore, given a strategy to extract additional object information, it would be possible to concatenate it with the texture and shape representations before the decoding phase to exploit it.

  • For the second weakness point, we are not sure we understand its meaning. In fact, all SA-based models use both shape and texture information to decode the object textures: usually, they feed the complete slot vectors (encoding all the object information, texture and shape included) to the decoder. Could we please ask for clarification? Thank you very much in advance.

  • Regarding the last weakness pointed out, the Sobel filter is indeed responsible for part of the benefits of our approach. To analyze this, we decided to provide an ablation (Appendix F.2) on its impact in disentangling shape and texture. However, we do not consider this point to be a weakness, but a modeling choice. It would also be possible to replace the Sobel filter with a different filter, for instance, a more sophisticated one that could lead to improved performance.

  • To answer the first question, we did not use the rotation factors in DISA because, on Tetrominoes and CLEVR, ISA performed better without them. We extended this decision to the other datasets as well.

  • Finally, we address your last question by including a qualitative analysis of the position and scale disentanglement (Appendix F.1 Figure 13). The results suggest that this property is indeed successfully achieved. Regarding the extent to which position and scale were varied in the experiments: we trained on the complete training sets, where the objects can be found in approximately every location, while the scales are dataset-dependent. On Tetrominoes, a single fixed size is present, while Multi-dSprites has 6 values linearly spaced in [0.5, 1], where 1 represents a given default object size. CLEVR has 2 sizes (small and large) while CLEVRTex has 3 (small, medium, and large). However, for these two datasets, the proximity of an object to the camera affects its size in the image.

We hope to have clarified all your doubts and addressed your concerns.

Comment

Thank you for taking the time to improve the manuscript based on the comments.

My main worry is that making a network invariant to a new factor X is an ad-hoc process that requires an explicit architecture tailored to X (the first weakness in my comments) and would use specific information that is useful for achieving X-invariance (the next weaknesses in my comments). This is fine for your project but doesn't scale: it's highly unlikely that in ten years we'll get an architecture invariant to 10 different factors using this approach.

Comment

Thank you for the clarification.

We understand your concern expressed in "making a network invariant to a new factor X is an ad-hoc process which requires explicit architecture which is tailored for X...". However, we would like to respectfully point out that by following this argument to reject a paper, prior valuable contributions (such as [1][2][3]) could have been dismissed as well. In our opinion, employing these inductive biases to explicitly disentangle certain object properties does not imply that all other properties should be disentangled following the same (explicit) approach. For instance, to disentangle the sub-properties of a texture (color, material, ...) without assuming they are known a priori, an interesting direction would be to couple DISA with some hypothetical method aiming to separate generic factors of variation (e.g., beta-VAE in the context of probabilistic models). Therefore, we think this work can still provide a valuable contribution toward more interpretable, structured, and robust representations, without precluding the possibility of combining it with non-explicit disentanglement methods targeting different properties.

[1] Disentangling 3D Prototypical Networks For Few-Shot Concept Learning, https://arxiv.org/abs/2011.03367

[2] Unsupervised Part-Based Disentangling of Object Shape and Appearance, https://arxiv.org/abs/1903.06946

[3] Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames, https://arxiv.org/abs/2302.04973

Official Review
Rating: 6

Edit: I believe that the revised paper is substantially improved due to the more thorough ablations and experiments, so I have raised my score to 6.

The authors propose to explicitly disentangle texture and shape information in slot attention by structuring the architecture in a specific way that encourages this. The image is passed through an edge detection filter to remove part of the texture information, which is then used to predict only the per-slot masks. The texture model takes the normal image as input and predicts the usual per-slot images, conditioned on the information from the shape encoder. Multiplying the two results in the usual slot attention reconstruction.
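For illustration, the following is a minimal sketch (not taken from the paper; tensor shapes and names are assumed) of the compositing step described above, where per-slot masks from the shape branch weight per-slot textures to form the reconstruction:

```python
# Minimal sketch of the mask/texture compositing described above.
# Not the authors' code; tensor shapes and names are assumed for illustration.
import numpy as np

def composite(mask_logits, textures):
    """mask_logits: (K, H, W) per-slot mask logits from the shape branch.
    textures:    (K, H, W, 3) per-slot RGB predictions from the texture branch."""
    # Softmax over the slot axis so the masks sum to one at every pixel.
    m = np.exp(mask_logits - mask_logits.max(axis=0, keepdims=True))
    masks = m / m.sum(axis=0, keepdims=True)           # (K, H, W)
    # Weighted sum of per-slot textures gives the reconstructed image.
    return (masks[..., None] * textures).sum(axis=0)   # (H, W, 3)

K, H, W = 4, 32, 32
recon = composite(np.random.randn(K, H, W), np.random.rand(K, H, W, 3))
print(recon.shape)  # (32, 32, 3)
```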

Strengths

The motivation for the variety of modeling decisions is expressed clearly. The experimental evaluation is thorough, with many tests to evaluate specific contributions of the method. I especially like the disentanglement analysis experiments that confirm what type of information is represented by which encoder. It is encouraging to see that this decomposition appears to even improve results, a difference from many disentanglement methods, which limit capacity instead.

Weaknesses

The main weakness for me is the lack of consideration for applicability to real-world data. All the experiments are performed on tasks where a simple notion of texture and shape is possible. This is especially a concern for approaches like DINOSAUR, where a simple Sobel filter will likely not work on the higher-dimensional features. I am therefore worried that the proposed approach is only useful in the toy examples studied, but not in more realistic use cases.

The approach proposed in the paper only biases, rather than enforces, the two models to focus on shape and texture information separately. Even on the only slightly more complex CLEVR dataset, we already see that the disentanglement results degrade (end of Section 5.2).

Questions

  • It would be good to see ablation tests on exactly how much of an impact the quality of the edge filter has (for example, how well does it still disentangle if you don't use any edge filter at all?), or an experiment that uses slightly more complex textures (CLEVRTex, for example). For me, this is the most critical point since it pertains to how well we can expect it to work with realistic data; if it can be addressed I am willing to change my score to an accept.
  • In the generative experiments, what happens when you sample shapes instead of textures?
  • It is not clear to me if basing the approach on the ISA approach instead of base SA is necessary at all, or simply a modeling choice.
Comment

Thank you for your helpful review and for raising relevant concerns.

  • As for the first question, we agree that a study regarding the impact of the Sobel filter on the disentanglement of texture and shape is valuable. Therefore, we included it in Appendix F.2. In that section, we also similarly analyze the importance of the variance regularization on the disentanglement. Since we also concur with the importance of experimenting with a dataset with more complex textures, we trained on CLEVRTex and presented the results in the updated version of the paper.

  • To answer the second question, we included shape generation results in Appendix F.4, Figure 34. It is important to note that this capability is strongly limited by the nature of the datasets we trained on, as all of them present very little variability in terms of object shapes.

  • Regarding the last question, basing DISA on ISA instead of SA is not strictly necessary. When using SA instead of ISA, position and scale information get included in the texture and/or shape components, without preventing a quantitative study of the texture and shape disentanglement. However, there are two main reasons for choosing to employ ISA: (1) it enables a more efficient training process, and (2) it allows disentangling position and scale. In particular, the latter helps conduct the compositional and generative experiments. Without this property, during the texture transfer experiments, we might also transfer position and scale information between objects (which is undesired). Similarly, we might alter the position and scale of objects while generating new shapes.

We hope to have thoroughly addressed your concerns.

Comment

I am happy with the revised version of the paper; it is good to see the new ablations and the CLEVRTex experiment. It appears that CLEVRTex is still difficult, but I am okay with accepting that as the current state of research; the method does not need to be perfect at everything to be useful.

Comment

Thank you very much for the positive response to the revised paper.

CLEVRTex is indeed still difficult to tackle, with both DISA and its baselines. We are confident that future works will be able to address the remaining problems and obtain precise masks, reconstructions, and disentanglement even on such complex datasets.

Please let us know if you have any additional questions or concerns, as we would be happy to answer.

Thank you again for your time.

Official Review
Rating: 5

This work proposes an extension to the previous Slot Attention and the subsequent Invariant Slot Attention algorithms for object discovery and representation factorization. Specifically, the proposed formulation incorporates the explicit goal of disentangling shape and texture into separate latent components. This is achieved by constructing two streams of data, one specifically devoid of texture/color information through the application of a Sobel filter. The feature representation from this filtered stream is considered to be shape-only, with the assumption that the representation learned from the remaining stream should encompass everything else (textures and colors).

Strengths

  • It's a bit surprising to see that a weak application of the equivariant constraints still works sufficiently well here.
  • The writing was generally clear, especially with regard to the method's relation to Slot Attention and Invariant Slot Attention.

Weaknesses

  • Works such as Lorenz et al. seem more closely related to this work than stated in the related works section. Specifically, the task of unsupervised landmark detection is nearly identical to the object-centric learning framework. Furthermore, both works are based on the same two-stream information-bottleneck setup, where some transformation is applied to ensure that either shape or texture information is only available in one stream but not the other. I think it is necessary for the authors to elaborate further in the paper on the relation to the works in that section.
  • I think one of the strengths of the prior work ISA was the incorporation of more realistic settings such as Objects Room, CLEVRTex, and WaymoOpen. However, the results in this work seem limited to simpler synthetic datasets with only very smooth, almost mono-color textures. I think it would be necessary to demonstrate improved results on at least some of those harder datasets to convincingly show that the solution proposed here is a step in the direction of real-world applicability.

Questions

  • Are there known limitations to using the Sobel filter? How does the Sobel filter setup perform on high-frequency texture patterns?
  • Can the authors also clarify the exact formulation of the variance regularization loss? Is the variance computed across dimensions, like layer norm, or across batch elements, like batch norm?
Comment

Thank you for your review and for raising valuable concerns.

  • We worked to address the weaknesses you identified. The new version of the paper now includes an additional section (Appendix C) in which we compare in more detail two approaches (already mentioned in the related work) with DISA. Specifically, you can find a more exhaustive discussion on Lorenz et al. [1] in the second paragraph.

  • We further included CLEVRTex in our experiments to evaluate the performance of DISA in a more complex and realistic setting. As reported in the updated version of the paper, DISA outperforms SA and ISA on object discovery (BG and FG ARI) and has very similar MSE to ISA. Since we detect objects solely on the filtered image while ISA does so on the original one, the ARI results suggest that, even on CLEVRTex, the Sobel filter does not pose a limitation in this regard.

  • Concerning your first question, there are some limitations to the Sobel filter, such as the production of thick edges. Another limitation is that, as it is designed to approximate horizontal and vertical edges, it can be inaccurate with diagonal ones (e.g., very high gradients). A more sophisticated filter could, of course, be beneficial to DISA. In Appendix A, you can find an explanation of the Sobel filter, including a visual example (Figure 6) of how the Sobel filter works on all the considered datasets (CLEVRTex included, to show it on high-frequency texture patterns).

  • Regarding your question about the variance regularization, we compute the variance across all the objects in the batch images. Precisely, for each slot component, we calculate its variance across all the slots in the batch (a small illustrative sketch of this computation is included after this list). This is mentioned below Equation 7 with the sentence "When the loss is computed over a batch of images, N_s becomes N_s times the number of images in the batch".
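To make the axis of this computation concrete, here is a small illustrative sketch; the dimension names are assumed, and the exact regularization term of Equation 7 is not reproduced here:

```python
# Illustrative only: the axis over which the variance is computed, as described above.
# Dimension names are assumed; the exact regularization of Equation 7 is not reproduced.
import numpy as np

batch_size, num_slots, slot_dim = 8, 5, 64
slots = np.random.randn(batch_size, num_slots, slot_dim)

# Pool slots from all images in the batch: N_s effectively becomes N_s * batch_size.
flat = slots.reshape(-1, slot_dim)          # (batch_size * num_slots, slot_dim)

# One variance value per slot component, taken across all slots in the batch.
per_component_variance = flat.var(axis=0)   # (slot_dim,)
print(per_component_variance.shape)         # (64,)
```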

We hope we addressed all your concerns.

[1] Unsupervised Part-Based Disentangling of Object Shape and Appearance, https://arxiv.org/pdf/1903.06946.pdf

Comment

The current draft is significantly improved over the original submission, in particular through the inclusion of at least one more realistic dataset as well as further clarification of details.

With regard to the additional datasets, I would still have liked to see the other two datasets as well (WaymoOpen and ObjectsRoom) -- the primary reason being concern over the possibility that we may be over-fitting our solutions to synthetic datasets of simple 2D and 3D shapes. The FG- and BG-ARI results on CLEVRTex are a nice result, but the MSE suggests to me that, at the very least, we cannot conclude we have better disentanglement than in ISA. Further, it's still not clear to me whether the FG/BG-ARI performance will continue to outperform that of baseline approaches as the images get increasingly more realistic.

Comment

Thanks for your response to our comment.

Regarding your concern about the absence of experiments on WaymoOpen and ObjectsRoom: we honestly believe that, with limited resources, including all the datasets you mentioned is infeasible within a 12-day rebuttal. Experimenting on CLEVRTex has already been a significant effort, considering 3 resource-consuming repetitions for each of the studied models (SA, ISA, and DISA) with 11 slots each. Moreover, CLEVRTex is currently one of the most complex benchmarks in object-centric learning, and we achieved results aligned with ISA (with a CNN backbone and trained for 500K steps in the original paper) after only 150K training steps.

Concerning your sentence "The FG- and BG-ARI results on CLEVRTex are a nice result, but the MSE suggests to me that, at the very least, we cannot conclude we have better disentanglement than in ISA.", we find ourselves in disagreement. Looking at the disentanglement results, both quantitative and qualitative, it is clear that DISA achieves some degree of texture and shape disentanglement. For example, Figure 27 shows that we are able to swap the textures of a white cube and a grey cylinder. Even if the MSE is not perfect, and thus the textures are not as detailed as in the input image, we would not be able to obtain an identical grey cube and a white cylinder without some extent of disentanglement. At the same time, as shown in Figure 13, we maintain ISA's position and scale disentanglement property. ISA, on the contrary, does not work towards the disentanglement of texture and shape and thus, like the base SA, does not disentangle those factors (we actually do not even know where those factors could be encoded in the latent space). We can therefore quite confidently assume that we have an overall better disentanglement than ISA. Finally, the marginally worse MSE of DISA compared to ISA on CLEVRTex is most probably not very meaningful. In fact, for simplicity, we kept the same division of texture and shape features (32 and 32) on all datasets. However, as the CLEVRTex textures are far more complex, they could require a larger number of components to match or surpass ISA in reconstruction quality (as on the other datasets). This experiment could have been carried out, but, as stated in the paper, "Note that this work seeks to achieve the desired disentanglement within the latent space of DISA rather than focusing on obtaining state-of-the-art results in unsupervised object discovery and reconstruction quality." Our goal is not to obtain SOTA results, and hence we considered this experiment to be of marginal importance.

Finally, as for the last point, we of course cannot say with absolute certainty that DISA would keep outperforming the baselines on increasingly more complex datasets. However, the additional experiment on CLEVRTex seems to suggest that we may observe such a trend. In fact, the FG-ARI improvement of DISA over SA and ISA is significantly larger on CLEVRTex than on CLEVR6 (23% and 11% vs. 5% and 3%).

We hope to have addressed all your concerns. Let us know if you have any other questions.

Comment

Thanks for pointing out the additional results in Figure 13! I will take that into account when making the final rating.

Official Review
Rating: 5

The paper proposes to explicitly disentangle the shape and texture of objects while training with an unsupervised autoencoding loss. The paper shows that this disentanglement can achieve competitive performance at object discovery while achieving significantly higher performance on image reconstruction. Further, the paper shows that its representations are indeed disentangled by doing a property prediction task.

Strengths

  • the motivation of disentangling shape and texture seems promising.
  • the paper proposes a unique approach for disentangling texture from shape using a Sobel filter.
  • the paper has good and dense comparisons in its experiments section.
  • the paper does a good job with presentation and figures.

Weaknesses

  • the paper motivates the introduction with better representations, generalization, and downstream transfer from having an explicitly disentangled representation; however, it doesn't compare with ISA or SA on those metrics.
  • The paper relies on the Sobel filter to achieve this disentanglement but doesn't do a good job of explaining how exactly the Sobel filter works and is able to remove the texture. Is the Sobel filter limited to CLEVR images for removing texture, or can it scale to COCO-like images?
  • The paper says: "However, to the best of our knowledge, no research has been carried out on the explicit disentanglement of the texture and shape dimensions in object-centric learning,". I don't think this is true, as there is work such as D3DP (https://arxiv.org/abs/2011.03367) that does disentangle shape and texture explicitly in an object-centric manner. It uses adaptive instance normalization instead of a Sobel filter.

Questions

  • How does the paper compare against ISA or SA in terms of Figure 2 or other downstream tasks such as the one considered in this paper: https://arxiv.org/pdf/2305.11281.pdf?
  • What is the intuitive idea and the math behind the Sobel filter? Would such a Sobel filter generalize to real-world textures like those seen in COCO?
  • I think the authors should discuss D3DP in their paper. Also, compare against AdaIN as a way of disentangling shape and texture instead of using Sobel filters.

I'm happy to change my review if the authors address my concerns.

Comment

Thank you for your insightful review and for raising fair concerns.

  • Regarding the first question and weakness, we indeed think that generalization and downstream transfer are the principal motivations for investigating this direction. However, we do not study the benefits of using the representations learned by our model on downstream tasks as we think it would require a separate paper. In fact, to answer such a research question, we cannot rely on a single experiment such as the property prediction used in our work. We think it would be necessary to conduct an extensive study on more diverse tasks than simply property prediction, including for instance visual question answering and RL-related problems. In our opinion, this study should not be simply included in the appendix of a paper but be the core of a distinct one.

  • To address the second question, we included a dedicated section in the appendix (Appendix A). There, we present the math and the intuitive idea behind the Sobel filter, as well as examples of its application to all the considered datasets (for reference, a minimal sketch of the standard Sobel operator is included after this list). We also included the more realistic dataset CLEVRTex, as we additionally trained on it and presented the results in the updated version of the paper.

  • Concerning the last question and weakness pointed out, we indeed missed D3DP, which tackles the explicit disentanglement of shape and texture in an object-centric manner. However, they rely on ground-truth bounding boxes to decompose a scene into objects. Our method, instead, is completely unsupervised. Therefore, we changed that sentence to "However, to the best of our knowledge, no research has been carried out on the explicit disentanglement of the texture and shape dimensions in unsupervised object-centric learning, especially with non-probabilistic models.". Moreover, we included an additional section (Appendix C) in which, in the first paragraph, we compare DISA with D3DP. Finally, as for the comparison with AdaIN, we agree that it would be interesting to conduct an experiment using it when decoding textures. However, we unfortunately decided not to pursue it in favor of using our reserved computational resources for experimenting on CLEVRTex (which is extremely resource-consuming) and providing an ablation study on the importance of the variance regularization and the Sobel filter for the disentanglement.
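For reference, the following is a minimal sketch of the standard Sobel edge-magnitude computation mentioned above; it is not the paper's exact implementation, and the grayscale input and normalization are assumed:

```python
# Standard Sobel edge-magnitude computation (illustrative; not the paper's exact code).
import numpy as np
from scipy.signal import convolve2d

Kx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)  # approximates the horizontal gradient
Ky = Kx.T                                 # approximates the vertical gradient

def sobel_magnitude(gray):
    """gray: (H, W) grayscale image -> per-pixel edge magnitude."""
    gx = convolve2d(gray, Kx, mode="same", boundary="symm")
    gy = convolve2d(gray, Ky, mode="same", boundary="symm")
    # Large where intensity changes sharply (object contours), small inside
    # smooth regions, which is why it removes most appearance information.
    return np.sqrt(gx ** 2 + gy ** 2)

edges = sobel_magnitude(np.random.rand(64, 64))
print(edges.shape)  # (64, 64)
```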

We hope we were clear and exhaustive with our answers and additional content in the paper.

Comment

Dear Reviewer Acn7,

Thank you again for your constructive comment.

As we are very close to the end of the discussion period, we would like to ask if all the concerns have been addressed and, in that case, to kindly remind you to adjust the rating accordingly (if needed). Otherwise, if you have additional questions, we are happy to answer.

Thank you very much in advance.

Comment

I went through the rebuttal, however I think in the current form, the paper is not ready for acceptance.

I think the paper lacks (as I have mentioned in my initial response):

(i) Analysis of the scalability of the Sobel filter (comparison to other ways of disentangling shape from texture, e.g., AdaIN)

(ii) Downstream comparison w.r.t. other object-centric learning methods such as Slot Attention and ISA

Without these experiments, the usefulness of the proposed method is unclear to me. The authors in the rebuttal think the comparison with these approaches deserves a separate paper; however, I disagree.

Comment

We would like to thank all the reviewers for dedicating time to our work and providing valuable feedback.

We uploaded a revised version of the paper. Here is a summary of the changes we included:

  • Training and evaluation on CLEVRTex;
  • Ablation on the impact of variance regularization and Sobel filter on texture and shape disentanglement (Appendix F.2 Figure 14);
  • Section dedicated to the Sobel filter math, intuition, and application on the considered datasets (Appendix A);
  • Additional discussion on two closely related works (Appendix C);
  • Shape generation results (Appendix F.4 Figure 35);
  • Qualitative results for position and scale disentanglement (Appendix F.1 Figure 13).

Thank you again.

Comment

Dear AC, dear reviewers,

We would like to kindly express our concern over the fact that, as we are entering the last 24 hours of rebuttal, two reviewers out of four are still inactive. As we worked hard to address your concerns in the past two weeks, it would be unfortunate to miss the chance to receive helpful feedback on it and improve the manuscript. Therefore, with this message, we gently ask reviewers Acn7 and Vx4x to engage in the discussion so we can make the best out of this last day of rebuttal.

Thank you very much in advance.

AC Meta-Review

This paper builds on recent work for unsupervised scene decomposition (Invariant Slot Attention) and proposes an extension that explicitly disentangles texture and shape of individual scene elements (e.g. objects). Benefits are reported on (unsupervised) segmentation performance, reconstruction accuracy and on a property readout task on synthetically rendered multi-object datasets.

This paper is well-integrated into a line of work that has recently attracted a lot of attention in the community (unsupervised scene decomposition / object-centric learning). Integration of ideas from the disentanglement literature is novel and interesting, and the reviewers appreciate that the paper is overall well-written and that the proposed approach is creative.

Several concerns were highlighted by the reviewers: first and foremost, it is unclear how the method would generalize beyond synthetic data scenarios. While CLEVRTex is a good starting point as it contains textures, the field has moved far beyond these datasets, and some form of validation on real-world data should be expected these days, especially to confirm that the proposed scheme does not rely on specific intricacies of synthetic datasets.

Other concerns revolved around the generalizability/scalability of the Sobel filter utilized in this work and coverage of related work. These concerns were partially addressed in the rebuttal.

Overall, I believe that this paper currently does not meet the bar for acceptance at ICLR, but I encourage the authors to take the reviewer feedback into consideration when preparing a revised submission of this work to a future conference or journal.

To improve the paper, I would further recommend utilizing stronger vision backbones (instead of small CNNs), which would already improve the CLEVRTex results significantly (see results in Invariant Slot Attention). It is currently unclear whether the observed benefits would transfer to stronger vision backbones.

Why not a higher score

Insufficient experimental evaluation.

Why not a lower score

N/A

Final Decision

Reject