c:["$","div",null,{"className":"container py-8 max-w-6xl mx-auto","children":["$","$e",null,{"fallback":null,"children":["$","$L16",null,{"paper":{"id":"KgN0mo6pLo","title":"Compositional Scene Modeling with An Object-Centric Diffusion Transformer","abstract":"$17","keywords":["Object-Centric Representation Learning","Unsupervised Learning","Compositional Scene Modeling","Diffusion Models","Generative Models"],"primary_area":"unsupervised, self-supervised, semi-supervised, and supervised representation learning","venue":"Submitted to ICLR 2025","conference":"ICLR","year":2025,"status":"rejected","is_accepted":false,"avg_rating":4.75,"avg_rating_normalized":4.75,"rating_min":3,"rating_max":6,"rating_std":1.08972,"review_count":4,"comment_count":8,"creation_date":"2024-09-25","modification_date":"2025-02-05","forum_link":"https://openreview.net/forum?id=KgN0mo6pLo","pdf_link":"https://openreview.net/pdf?id=KgN0mo6pLo","arxiv_id":null,"arxiv_url":null,"arxiv_match_method":null,"arxiv_matched_at":null,"tldr":"","created_at":"2026-01-21T12:25:36.965508+00:00","updated_at":"2026-04-22T06:52:52.190672+00:00","authors":[{"id":"~Zhimeng_Shen2","name":"Zhimeng Shen","openreview_id":"~Zhimeng_Shen2","position":0},{"id":"~Tonglin_Chen1","name":"Tonglin Chen","openreview_id":"~Tonglin_Chen1","position":1},{"id":"~Bin_Li4","name":"Bin Li","openreview_id":"~Bin_Li4","position":2},{"id":"~Xiangyang_Xue2","name":"Xiangyang Xue","openreview_id":"~Xiangyang_Xue2","position":3}]},"stats":{"ratings":[{"id":"goyFKyMQgK","value":5,"confidence":5},{"id":"vJp0GdGwFP","value":6,"confidence":4},{"id":"49BaWEfqCX","value":3,"confidence":3},{"id":"iO1Fqd6vXg","value":5,"confidence":3}],"avg_rating":4.75,"rating_min":3,"rating_max":6,"rating_std":1.2583057392117916,"detailed_scores":{"soundness":[2,2,3,3],"contribution":[2,2,2,2],"presentation":[3,2,2,2],"originality":[],"quality":[],"clarity":[],"significance":[]}},"commentTree":[{"id":"goyFKyMQgK","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":1,"type":"Official_Review","role":"reviewer","rating":5,"confidence":5,"soundness":2,"contribution":2,"presentation":3,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":5,"summary":"This paper proposes CODiT, a DiT-based model for object-centric learning (OCL). CODiT adopts a post-decoder composition design, i.e., it runs a shared-weight DiT on each slot to predict a noise map and a mask in parallel. This is very different from recent works which use the pre-decoder composition (e.g., LSD, SlotDiffusion). In addition, the paper points out that such design is related to the classifier-free guidance (CFG) in diffusion model sampling. The authors conduct experiments on three synthetic and four real-world datasets to evaluate CODiT.","questions":"- The compositional generation results in Fig.7/8/9 are quite blurry compared to Fig.6/10. Can authors provide a clearer version? This is important as compositionality is a crucial aspect of OCL methods.\n- Line 365 says \"we use image-based SlotDiffusion for fair comparison\". What is the \"image-based\" version? SlotDiffusion always use a VQ-VAE and perform denoising in the latent space.","soundness":2,"strengths":"- As far as I know, this paper is the first to study individual object generation ability of OCL models. It surprising to see that prior works perform so badly in this aspect. It serves as a good motivation for this paper.\n- The analogy to CFG is a new perspective to OCL models, especially the fact that object masks enable different CFG values at different pixels. Indeed, this is very intuitive and reasonable in the context of object-centric representations.","confidence":5,"weaknesses":"- For the image reconstruction result in Sec. 4.4, why only reporting MSE? I think prior works also report LPIPS or FID, which are more aligned with human perception.\n- Another important aspect of OCL methods is the learned slot representations. While I would expect the representation quality to be good (given the single-object generation result), the paper should still show some results in this direction. For example, the object property prediction task in LSD, or the downstream VQA task in SlotDiffusion.\n- For the real-world image segmentation result in Appendix E, why is SlotDiffusion not included here? Its performance is clearly higher than CODiT + DINO. Also, please include COCO as it is also a standard benchmark in prior works.\n- The related work section is not comprehensive enough. OCL is a broad field, but this paper only has ~35 citations. Please refer to the related work section of LSD or SlotDiffusion to add more discussion about prior works.","contribution":2,"presentation":3,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-10-19T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[],"contentHtml":{"summary":"

This paper proposes CODiT, a DiT-based model for object-centric learning (OCL). CODiT adopts a post-decoder composition design, i.e., it runs a shared-weight DiT on each slot to predict a noise map and a mask in parallel. This is very different from recent works which use the pre-decoder composition (e.g., LSD, SlotDiffusion). In addition, the paper points out that such design is related to the classifier-free guidance (CFG) in diffusion model sampling. The authors conduct experiments on three synthetic and four real-world datasets to evaluate CODiT.

","questions":"

The compositional generation results in Fig.7/8/9 are quite blurry compared to Fig.6/10. Can authors provide a clearer version? This is important as compositionality is a crucial aspect of OCL methods.
Line 365 says \"we use image-based SlotDiffusion for fair comparison\". What is the \"image-based\" version? SlotDiffusion always use a VQ-VAE and perform denoising in the latent space.

","strengths":"

As far as I know, this paper is the first to study individual object generation ability of OCL models. It surprising to see that prior works perform so badly in this aspect. It serves as a good motivation for this paper.
The analogy to CFG is a new perspective to OCL models, especially the fact that object masks enable different CFG values at different pixels. Indeed, this is very intuitive and reasonable in the context of object-centric representations.

","weaknesses":"

For the image reconstruction result in Sec. 4.4, why only reporting MSE? I think prior works also report LPIPS or FID, which are more aligned with human perception.
Another important aspect of OCL methods is the learned slot representations. While I would expect the representation quality to be good (given the single-object generation result), the paper should still show some results in this direction. For example, the object property prediction task in LSD, or the downstream VQA task in SlotDiffusion.
For the real-world image segmentation result in Appendix E, why is SlotDiffusion not included here? Its performance is clearly higher than CODiT + DINO. Also, please include COCO as it is also a standard benchmark in prior works.
The related work section is not comprehensive enough. OCL is a broad field, but this paper only has ~35 citations. Please refer to the related work section of LSD or SlotDiffusion to add more discussion about prior works.

","code_of_conduct":"

Yes

"}},{"id":"vJp0GdGwFP","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":2,"type":"Official_Review","role":"reviewer","rating":6,"confidence":4,"soundness":2,"contribution":2,"presentation":2,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":6,"summary":"This paper introduces a new method to compose slot components which is different from common methods used in previous object-centric learning approaches. Traditionally, object-centric learning either uses separate decoders for individual slot components followed by mask-sum operation or a single diffusion decoder taking all slot components as conditionings to reconstruct input images. This paper combines the both and proposes to use separate diffusion decoders with shared parameters for individual slot components followed by mask-sum operation in the score (noise prediction) domain. The paper claims that their approach, CODiT, works better than traditional mask-sum approaches for complex scenes and outperforms traditional diffusion decoder approaches in terms of slot representation interpretability and segmentation performance.","questions":"Fig 6 actually shows some interesting results. Do you train a diffusion decoder or use a pretrained diffusion model like in LSD?","soundness":2,"strengths":"1. The paper points out an interesting finding from state-of-the-art approaches LSD [1] and SlotDiffusion [2], i.e., the decoded component of a single slot doesn't essentially correspond to a semantic local part (object or facial feature) of input image. As a result, the slot representations learned with these approaches lack interpretability for human understanding.\n\n2. The paper proposes a straightforward approach that can solve this problem while still allows preserving the powerful decoding ability of diffusion models.\n\n3. The paper provides extensive experiments to support the claims and show better segmentation and interpretability performance.\n\nReferences\n\n[1] Jiang, \"Object-Centric Slot Diffusion\", NeurIPS 2023\n\n[2] Wu, \"Slotdiffusion: Object-centric generative modeling with diffusion models\", NeurIPS 2023","confidence":4,"weaknesses":"$18","contribution":2,"presentation":2,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-10-30T00:00:00+00:00","modified_at":"2024-11-25T00:00:00+00:00","replies":[],"contentHtml":{"summary":"

This paper introduces a new method to compose slot components which is different from common methods used in previous object-centric learning approaches. Traditionally, object-centric learning either uses separate decoders for individual slot components followed by mask-sum operation or a single diffusion decoder taking all slot components as conditionings to reconstruct input images. This paper combines the both and proposes to use separate diffusion decoders with shared parameters for individual slot components followed by mask-sum operation in the score (noise prediction) domain. The paper claims that their approach, CODiT, works better than traditional mask-sum approaches for complex scenes and outperforms traditional diffusion decoder approaches in terms of slot representation interpretability and segmentation performance.

","questions":"

Fig 6 actually shows some interesting results. Do you train a diffusion decoder or use a pretrained diffusion model like in LSD?

","strengths":"

\n
The paper points out an interesting finding from state-of-the-art approaches LSD [1] and SlotDiffusion [2], i.e., the decoded component of a single slot doesn't essentially correspond to a semantic local part (object or facial feature) of input image. As a result, the slot representations learned with these approaches lack interpretability for human understanding.
\n
\n
The paper proposes a straightforward approach that can solve this problem while still allows preserving the powerful decoding ability of diffusion models.
\n
\n
The paper provides extensive experiments to support the claims and show better segmentation and interpretability performance.
\n

References

[1] Jiang, \"Object-Centric Slot Diffusion\", NeurIPS 2023

[2] Wu, \"Slotdiffusion: Object-centric generative modeling with diffusion models\", NeurIPS 2023

","weaknesses":"$19","code_of_conduct":"

Yes

"}},{"id":"49BaWEfqCX","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":3,"type":"Official_Review","role":"reviewer","rating":3,"confidence":3,"soundness":3,"contribution":2,"presentation":2,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":3,"summary":"The paper introduces CODiT, an object-centric learning framework that incorporates a post-decoder compositional diffusion network to improve the interpretability and generation capabilities of scene modeling. CODiT leverages a compositional denoising approach where individual object representations are denoised separately and integrated compositionally. This is differences from the existing pre-decoder methods that struggle with interpretability. The method demonstrates favorable performance in segmentation, reconstruction, and object-editing tasks.","questions":"### Comment\n\nThe main contributions that set CODiT apart from previous methods are not emphasized enough. Including a dedicated paragraph that explicitly contrasts CODiT with prior works would enhance readability and clarify its unique contributions. Currently, the advantages of CODiT are scattered across different sections, making it challenging to discern what specifically differentiates it from recent approaches.","soundness":3,"strengths":"-\tCODiT’s integration of compositional modeling within a diffusion framework show a meaningful improvement in object-centric learning.\n\n-\tHuman Intuition Alignment: The post-decoder compositional design aligns more closely with human visual scene processing.\n\n-\tThe model’s performance on evaluation datasets (CLEVRTEX, OCT-B, and FFHQ) validate the methods applicability in both synthetic and real data.","confidence":3,"weaknesses":"$1a","contribution":2,"presentation":2,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-04T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[{"id":"2l5lqxcBNa","paper_id":"KgN0mo6pLo","replyto":"49BaWEfqCX","number":11,"type":"Official_Comment","role":"mixed","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"title":"[ACTION NEEDED] Respond to author rebuttal","comment":"Dear Reviewer,\n\nNow that the authors have posted their rebuttal, please take a moment and check whether your concerns were addressed. At your earliest convenience, please post a response and update your review, at a minimum acknowledging that you have read your rebuttal.\n\nThank you,\n--Your AC"},"created_at":"2024-11-27T00:00:00+00:00","modified_at":"2024-11-27T00:00:00+00:00","replies":[],"contentHtml":{"title":"

[ACTION NEEDED] Respond to author rebuttal

","comment":"

Dear Reviewer,

Now that the authors have posted their rebuttal, please take a moment and check whether your concerns were addressed. At your earliest convenience, please post a response and update your review, at a minimum acknowledging that you have read your rebuttal.

Thank you,\n--Your AC

"}}],"contentHtml":{"summary":"

The paper introduces CODiT, an object-centric learning framework that incorporates a post-decoder compositional diffusion network to improve the interpretability and generation capabilities of scene modeling. CODiT leverages a compositional denoising approach where individual object representations are denoised separately and integrated compositionally. This is differences from the existing pre-decoder methods that struggle with interpretability. The method demonstrates favorable performance in segmentation, reconstruction, and object-editing tasks.

","questions":"

Comment

The main contributions that set CODiT apart from previous methods are not emphasized enough. Including a dedicated paragraph that explicitly contrasts CODiT with prior works would enhance readability and clarify its unique contributions. Currently, the advantages of CODiT are scattered across different sections, making it challenging to discern what specifically differentiates it from recent approaches.

","strengths":"

\n
CODiT’s integration of compositional modeling within a diffusion framework show a meaningful improvement in object-centric learning.
\n
\n
Human Intuition Alignment: The post-decoder compositional design aligns more closely with human visual scene processing.
\n
\n
The model’s performance on evaluation datasets (CLEVRTEX, OCT-B, and FFHQ) validate the methods applicability in both synthetic and real data.
\n

","weaknesses":"$1b","code_of_conduct":"

Yes

"}},{"id":"iO1Fqd6vXg","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":4,"type":"Official_Review","role":"reviewer","rating":5,"confidence":3,"soundness":3,"contribution":2,"presentation":2,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":5,"summary":"This paper presents a compositional approach for learning interpretable object representations, which can then be used for object editing in images or for unsupervised segmentation. The method is similar to SlotDiffusion, but CODiT explicitly models object masks during the diffusion denoising stage. Results show improvements compared to SlotDiffusion and LSD on object editing and unsupervised segmentation tasks, as well as compared to DINOSAUR (another recent method) on segmentation tasks.","questions":"1. I see prior methods have also reported results on COCO, which is likely to be more challenging than PASCAL. Have you considered reporting results on COCO, or would it be possible to report results on it?\n2. I’d request the authors to address my concerns in the weaknesses.\n3. In particular, I’d like to see a comparison to Kakogeorgious et al.","soundness":3,"strengths":"* I appreciate that the authors shared the model code in the supplementary (though I haven’t had a thorough look at it)\n* Figure 1 nicely lays out the difference between this approach and many prior approaches\n* Results look good compared to the baselines presented in the paper","confidence":3,"weaknesses":"$1c","contribution":2,"presentation":2,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-06T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[{"id":"Yr2hVAW61E","paper_id":"KgN0mo6pLo","replyto":"iO1Fqd6vXg","number":12,"type":"Official_Comment","role":"mixed","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"title":"[ACTION NEEDED] Respond to author rebuttal","comment":"Dear Reviewer,\n\nNow that the authors have posted their rebuttal, please take a moment and check whether your concerns were addressed. At your earliest convenience, please post a response and update your review, at a minimum acknowledging that you have read the rebuttal.\n\nThank you,\n--Your AC"},"created_at":"2024-11-27T00:00:00+00:00","modified_at":"2024-12-02T00:00:00+00:00","replies":[],"contentHtml":{"title":"

[ACTION NEEDED] Respond to author rebuttal

","comment":"

Dear Reviewer,

Thank you,\n--Your AC

"}}],"contentHtml":{"summary":"

This paper presents a compositional approach for learning interpretable object representations, which can then be used for object editing in images or for unsupervised segmentation. The method is similar to SlotDiffusion, but CODiT explicitly models object masks during the diffusion denoising stage. Results show improvements compared to SlotDiffusion and LSD on object editing and unsupervised segmentation tasks, as well as compared to DINOSAUR (another recent method) on segmentation tasks.

","questions":"

I see prior methods have also reported results on COCO, which is likely to be more challenging than PASCAL. Have you considered reporting results on COCO, or would it be possible to report results on it?
I’d request the authors to address my concerns in the weaknesses.
In particular, I’d like to see a comparison to Kakogeorgious et al.

","strengths":"

I appreciate that the authors shared the model code in the supplementary (though I haven’t had a thorough look at it)
Figure 1 nicely lays out the difference between this approach and many prior approaches
Results look good compared to the baselines presented in the paper

","weaknesses":"$1d","code_of_conduct":"

Yes

"}},{"id":"7Ha5HPtzXM","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":1,"type":"Meta_Review","role":"area_chair","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"metareview":"This paper proposes a novel decoder for object-centric models using a compositional diffusion approach. While the approach is novel, the reviewer consensus was that the paper is not ready for publication. The authors are encouraged to take the reviewer feedback into account should they prepare a resubmission at a future venue.","additional_comments_on_reviewer_discussion":"No reviewer was willing to champion the paper for acceptance."},"created_at":"2024-12-19T00:00:00+00:00","modified_at":"2025-02-05T00:00:00+00:00","replies":[],"contentHtml":{"metareview":"

This paper proposes a novel decoder for object-centric models using a compositional diffusion approach. While the approach is novel, the reviewer consensus was that the paper is not ready for publication. The authors are encouraged to take the reviewer feedback into account should they prepare a resubmission at a future venue.

","additional_comments_on_reviewer_discussion":"

No reviewer was willing to champion the paper for acceptance.

"}},{"id":"Z0Ag6JYzrz","paper_id":"KgN0mo6pLo","replyto":"KgN0mo6pLo","number":1,"type":"Decision","role":"program_chair","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"title":"Paper Decision","comment":"","decision":"Reject"},"created_at":"2025-01-22T00:00:00+00:00","modified_at":"2025-02-05T00:00:00+00:00","replies":[],"contentHtml":{"title":"

Paper Decision

","decision":"

Reject

"}}],"submissionHistory":[]}]}]}]