PaperHub
Rating: 5.5/10 (Poster; 4 reviewers; min 2, max 4, std 0.7)
Individual ratings: 4, 2, 3, 3
ICML 2025

Decomposition of Graphic Design with Unified Multimodal Model

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-08-11


Keywords
Layer Decomposition, Graphic Design, Unified Multimodal Model

Reviews and Discussion

Review
Rating: 4

The paper proposes a method for layerwise decomposition of graphic designs into RGBA images that can be stacked and composited to recompose the original image. The authors train a transparency-aware RGBA VQ-GAN encoder-decoder. The encoder is used to encode the input graphic-design images for an LLM-based decomposition/derendering model. The LLM predicts the JSON-encoded structure of each layer in the graphic design, directly outputting text content for text layers and VQ-GAN tokens for image layers. The VQ-GAN decoder translates these image tokens back into pixel space. Training of the LLM is primarily done on a proprietary dataset of 200k designs, but evaluation is done on the publicly available Crello dataset. Experiments show the model outperforming a simple (but creative!) baseline. A small number of qualitative results are provided, and ablation studies show the importance of a number of design decisions for the VQ-GAN.
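
To make the described representation concrete, here is a minimal sketch of what a decomposed layer record and the recomposition step might look like (the field names and the back-to-front alpha-over compositing are illustrative assumptions, not the paper's exact schema):

```python
import numpy as np

# Hypothetical per-layer records, loosely mirroring the JSON metadata described
# above: a bounding box, a layer type, and either text content or VQ-GAN token ids.
layers_meta = [
    {"type": "image", "bbox": [0, 0, 512, 512], "vq_tokens": [17, 403, 98]},  # background, tokens truncated
    {"type": "text", "bbox": [64, 40, 448, 120], "content": "SUMMER SALE", "font": "Roboto-Bold"},
]

def composite(layers_rgba):
    """Alpha-over compositing of decoded full-canvas RGBA layers, back to front.

    Each layer is a float array of shape (H, W, 4) with values in [0, 1];
    placing a cropped layer into its bounding box is omitted for brevity.
    """
    h, w, _ = layers_rgba[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    for layer in layers_rgba:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas
```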

Questions for the Authors

  • Is Eq. 1 correct? Should it be argmax rather than argmin, given that presumably we want to maximize the IoU over all possible permutations?
  • Related, where is Loc_\alpha in Table 2 defined? I assume it is the metric from Eq. 1, but that merely defines \hat{\alpha} and not Loc_\alpha.
  • Did the authors consider using an off-the-shelf layered graphic-design representation (e.g., SVG or HTML) rather than the custom RGBA-based representation used in the paper?
  • It was not 100% clear to me how ImageNet or LAION were used for training; presumably this was only for the VQ-GAN, as these datasets do not have the layered structure required for the LLM decomposition training. Could you confirm that my understanding is correct and that only the 200k-poster dataset was used for LLM training?

Claims and Evidence

See other sections, the claims are primarily empirically justified.

Methods and Evaluation Criteria

  • A single evaluation dataset (Crello) is used; additional datasets could have been added to the evaluation.
  • The metrics used for evaluation are rather limited. The FID score is useful but may be only somewhat relevant for this work, given that the underlying network embeddings were trained on ImageNet natural images, which are quite far from the structured graphic designs considered here. In particular, I would have liked to see more visual-similarity metrics. As the authors themselves point out, CLIP and DINO embeddings are both standard, and they are standard for evaluation as well: it would have been nice to report CLIP similarity scores and DINO embedding cosine similarity between the original and reconstructed images (see the sketch after this list). Even better would have been a small-scale human study in which participants rate which reconstruction (baseline or DeaM) is more faithful to the original.
  • Regarding the method: it was unclear to me why it was necessary to caption every layer in the LLM training dataset. IIUC the captions are not needed at inference time and are used solely to instruction-tune the model. But if instruction tuning is the use case for the captions, why are they required at the layer level? Presumably design-level instructions/captions, as the authors themselves present in Figure 6, would suffice.
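
As a concrete illustration of the embedding-similarity metrics suggested in the second point above, here is a minimal sketch assuming the Hugging Face transformers checkpoints openai/clip-vit-base-patch32 and facebook/dinov2-base (the checkpoint choices and function names are illustrative, not from the paper):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

def _cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b).item()

@torch.no_grad()
def embedding_similarities(original: Image.Image, reconstruction: Image.Image):
    # CLIP image-image cosine similarity.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = clip_proc(images=[original, reconstruction], return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    clip_sim = _cosine(feats[0:1], feats[1:2])

    # DINOv2 CLS-token cosine similarity.
    dino = AutoModel.from_pretrained("facebook/dinov2-base")
    dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
    inputs = dino_proc(images=[original, reconstruction], return_tensors="pt")
    cls_tokens = dino(**inputs).last_hidden_state[:, 0]
    dino_sim = _cosine(cls_tokens[0:1], cls_tokens[1:2])
    return clip_sim, dino_sim
```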

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design seems fine to me; however, as discussed above, the breadth of evaluation datasets and metrics is rather limited, and as discussed below, it may have been possible to include some other derendering baselines.

Supplementary Material

IIUC there was no supplementary material submitted with this work. This was quite surprising and disappointing, as the number of example images in the main paper is quite limited. I would have liked to see many more Crello examples decomposed and recomposed, highlighting success cases and failure modes, as well as a random sample of, e.g., 10 recompositions. In addition, while I understand that the training data may be proprietary, it seems reasonable that a small selection of training samples be provided for qualitative evaluation in the appendix.

Relation to Existing Literature

See below: Essential References Not Discussed.

Essential References Not Discussed

The paper is fundamentally about derendering, and there is quite a bit of existing work on derendering, both of graphic designs and more generally (e.g., charts, TikZ code), that should have been discussed. In addition, existing design-derendering methods could have been compared against and discussed, e.g., LIVE [https://arxiv.org/abs/2206.04655] and StarVector [https://arxiv.org/abs/2312.11556].

Other Strengths and Weaknesses

I would like to use this section to state that, despite the issues mentioned above, I really liked the paper. It is well written and on an interesting, underexplored topic. If the authors address several of my comments in the rebuttal phase, which seems very feasible, I would be happy to raise my score and advocate more strongly for the paper.

Other Comments or Suggestions

N/A

Author Response

Method Evaluation:

  1. Considering the current scarcity of publicly available poster datasets and the time constraints, we have for now added a test set based on our own dataset split, together with qualitative and quantitative experimental results.
  2. Regarding evaluation metrics: your suggestion is reasonable, and we have added a new similarity scoring metric here. We also recruited 10 volunteers for a human study to compare the evaluation effectiveness of CLIP and DINOv2; the preferences between CLIP and DINOv2 were quite balanced.
  3. Regarding the addition of hierarchical captions: we were inspired by COLE, which notes that layer-wise captions can enhance the model's perception of design elements. Hence, we intuitively added hierarchical captions.

Qualitative Results: We have added more qualitative experimental results, including both successful and unsuccessful cases. We also tested single-layer images generated by a T2I model (Ideogram, in this case) and found that such data is quite difficult to decompose, as it differs significantly from the training data; specialized collection of T2I data for customized training may be needed. The decomposition results are here.

Discussion of Related Work: Thank you for the reminder; we will add these discussions. Existing design-derendering methods such as LIVE and StarVector, although somewhat similar to our layer-decomposition approach, still differ in important ways. We observed that these methods convert images into SVG format; however, SVG struggles to represent complex image details, and these methods can currently only decompose simple graphics and cannot parse text. We have displayed some test results of LIVE here; the tested images are from the test set of our dataset.

Others:

  1. Equation 1 is correct. Minimizing the IoU loss is equivalent to maximizing the IoU, since IoU loss = 1 - IoU (see the sketch after this list).

  2. Thank you for pointing this out. We will revise it. Table 2 should be written as Loc_\hat{\alpha}.

  3. In theory, images can be represented using SVG or HTML. However, these representations require relatively long character strings (compared to our image encoding length), and they struggle to represent complex images and detailed information.

  4. Thank you for the reminder; this part was not clearly stated. ImageNet and LAION were used as the training datasets for the VQ-GAN. We used these datasets because posters contain many natural images, and we aimed to improve the encoding quality. The MLLM training uses the 200k-poster dataset and the instruction data extended from it (see Section 6.2).
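
On point 1, the following sketch illustrates the equivalence, assuming Eq. 1 matches predicted layers to ground-truth layers by bounding-box IoU over permutations (the function names and the use of SciPy's Hungarian solver are assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes given as [x0, y0, x1, y1]."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_layers(pred_boxes, gt_boxes):
    """Match predicted to ground-truth layers over permutations.

    Minimizing the total IoU loss (1 - IoU) over assignments is the same as
    maximizing the total IoU, which is why argmin in Eq. 1 is consistent with
    the reviewer's intuition about argmax.
    """
    cost = np.array([[1.0 - box_iou(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)  # optimal permutation
    mean_iou = float(np.mean([1.0 - cost[r, c] for r, c in zip(rows, cols)]))
    return list(zip(rows, cols)), mean_iou
```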

Reviewer Comment

Thank you for your detailed and constructive response. The authors have addressed many of my primary concerns/questions, therefore I will raise my score to accept.

One small comment: SVG and HTML can directly embed images (as pixels) by reference; it is not necessary to use SVG primitives, e.g., <path> elements, to construct the image. Anyway, this is a small side comment which has no effect on my score.
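
To make the side comment concrete, here is a hypothetical layered SVG written from Python, in which each raster layer is embedded by reference via an <image> element instead of being traced into <path> primitives (the file names and layout values are made up):

```python
# Raster layers referenced by file name; later elements paint on top, so
# document order encodes the stacking order.
layers = [
    {"href": "background.png", "x": 0, "y": 0, "w": 512, "h": 512},
    {"href": "product.png", "x": 96, "y": 80, "w": 320, "h": 240},
]

svg = ['<svg xmlns="http://www.w3.org/2000/svg" width="512" height="512">']
for layer in layers:
    # SVG 2 accepts `href`; SVG 1.1 viewers may require `xlink:href` instead.
    svg.append(
        f'  <image href="{layer["href"]}" x="{layer["x"]}" y="{layer["y"]}" '
        f'width="{layer["w"]}" height="{layer["h"]}"/>'
    )
svg.append('  <text x="128" y="64" font-size="32">SUMMER SALE</text>')
svg.append('</svg>')

with open("design.svg", "w") as f:
    f.write("\n".join(svg))
```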

Review
Rating: 2

This paper focuses on the graphic design layer decomposition task that converts graphic designs into ordered RGB-A layers and metadata. A large multimodal model, i.e., DeaM, is proposed with a two-stage process. The first stage produces layer-specific JSON metadata, and the second stage reconstructs pixel-perfect layers. The proposed model could support full decomposition and interactive decomposition.

Update after rebuttal

I appreciate the authors' clarifications of some details. But my main concerns are not fully addressed by the rebuttal. Thus, I would like to keep my score as weak reject.

The proposed problem is not new, as there are many layer decomposition and generation papers published before, especially at graphics and vision conferences. The current version does not thoroughly discuss this related work. Integrating an LLM is a straightforward solution. In addition, compared with the metrics commonly used for graphic designs, the current evaluation metrics are not thorough enough.

The authors claim in the rebuttal that [2] does not have open-source code. But it is easy to find the corresponding GitHub link via Google.

Questions for the Authors

The statement that the training loss for CARD is the same as that for VQ-GAN (Esser et al., 2021) is unclear.

Claims and Evidence

Layered design generation and decomposition have been studied before, so this cannot be considered a novel vision task.

Methods and Evaluation Criteria

The proposed method is a reasonable solution to graphic design decomposition and editing. The proposed dataset could be useful for future research in the community.

Theoretical Claims

The task formulation in Sec. 3 is a little unclear. The proposed method not only decomposes the image into an ordered series of RGB-A layers but also outputs metadata for each layer.

Experimental Design and Analysis

The experimental evaluation is not thorough.

  1. The evaluation metrics only include FID and Loc. FID cannot fully reflect image quality; additional metrics are necessary to evaluate image quality, bounding-box accuracy, and layer order (a sketch of such metrics follows this list).
  2. For the ablation study, even Loc is not used for evaluation.
  3. The current experiments compare the proposed method with only one baseline. Additional layer decomposition methods should be compared and discussed in the experiments.
  4. It would be better to add experiments showing applications of the proposed layer decomposition method.
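
As one concrete instance of the metrics asked for in point 1 above, here is a sketch of a layer-order agreement score based on Kendall's tau between predicted and ground-truth stacking orders; bounding-box accuracy could likewise be reported as mean IoU over matched layers. This is an illustration of the reviewer's suggestion, not a metric used in the paper:

```python
from scipy.stats import kendalltau

def layer_order_agreement(pred_order, gt_order):
    """Kendall's tau between predicted and ground-truth stacking orders.

    Both arguments are lists of the same layer ids ordered back to front;
    tau = 1 means the orders are identical, tau = -1 means fully reversed.
    """
    pred_rank = {layer_id: i for i, layer_id in enumerate(pred_order)}
    tau, _ = kendalltau([pred_rank[l] for l in gt_order], range(len(gt_order)))
    return tau

print(layer_order_agreement([0, 2, 1, 3], [0, 1, 2, 3]))  # ~0.67
```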

Supplementary Material

No supplementary material was provided.

Relation to Existing Literature

The proposed method is a reasonable and useful solution for automatic graphic design understanding and manipulation.

Essential References Not Discussed

Existing work on layered graphic design generation and decomposition [1-4] should be cited and discussed in the paper.

[1] Sbai, Othman, Camille Couprie, and Mathieu Aubry. "Vector image generation by learning parametric layer decomposition." arXiv preprint arXiv:1812.05484 (2018).

[2] Du, Zheng-Jun, et al. "Image vectorization and editing via linear gradient layer decomposition." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-13.

[3] Jia, Peidong, et al. "COLE: A hierarchical generation framework for graphic design." arXiv preprint arXiv:2311.16974 (2023).

[4] Inoue, Naoto, Kento Masui, Wataru Shimoda, and Kota Yamaguchi. "OpenCOLE: Towards reproducible automatic graphic design generation." arXiv preprint arXiv:2406.08232 (2024).

Other Strengths and Weaknesses

The proposed method is a reasonable solution to graphic design decomposition and editing. The proposed dataset could be useful for future research in the community.

Other Comments or Suggestions

The abbreviation of the proposed method should be consistent. L024 is DeaM, while Fig. 6 uses DeAM.

Author Response

Existing Layered Graphic Design Generation and Decomposition Work: Although layered design generation and decomposition have been studied previously, there are many differences from our proposed work, which focuses on the task of layer decomposition in graphic design. We will include discussions of works [1-4] in our paper. [1] investigates image generation by decomposing objects into different region layers (e.g., a portrait decomposed into hair, face, etc.), making the generation process more organized. [3] and [4] generate layout information and materials from the user's input text and then synthesize posters; our layer decomposition process is the inverse of these works. Although [2] also implements layer decomposition, it differs from our approach in that it focuses mainly on parsing single objects, cannot parse text, and is better suited to scenarios where each layer has minimal color and texture variation. Since [2] does not have open-source code, we cannot compare with it; we therefore follow the reviewers' suggestions and compare our work with LIVE from the derendering field.

Experimental Evaluation:

  1. We have added an evaluation of layer quality. The Loc metric reflects the accuracy of bounding boxes and layer order to some extent.
  2. In the ablation experiments, Enhancing Prediction Regularity and the Condition-Aware RGB-A Encoder had no impact on the accuracy of the boxes output by the MLLM. We tested the model without the Conjoined Visual Encoder, and its Loc_\hat{\alpha} is 0.6826.
  3. Since [2] does not have open-source code, we followed the reviewers' suggestions and compared our work with LIVE from the derendering field. The results are here. We observed that these methods convert images into SVG format; however, SVG struggles to represent complex image details, and these methods can currently only decompose simple graphics and cannot parse text.

Regarding the writing: Thank you for your reminder. We will revise the task description and abbreviations.

CARD Training Loss: CARD stands for Condition-Aware RGB-A Decoder. It improves VQ-GAN by adding a conditional branch, and its training loss is the same as that of VQ-GAN.
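
For reference, a sketch of the standard VQ-GAN objective from Esser et al. (2021) that this response points to, with sg[·] the stop-gradient operator (written from memory of that paper; in practice the reconstruction term is a perceptual loss, and CARD's conditional branch may add further inputs or terms):

```latex
\mathcal{L}_{\mathrm{VQ}}(E, G, \mathcal{Z}) =
  \lVert x - \hat{x} \rVert^{2}
  + \lVert \mathrm{sg}[E(x)] - z_{q} \rVert_{2}^{2}
  + \beta \,\lVert \mathrm{sg}[z_{q}] - E(x) \rVert_{2}^{2},
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{VQ}} + \lambda \,\mathcal{L}_{\mathrm{GAN}}.
```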

[1] Sbai, Othman, Camille Couprie, and Mathieu Aubry. "Vector image generation by learning parametric layer decomposition." arXiv preprint arXiv:1812.05484 (2018).

[2] Du, Zheng-Jun, et al. "Image vectorization and editing via linear gradient layer decomposition." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-13.

[3] Jia, Peidong, et al. "COLE: A hierarchical generation framework for graphic design." arXiv preprint arXiv:2311.16974 (2023).

[4] Inoue, Naoto, Kento Masui, Wataru Shimoda, and Kota Yamaguchi. "OpenCOLE: Towards reproducible automatic graphic design generation." arXiv preprint arXiv:2406.08232 (2024).

Review
Rating: 3

This paper proposes a novel layer decomposition model (DeaM) that transforms a given graphic design into a set of ordered transparent layers. The key challenges include predicting the correct layer ordering and resolving the mutual occlusion between overlapping layers. To this end, DeaM first predicts layer-specific JSON metadata, followed by a condition-aware RGB-A decoder that reconstructs high-quality transparent layers.

Questions for the Authors

No further questions.

Claims and Evidence

The authors propose using an LLM to transform a given set of discrete image tokens (prefix) that represent an entire image into a set of tokens consisting of the position of each layer, the attributes of each layer, and the exact visual tokens of each layer, with two different lengths: either 144 tokens or 64 tokens. A non-trivial design aspect is that the authors encode the entire image only with a combination of CLIP and DINOv2 visual encoders, rather than also considering the fine-tuned RGBA autoencoder. According to my understanding, the potential benefits of using the RGBA autoencoder would be twofold: first, most of the predicted discrete visual tokens could then essentially be copied from the conditional global visual tokens in the RGBA autoencoder space; second, it is unclear to me how the LLM can excel at predicting discrete tokens in the RGBA autoencoder space from inputs encoded in the CLIP and DINOv2 space, which seems non-trivial and challenging for the model to learn.

Another major concern is that the authors should support variable lengths for each transparent layer, rather than simply choosing two fixed lengths, considering that different transparent layers have totally different resolutions.

According to Figure 4, the layer decomposition effect is weak, as all of the decoration layers are placed within a single layer in the first example. This raises concerns that the proposed layer decomposition scheme is weak and cannot be extended to support an arbitrary number of layers.

Methods and Evaluation Criteria

The proposed approach is concise but non-trivial, considering that the target visual token space for the image layers is produced by a completely different RGB-A autoencoder, while the input visual tokens are extracted with CLIP and DINOv2.

Another concern is that the authors should demonstrate the following key aspects:

  • Whether the proposed approach performs well when handling an increasing number of layers, such as more than 5 layers.
  • Whether the proposed approach can generalize to single-layer images generated with T2I models like FLUX, rather than to in-domain ones like the Crello dataset.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analysis

The major concerns about the experimental results are that the FID score is still very high, even with the proposed three techniques. In addition, the visual results in Figure 4 are too naive and fail to demonstrate the potential value of such weak layer decomposition performance.

Supplementary Material

The authors do not provide any supplementary material.

Relation to Existing Literature

I do not see the potential of this work for the broader scientific literature.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

As mentioned earlier, the major concerns include:

  • The authors propose using an LLM to transform a given set of discrete image tokens (prefix) that represent an entire image into a set of tokens consisting of the position of each layer, the attributes of each layer, and the exact visual tokens of each layer, with two different lengths: either 144 tokens or 64 tokens. A non-trivial design aspect is that the authors encode the entire image only with a combination of CLIP and DINOv2 visual encoders, rather than also considering the fine-tuned RGBA autoencoder. According to my understanding, the potential benefits of using the RGBA autoencoder would be twofold: first, most of the predicted discrete visual tokens could then essentially be copied from the conditional global visual tokens in the RGBA autoencoder space; second, it is unclear to me how the LLM can excel at predicting discrete tokens in the RGBA autoencoder space from inputs encoded in the CLIP and DINOv2 space, which seems non-trivial and challenging for the model to learn.

  • Another major concern is that the authors should support variable lengths for each transparent layer, rather than simply choosing two fixed lengths, considering that different transparent layers have totally different resolutions.

  • Whether the proposed approach performs well when handling an increasing number of layers, such as more than 5 layers.

  • Whether the proposed approach can generalize to single-layer images generated with T2I models like FLUX, rather than to in-domain ones like the Crello dataset.

Other Comments or Suggestions

No other comments.

Author Response

Thank you for your valuable feedback and insightful comments.

About RGBA autoencoder:

  1. This idea is very straightforward and interesting. Initially, we also tried using the RGBA autoencoder as the visual encoder but found that the model's output contained severe hallucinations. We speculate that because the RGBA autoencoder is trained only with an image-reconstruction objective, it cannot capture some semantic information. Therefore, using the RGBA autoencoder as the visual encoder of the MLLM would make visual understanding tasks challenging (layer decomposition requires visual understanding because the model must output not only image encodings but also layer metadata). Additionally, we observed that some recent works such as UniTok train the visual encoder with both contrastive and reconstruction losses simultaneously; using such an encoder might aid layer decomposition, which is a future research direction for us.
  2. Our understanding is that the discrete tokens in the RGBA autoencoder space actually correspond to certain visual concepts (colors, textures, etc.). Considering the model capacity of the LLM, learning this mapping should still be feasible.

Variable-Length Layer Encoding: Your suggestion is very reasonable. We have considered this approach, but we believe there are some issues:

  1. Resolution and encoding length may not be directly correlated. Some layers, such as pure-color backgrounds, may have very high resolution but convey very little information and can thus be represented with fewer tokens. Determining the appropriate encoding length for an image is therefore itself a challenge.
  2. If variable lengths are introduced, the model often makes errors when predicting token lengths, which can lead to failures in decoding the image.

Regarding Figure 4: Thank you for the reminder; it helps us further clarify the details. This example is intriguing. We believe that when there are many decorative elements and their spatial relationships do not have an apparent hierarchical order, the model tends to predict them as being on the same layer. This situation is relatively uncommon. Our model does have some capability in decomposing multi-layer graphic designs, and we display more layer decomposition results below.

Qualitative Results Display:

  1. The decomposition results for some designs with more than 5 layers are here.
  2. We tested single-layer images generated by a T2I model (Ideogram, in this case) and found that such data is quite difficult to decompose, as it differs significantly from the training data; specialized collection of T2I data for customized training may be needed. The decomposition results are here.

Review
Rating: 3

The paper proposes the problem setup Layer Decomposition (LD) and an approach, the Decompose Layer Model (DeaM), that can take the rendition (image) of a single-page graphic design and "decompose" it into its constituent components. The problem setup that the paper introduces has immense practical value: for instance, a user can scan a graphic design and make changes to it after it is broken down into its constituent components by DeaM.

Update after Rebuttal

I thank the authors for their efforts in clarifying my queries and concerns. I am raising my score from weak reject to weak accept. The reasons why I am not increasing the score even further are:

  1. As pointed out earlier in my review, I genuinely feel that the way the quality of each layer of the decomposed design is evaluated should improve. Measuring the FID of each component layer cannot serve as a proxy for whether each component that should have been separated into its own layer actually was.
  2. I share concerns with Reviewer NQZP on the quality of the output and with Reviewer ZHaU on the lack of qualitative results. I appreciate that the authors shared 6 results during the rebuttal phase, but that is too few for a computer-vision-oriented paper. Combined with the fact that FID is the main quantitative metric (which my fellow reviewers have noted is not ideal), this makes it hard to form an informed decision.

As the paper introduces a new problem setting, it would greatly benefit the community if they release the model and the dataset, so that fellow researchers can build upon them. Thank you!

Questions for the Authors

Please see the other sections.

Claims and Evidence

  1. The paper claims to introduce a new problem setup and the first approach towards the same, which is true.
  2. The paper claims to have collected a new dataset, but is largely silent about its key attributes:
  • How was it collected?
  • Are they all single page graphic designs?
  • How many datapoints are there (the paper vaguely says over 200,000 designs, but what is the exact number)?
  • What domains of graphic design does it cover: fashion, retail, corporate, or others?
  • What is the average number of layers in the dataset?
  • What is the proportion of text and image layers in the dataset, and so on?

Methods and Evaluation Criteria

The proposed approach is logical, but a lot of details are missing:

  1. Sec. 5.1 abruptly starts with the discussion of VQ-GAN. From the context, it seems that the VQ-GAN has to be adapted to also take in the alpha channel. How was this adaptation done? Were the encoder and decoder of the VQ-GAN modified to include additional layers that consume the additional channel?
  2. Line 203 says that the VQ-GAN is trained on "poster images". Which dataset are these coming from?
  3. In Sec. 5.2, DINOv2 features are used to focus on "lower-level visual elements" (Line 217). It is very unclear how adding DINOv2 features helps direct more attention to elements like graphic lines and shapes.
  4. In lines 234-235, the input resolutions of natural images and decorative elements are set to 192 × 192 and 128 × 128, respectively. This choice should be validated by an ablation experiment; it is unclear why it improves performance.

On the evaluation:

  1. The model is trained on their new dataset but evaluated on a public dataset (Crello). This doesn't seem like a good evaluation protocol. It would have been ideal to train and test on the proposed dataset and on Crello separately; this would showcase the mettle of the proposed approach on two datasets.
  2. The key characteristic of DeaM is its ability to create layers from the input image. Hence, during evaluation, the quality of each layer should be explicitly checked. Currently only the image reconstruction quality is evaluated. Even if each layer is not perfect (say, two components that should have been separated into two layers ended up in the same layer), the reconstruction might still be good, so reconstruction quality is not a proxy for the quality of each layer.
  3. There are only two qualitative results in the paper (Figure 4), which is too few to make an informed judgement.
  4. Line 415 reads: "DeaM excels in text reconstruction due to its accurate prediction of text details such as content, font, size, and color.", which is an absolutely flawed assertion, as Figure 5, referred to in this section, contains gibberish text (see how "breakfast" is transformed) and spelling mistakes ("#ThouchFreeDelivery", "count -> cocount", and so on).
  5. Need to showcase failure cases too.

Theoretical Claims

None

Experimental Design and Analysis

See above

Supplementary Material

Supplementary material not provided.

Relation to Existing Literature

Properly placed.

Essential References Not Discussed

None

Other Strengths and Weaknesses

The writing in the paper should improve a lot; for example, the term CARD is not introduced in the text. The intro reads as if it is cut off abruptly. The paper would benefit from thorough proofreading.

Minor comment:

  • The new problem setting that the paper proposes is named "Layer Decomposition (LD)", which is too broad. Ideally, any image, 3D scene, or similar data can be decomposed into layers, and the paper is not proposing a generic method to decompose all of those. The paper is specific to single-page graphic designs, and hence the problem setup would be better termed "Layer Decomposition of Graphic Designs (LDGD)".

Other Comments or Suggestions

I genuinely feel that there is merit to the problem setup, but the gaps in the methodology and the insufficient evaluation make me gravitate towards rejecting the paper in its current state. Happy to be convinced otherwise.

Author Response

Thank you for your constructive feedback, which gives us the opportunity to clarify the ambiguities in the paper.

Dataset Information: The dataset is collected from the internet and consists of single-page graphic designs (including all layer materials), comprising 224,054 samples. The samples are primarily posters and cover areas such as holiday events, retail, dining, and corporate domains. The average number of layers per design is 10.30, with an approximate image-to-text layer ratio of 6.3:3.7.

VQ-GAN Details: The modification is very simple: we adjust the number of channels in the convolutional kernels of the first and last layers from 3 to 4 to accommodate the alpha channel. The poster training data consists of the over 200,000 samples collected as described above; however, the VQ-GAN is trained on the materials from all image layers, whereas the MLLM is trained on the final poster images.
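
A minimal PyTorch sketch of the kind of change described here; the attribute names encoder.conv_in and decoder.conv_out follow the common taming-transformers layout and are an assumption, not the paper's code:

```python
import torch.nn as nn

def widen_to_rgba(vqgan):
    """Swap the first and last convolutions from 3 to 4 channels (RGB -> RGBA).

    One could additionally copy the pretrained RGB weights and zero-initialize
    the new alpha channel, which is omitted here for brevity.
    """
    old_in = vqgan.encoder.conv_in
    vqgan.encoder.conv_in = nn.Conv2d(
        4, old_in.out_channels, kernel_size=old_in.kernel_size,
        stride=old_in.stride, padding=old_in.padding)

    old_out = vqgan.decoder.conv_out
    vqgan.decoder.conv_out = nn.Conv2d(
        old_out.in_channels, 4, kernel_size=old_out.kernel_size,
        stride=old_out.stride, padding=old_out.padding)
    return vqgan
```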

DINOv2 Details: We were inspired by the COMM work. The visual encoder of CLIP is well aligned with the word embedding space, but because of the global supervision from image captions, it fails to learn more detailed pixel-level information, which might hinder fine-grained perception in the MLLM. Therefore, we added a visual encoder based on self-supervised training, DINOv2, whose training approach enables it to focus more on pixel-level details (e.g., simple geometric elements).
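
A rough sketch of one way such a conjoined encoder could fuse the two feature streams (concatenation of patch tokens followed by a linear projection into the LLM embedding space); this is an illustrative guess, not necessarily the paper's exact Conjoined Visual Encoder:

```python
import torch
import torch.nn as nn

class ConjoinedVisualEncoder(nn.Module):
    """Concatenate CLIP (semantic) and DINOv2 (pixel-level) patch features."""

    def __init__(self, clip_encoder, dino_encoder, clip_dim, dino_dim, llm_dim):
        super().__init__()
        self.clip, self.dino = clip_encoder, dino_encoder
        self.proj = nn.Linear(clip_dim + dino_dim, llm_dim)

    def forward(self, pixels_clip, pixels_dino):
        # Both encoders are assumed to return (batch, num_patches, dim) patch
        # tokens with a matching number of patches.
        semantic = self.clip(pixels_clip)
        detail = self.dino(pixels_dino)
        return self.proj(torch.cat([semantic, detail], dim=-1))
```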

Resolution Settings: Higher image resolutions lead to clearer reconstructions of natural images, but they also make the model's training sequences much longer and significantly increase the computational cost. Based on empirical observations from VQ-GAN's results on natural images, we chose a resolution of 192. Decorative elements are generally simpler than natural images, so a smaller resolution of 128 is sufficient to ensure adequate quality.
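
As a brief note, these resolutions are consistent with the two fixed token lengths (144 and 64) mentioned in the reviews, assuming the usual VQ-GAN spatial downsampling factor of 16:

```latex
\left(\tfrac{192}{16}\right)^{2} = 12^{2} = 144 \ \text{tokens},
\qquad
\left(\tfrac{128}{16}\right)^{2} = 8^{2} = 64 \ \text{tokens}.
```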

Regarding the evaluation:

  1. The Crello training data is quite small (approximately 20k), making it challenging to obtain a reasonably good decomposition model. Due to time constraints, we first re-divided our dataset, randomly selecting 1000 images as a test set for evaluation. The results are here.
  2. We added evaluation results for single layers to this link.
  3. We present qualitative results on the test set of our dataset here.
  4. We will revise this description. Currently, the model tends to produce more hallucinations when predicting small text and artistic text, whereas attribute predictions for large printed text are relatively reliable.
  5. We supplemented the presentation with many failure cases, primarily highlighting cases where the reconstructed layers differ significantly from the original image; the cases are here.

Regarding the writing:

  1. CARD stands for Condition-Aware RGB-A Decoder. We will further refine the content of the introduction and methodology sections.
  2. Thank you for your suggestion. "Layer Decomposition of Graphic Designs (LDGD)" indeed better fits the task of this paper, and we will make the necessary modifications.

Final Decision

This paper introduces Layer Decomposition (LD), a new vision task that converts graphic designs into structured, ordered RGB-A layers with metadata for easier editing and reuse. The paper proposes DeaM, a multimodal model that tackles challenges like layer ordering and occlusion through a two-stage process combining visual encoding, metadata generation, and precise layer reconstruction.

The paper was reviewed by four experts with final ratings of one Accept, two Weak Accepts, and one Weak Reject. All reviewers appreciated the setting but also had concerns about the evaluation (limited results, need for more ablation studies). The authors' response largely addressed the reviewers' concerns. After the rebuttal phase, Rev#6rxP upgraded their score to Weak Accept, explicitly stating that the problem setup is novel and the approach is logical, although the evaluation could have been stronger. While some concerns remained around the need for more qualitative results, the additional results in the rebuttal and the interest of the setting itself tilted the reviewers' ratings towards acceptance.

Given the mixed reviews, the AC went through the paper before making a decision. Considering that the positives outweigh the negatives, the paper is recommended for acceptance. The authors are recommended to integrate the rebuttal responses in the final version.