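PaperHub
Average rating: 5.3/10
Decision: Poster · 3 reviewers
Ratings: 3, 7, 6 (lowest 3, highest 7, standard deviation 1.7)
Average confidence: 3.7
Correctness: 2.3 · Contribution: 2.7 · Presentation: 2.7
NeurIPS 2024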

DeBaRA: Denoising-Based 3D Room Arrangement Generation

Submitted: 2024-05-16 · Updated: 2025-01-07
TL;DR

We propose DeBaRA, a conditional score-based generative model that performs state-of-the-art arrangement generation in bounded indoor scenes and several downstream applications by solely learning 3D object spatial features.

Keywords
Indoor 3D Scene Synthesis, Layout Generation, Score-based Generative Models, Diffusion Models, Conditional Generation

Reviews and Discussion

Review
Rating: 3

The authors present DeBaRa, a diffusion-based framework for indoor scene generation, and a self-score evaluation strategy to select conditioning input. They demonstrate the effectiveness of their approach on scene synthesis and several downstream tasks.

Strengths

  • The empirical results are promising.

Weaknesses

  • Insignificant Architectural Difference: The architectural differences between DeBaRa and other diffusion-based baselines (e.g., DiffuScene) are not substantial. The claim that the results benefit from these architectural differences is not sufficiently supported by experimental evidence. To be honest, we care less about the final performance than about the impact of the differences.
  • Lack of Implementation Details and Ablation Studies: The paper does not provide sufficient implementation details, making it difficult to reproduce the results. Additionally, ablation studies are missing, which are crucial to understanding the contributions of individual components of the model.
  • Missing Experiments: The paper claims advantages in downstream tasks like rearrangement and completion but lacks comparative experiments with baseline methods (ATISS, LEGO-Net, and DiffuScene) both qualitatively and quantitatively.
  • Unclear Writing and Missing Sections: The writing is unclear in several parts of the paper. Additionally, Section 4.4 ("Additional Results") is missing, which further reduces the clarity and completeness of the work.
  • Typing Errors: There are numerous typing errors throughout the paper, such as "biaises" instead of "biases" in Line 39. These errors affect the readability and professionalism of the manuscript.

Questions

  • The inference time is reported to be significantly faster than DiffuScene (about 100 times faster). Are these advantages due to the use of EDM over DDPM? Please elaborate on how the choice of EDM contributes to the improved inference time.

Limitations

N/A

Author Response

We thank reviewer sfxx for their time and feedback. We address the reported weaknesses and questions in the following response:

Insignificant Architectural Difference: The architectural differences between DeBaRa and other diffusion-based baselines (e.g., DiffuScene) are not substantial.

First, we would like to emphasize that although we do not consider our neural network architecture to be a key contribution, there are some significant differences between ours and that of DiffuScene [1]. Specifically, unlike DeBaRa, [1] adopts a U-Net backbone with 1D convolutions. It is not conditioned on the floor plan of the room and therefore performs unbounded scene synthesis. Our architecture features fixed positional encoding modules, linear layers, a Transformer encoder as well as a PointNet feature extractor and is therefore close to that of LEGO-Net [2] (which is not a diffusion model), as deliberately stated in our main submission (L146).

More fundamentally, however, and as mentioned in the summary above as well as in the paper, we consider our key contributions to lie in:

  1. a continuous-time score-based model for indoor layout generation, whose output domain has been simplified to learn unconditional and class-conditional densities of object bounding boxes, expressed in a common 3D coordinate space (Section 3.2). It is trained following
  2. a novel Chamfer objective that is permutation-invariant by design (Section 3.3).
  3. an original Self Score Evaluation (SSE) procedure to optimally select conditioning inputs from external sources, leveraging density estimates provided by the pretrained model, allowing our method to be the first to unify the use of a LLM and of a specialized diffusion model in the context of 3D scene synthesis (Section 3.4).

These are the key novelties of our work, which do not exist in DiffuScene or any other prior approach, and which allow us to achieve state-of-the-art capabilities in 3D layout generation (Table 1), 3D scene synthesis (Table 2) and scene re-arrangement (PDF Table 3), while adopting a backbone that is significantly more lightweight than previous methods, which further enables real-time (< 1s) efficient sampling (Table 3).
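To illustrate why a Chamfer-style objective is permutation-invariant, the following minimal sketch compares two sets of objects by nearest-neighbor matching restricted to objects of the same class. The function name, the representation of an object as a `(class_id, features)` pair, and the class-restricted matching rule are assumptions made for illustration only; the actual objective is the one defined in Section 3.3 of the paper.

```python
def semantic_chamfer(pred, target):
    """Symmetric Chamfer distance between two sets of (class_id, features).

    Hypothetical sketch: each object is matched to the nearest object of the
    same class in the other set, so the value does not depend on ordering.
    """
    def one_way(src, dst):
        total = 0.0
        for cls, feat in src:
            # Squared distances to every same-class candidate in the other set.
            dists = [
                sum((a - b) ** 2 for a, b in zip(feat, other))
                for other_cls, other in dst
                if other_cls == cls
            ]
            if dists:  # objects without a same-class counterpart are skipped
                total += min(dists)
        return total / max(len(src), 1)

    return one_way(pred, target) + one_way(target, pred)


# Two objects of class 0 and one of class 1, with 2D toy features.
pred = [(0, (0.0, 0.0)), (0, (1.0, 1.0)), (1, (2.0, 0.0))]
target = [(1, (2.0, 0.5)), (0, (1.0, 1.0)), (0, (0.0, 0.1))]

loss = semantic_chamfer(pred, target)
# Reordering the target set leaves the loss unchanged.
shuffled_loss = semantic_chamfer(pred, list(reversed(target)))
```

A plain MSE between the two lists would instead depend on the order in which objects happen to be stored, which is the motivation for a set-based objective.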

The paper does not provide sufficient implementation details, making it difficult to reproduce the results.

Implementation details are extensively described throughout the paper and submitted appendix, notably:

  • Section A.1. Denoiser training parameterization.
  • Section A.2. EDM sampling procedure (Algorithm 2) and hyper parameters (L487-488).
  • Section B.1. Network architecture with positional encoding formula and number of frequencies, number of layers of linear modules, their activations, dropout rate, output dimensions as well as details on the Transformer implementation with number of encoder layers, their hidden dimensions, number of heads and token masking strategy.
  • Section B.2. Details on the training protocol, including the number of epochs, batch size, optimizer, learning rate, learning rate schedulers, and data augmentation routine.
  • Section B.3. Details on the (re)implementation of baseline methods.
  • Section B.4. Link to the model and implemented prompting strategy in our LLM-guided scene synthesis pipeline.
  • Section 3.4. Pseudo-code for SSE (Algorithm 1).
  • We follow the data preprocessing from ATISS [3], as stated L224.

We also include additional implementation details in our general response, and would be happy to answer any remaining implementation-related questions.

The paper claims advantages in downstream tasks like rearrangement and completion but lacks comparative experiments with baseline methods

Including additional experimental results against established baselines on downstream tasks is a valuable suggestion. In Table 3 and Figure 1 of the rebuttal PDF, we include quantitative and qualitative results against LEGO-Net, which itself outperforms ATISS [3] on the scene re-arrangement task, as stated in our general response.

Unclear writing and typing errors

We thank reviewer sfxx for reporting a typographical error and the missing references to the table and figure of Section 4.4 (which made the dedicated section appear empty in the paper). This issue and other minor typos have been corrected in the current version of the manuscript. We will also make sure to incorporate all additional comments in the final version.

Please elaborate on how the choice of EDM contributes to the improved inference time.

As stated in our manuscript (L474-476), EDM proposes a 2nd order Runge-Kutta stochastic sampler (Algorithm 2) that provides a favorable trade-off between generation quality and number of function evaluations (NFE), as verified throughout the quantitative evaluations of the seminal paper [4]. Our implementation uses 50 sampling steps, which results in an NFE of 101. On the other hand, DiffuScene uses ancestral DDPM sampling with 1000 steps / function evaluations. Additionally, our original design choices enable our transformer-based backbone to feature around 7 times fewer parameters (Table 3), which has a direct impact on inference time.
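As a back-of-the-envelope check on the figures quoted above, the sketch below computes how much of the speed-up can be attributed to the number of denoiser calls alone. It assumes that every call costs the same for both methods, which is not the case here since the backbones differ in size; the effect of the smaller backbone is therefore not modeled.

```python
# Figures quoted in the response above.
EDM_NFE = 101    # function evaluations reported for 50 second-order steps
DDPM_NFE = 1000  # ancestral DDPM sampling steps reported for DiffuScene


def nfe_speedup(baseline_nfe: int, new_nfe: int) -> float:
    """Ratio of denoiser calls, assuming equal cost per call (an assumption)."""
    return baseline_nfe / new_nfe


speedup = nfe_speedup(DDPM_NFE, EDM_NFE)
print(f"Fewer denoiser calls alone: ~{speedup:.1f}x")
```

Under this assumption the sampler accounts for roughly one order of magnitude; any remaining gap would have to come from per-call cost, for example the smaller backbone mentioned above.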

[1] DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis, Tang et al., 2024.

[2] Lego-net: Learning regular rearrangements of objects in rooms, Wei et al., 2023

[3] ATISS: Autoregressive Transformers for Indoor Scene Synthesis, Paschalidou et al., 2021

[4] Elucidating the Design Space of Diffusion-Based Generative Models, Karras et al., 2022

Comment

Thank you for your response. While some of your points partially address my concerns, I am still inclined to recommend rejecting this paper.

  • Novelty: Using EDM to replace DDPM for indoor scene modeling does not represent a significant contribution, which does not offer much new insight into the field of indoor scene generation.

  • Experiments: Since the focus is on the training framework, its effectiveness is not validated on a sufficiently scalable dataset. To my knowledge, the synthetic data (3D-FRONT) used contains only around 5k scenes for training (with subsets like library being even smaller), which is relatively small compared to current image datasets.

  • Comparisons: The lack of comparison with many autoregressive model works, such as COFS [1], and scene-graph-based methods like SceneHGN [2] and GRAINS [3], is concerning.

  • Evaluation Metrics: The metrics used in this study, such as FID, KL, and CAS, seem rather generic. They might not fully capture the essential aspects of scene generation, including diversity, complexity, symmetry pattern discovery, object interaction, and object concurrences.

  • Rendering Quality: The quality of the rendered indoor scenes does not appear to meet the standards expected by artists. For better examples, the authors may refer to [2, 3].

Questions:

  • It would be insightful to investigate whether this diffusion-based method can produce more diverse scenes, compared to autoregressive-based methods.
  • The evaluation protocol is not clear. How many generated/GT scenes do you use for evaluation?

References:

[1] Para et al., "COFS: Controllable Furniture Layout Synthesis," SIGGRAPH 2023.

[2] Gao et al., "SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation with Fine-Grained Geometry," TPAMI 2023.

[3] Li et al., GRAINS: Generative Recursive Autoencoders for INdoor Scenes, TOG 2019.

Comment

To summarize the key points:

  1. We evaluate our approach on the same standard dataset as done in recent state-of-the-art methods [1,2,3].
  2. The baselines that we compare against supersede the methods suggested by the reviewer. Nevertheless, for completeness, in the final version we will be happy to add comparisons against methods that can be evaluated on our test set.
  3. We use the same presentation pipeline as in recent published works, ensuring consistent comparisons.

We thank reviewer sfxx for their constructive feedback. We are confident that all of the requested changes can be addressed in a minor revision. Furthermore, we remain confident of the contribution of our work, as also confirmed by, e.g., reviewer Fsdh.

[1] DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis, Tang et al., in CVPR 2024.

[2] Lego-net: Learning regular rearrangements of objects in rooms, Wei et al., in CVPR 2023.

[3] COFS: Controllable Furniture Layout Synthesis, Para et al., in SIGGRAPH 2023.

[4] Generalization in diffusion models arises from geometry-adaptive harmonic representations, Kadkhodaie et al., in ICLR 2024.

Comment

We thank reviewer sfxx for increasing their initial scores and for taking the time to engage in the discussion period.

We want to highlight that most of the reviewer's additional concerns and questions were not raised in their initial review. While we would have been happy to address them in our rebuttal, we do so in the following response:

Novelty

We refer the reviewer to the list of contributions mentioned in our rebuttal. Although we do not consider the use of EDM alone to be one of our major contributions, we don't employ it as a simple drop-in replacement for DDPM and discuss the relevance of this design choice in the context of indoor scene synthesis in our main submission, L132-139.

Experiments

We evaluate our method on the standard dataset for 3D indoor scene synthesis, which is the one used in the most recent baselines [1, 2], including the COFS [3] paper mentioned in the reviewer's comment.

We also want to emphasize that we actually view the performance of our method (demonstrated by our qualitative and quantitative evaluations), and its ability to generalize to complex / unseen floor plans (as shown in Figure 2 of our rebuttal PDF) as additional strengths in the light of the limited number of training samples.

Finally, generative diffusion models are known to scale favorably to larger training datasets [4].

Comparisons

As stated in our main submission L219, we chose to evaluate DeBaRA against established baselines from different model families (ATISS: autoregressive; DiffuScene / LEGO-Net: denoising-based; LayoutGPT: LLM-based).

We can read in the COFS paper: "Our model thereby extends the baseline ATISS with new functionality while retaining all its existing properties and performance", which can be verified in their experimental evaluations. On the other hand, our method largely outperforms ATISS, as quantitatively verified in Table 1 and Table 2 of our submission, and observed in Figure 3.

The GRAINS method only supports rooms having four walls in its predicted layouts, while an important feature of our approach is that it takes into consideration complex (i.e., non-square) input floor plans. As a result, GRAINS is not applicable to a significant number of scenes from our test set. In contrast to DeBaRA, SceneHGN proposes to generate scenes at the object part-level, which requires a custom dataset. However, we are willing to evaluate the 3D layout generation capabilities of the method in the revision of our manuscript.

Evaluation Metrics

Unlike what is stated in the reviewer's comment, KL Divergence is not measured in our study as objects' semantic categories are not part of DeBaRA's prediction space. For the same reason, measuring object concurrences may not be a relevant addition to our paper.

Also note that FID and KID are known to evaluate both the diversity and the fidelity of the generated content. Non-diverse generation results wouldn't comprehensively capture the distribution of real scenes, which would directly penalize the FID and KID scores. Additionally, a better SCA score reflects more plausible layouts, as they are harder to distinguish from real ones.

The superiority of denoising-based approaches in capturing symmetry / alignment patterns compared to e.g., autoregressive methods in the context of scene synthesis has been extensively studied by previous work [1, 2].

We also include in our submission's appendix additional indicators measuring the validity of generated layouts w.r.t. the provided floor plan (Table 5).

Rendering Quality

Our scene renderings, with objects colored according to their semantic categories, have been included to help readers appreciate the quality, diversity and validity of generated layouts, while easily distinguishing different objects. We do not claim our renderings to be at the level of those produced by an artist. We used the rendering pipeline provided by the official implementation of DiffuScene (which builds upon that of ATISS). Our rendering quality is therefore on par with that of this recently published work. However, also note that our method could be employed alongside more advanced rendering engines.

We will follow your suggestion and include additional qualitative results featuring textured objects and floors, which can be obtained using our current rendering pipeline.

It would be insightful to investigate whether this diffusion-based method can produce more diverse scenes

Please see our previous answer regarding FID / KID.

How many generated/GT scenes do you use for evaluation?

As stated in our main submission, evaluation metrics are computed across each test subset (L244), following the splits described L225. For each test subset, we generate the same number of scenes as the number of real ones.

Review
Rating: 7

This paper studies 3D room arrangement/layout generation. It proposes DeBaRA, a diffusion-based generative model, which can generate layouts given the list of furniture and the floor map of the room. The proposed method provides good results and is able to work in various scenarios: layout generation, LLM-guided text-to-scene generation, scene completion, etc.

Strengths

  • The proposed method is a novel diffusion-based generative model, which is effective in the aimed task.
  • The results generated are high-quality both qualitatively and quantitatively.
  • An LLM-guided alternative pipeline is provided to simplify the input format, which adds more functions to the model.

Weaknesses

  • There is no ablation study of the design choices. It is unclear whether and how each design choice and component helps the results.
    • It would be better if the ablation results could be provided for the following designs: EDM v.s. DDPM/DDIM, designed 3D spatial objective v.s. simple MSE, different classifier-free guidance scales, with v.s. without SSE.
  • Some of the designs are not clearly described in the paper. Besides, some of the assumptions are not very clear.
    • I wonder whether adding noise to the scene will cause the noisy layout to contain overlapped objects, objects of unreasonable sizes/rotations, etc. I did not see any of these in Fig.1. It might be better if a visualization of the process of scene being denoised could be provided.
    • I wonder how the size is determined by the generative model. For example, a sofa can be long or short, a table can be large or small, and a cabinet can be as tall as the wall or as small as a bedside table. How will the model decide the sizes of these objects? Can the user indicate the rough size of the object?
    • I wonder if it is assumed that most objects should be located in a rigid way that each edge is parallel to one wall? These are observed in most of the results of the proposed model in Figs. 3~6. If so, will these limit the diversity of generation results?
  • It seems that all the floor maps are relatively simple. I wonder how the model can work in more complicated floor maps, e.g., a floor with many rooms, a round room, or a triangle room.
  • In contrast to the claim in Q4 of the Checklist, many implementation details are not revealed, e.g., training settings (e.g., lr and iterations) and classifier-free guidance scales.
  • (Minor) The reason why this diffusion model can work might be that it can be regarded as an extension from some point cloud diffusion models, which also directly apply denoising on coordinates or some explicit presentations. However, this direction was not discussed in the related work.
  • (Minor) There are some minor math presentation issues. For example, multiple $min$ operators should be typeset as $\min$ (e.g., L166, L178).

Questions

Please see "Weaknesses".

Limitations

The authors mentioned some limitations in the paper and did not indicate the societal impacts.

One possible societal impact might be that "the generated layouts may lead to unsafe constructions, and therefore the model should give warnings about actually using it for constructing real-world rooms".

Author Response

We thank reviewer Fsdh for their time and positive feedback. We address the reported concerns in the following response:

It would be better if the ablation results could be provided for the following designs: EDM v.s. DDPM/DDIM, designed 3D spatial objective v.s. simple MSE, different CFG scales

We provide an ablation study of these design choices in Table 1 and Table 2 of the rebuttal PDF. Note that as we don't apply CFG at sampling time, we chose to ablate the use of conditioning dropout on the input object categories during training.

Some of the designs are not clearly described in the paper

One general note to clarify some of the following concerns is that unlike previous work, we don't prevent unwanted behaviors such as object collisions, out-of-bounds or misaligned elements using additional loss terms [1] or rigid / hard-coded rules. Instead, we adopt a purely data-driven approach to capture complex patterns solely from the training layouts. Remarkably, results reported in Table 5 of our submission indicate that DeBaRA largely outperforms previous methods at respecting the indoor floor plan (i.e., keeping objects within its bounds).

I wonder if it is assumed that most objects should be located in a rigid way that each edge is parallel to one wall?

It is not. The 3D-FRONT dataset that we use for training predominantly contains rooms in which objects are aligned with walls and rarely have unusual angles (i.e., different from 0° or 90°). We will be happy to include a statistical analysis of the training and generated data in the revision of our manuscript to quantitatively support our observations. Additionally, on Figure 6 (right), we perform scene completion by adding a bookshelf (pink) and a coffee table (blue). We repeat the experiment ten times and report the denoising object trajectories, intermediate and final positions (colored and black dots respectively). This makes it possible to observe the variety of predicted layouts. Notably, we can see that the bookshelf ends up in various different positions, always next to a wall.

I wonder how the size is determined by the generative model. [...] How will the model decide the sizes of these objects?

During the denoising process, coarse spatial attributes are typically determined during the early iterations (i.e., high noise levels, injection of fresh noise), while precise / fine-grained features are set in the late time steps (low noise levels, no injection of fresh noise). Notably, the added stochasticity in the early denoising steps helps better explore the space of possible features. These general assessments on diffusion sampling have been explored by other work in the context of image generation [2]. They can be qualitatively observed in Figure 3 of the rebuttal PDF.

Can the user indicate the rough size of the object?

Yes, DeBaRA can be used to sample layouts from specified (i.e., fixed) spatial features such as object dimensions, as described in Section 3.5, using the binary mask $\mathbf{m}$ of L209. If users want to input the rough size of objects (i.e., instead of exact ones), $\mathbf{m}$ can be relaxed (i.e., set to $\mathbf{0}$) in the late iterations to let the model adjust fine-grained dimensions. The denoising time step from which $\mathbf{m}$ is relaxed can be set depending on the precision of the user-defined sizes. We find this reviewer suggestion to be a very practical and intuitive use of our method that further highlights its versatility and we will be happy to include this in the final version.
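The mask relaxation described here can be sketched as follows. This is a minimal illustration rather than the paper's sampler: `denoise_step` is a hypothetical stand-in for one update of the trained model, and `relax_from` is an assumed name for the step from which the mask is set to zero.

```python
def sample_with_size_hint(denoise_step, x_init, x_user, mask, num_steps, relax_from):
    """Iteratively denoise `x_init`, pinning masked entries to user values.

    mask[i] == 1 keeps feature i fixed to x_user[i]. From step `relax_from`
    onward the mask is treated as all zeros, so the model can refine the
    previously fixed features.
    """
    x = list(x_init)
    for step in range(num_steps):
        x = denoise_step(x, step)
        if step < relax_from:
            # Re-impose the user-specified features after every update.
            x = [u if m else v for v, u, m in zip(x, x_user, mask)]
    return x


# Toy stand-in for the trained model: pull every feature halfway toward 1.0.
def toy_step(x, step):
    return [v + 0.5 * (1.0 - v) for v in x]


# Feature 0 is a user-provided size (2.0); feature 1 is left free.
exact = sample_with_size_hint(toy_step, [0.0, 0.0], [2.0, 0.0], [1, 0],
                              num_steps=4, relax_from=4)
rough = sample_with_size_hint(toy_step, [0.0, 0.0], [2.0, 0.0], [1, 0],
                              num_steps=4, relax_from=3)
```

With `relax_from` equal to `num_steps` the user value is kept exactly; a smaller value lets the last steps adjust it, which corresponds to treating the input as a rough size.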

It might be better if a visualization of the process of scene being denoised could be provided.

Injecting noise may produce such invalid intermediate 3D configurations in early time steps (i.e., when spatial features are far from their final values). However, these phenomena will tend to diminish during the denoising process, thanks to the decreasing noise schedule. These can be better observed in Figure 3 of the rebuttal PDF.

I wonder how the model can work in more complicated floor maps

The 3D-FRONT dataset mostly contains relatively simple (i.e., single room, square, rectangular) floor maps both for training and evaluation. As a result, we haven't encountered any round floor map in our exploration of the test set. Consequently, we manually designed such out-of-distribution floor shapes (round, triangular) and report DeBaRA's generations in Figure 2 of the rebuttal PDF. This demonstrates the robustness of our method to unseen rooms.

Many implementation details are not revealed, e.g., training settings (e.g., lr and iterations) and CFG scales.

Please note that these details are extensively described throughout the submitted appendix. For instance, we can read in Section B.2, L520-522 that we trained our models for 3000 epochs, with a batch size of 128 using the AdamW optimizer and learning rate $10^{-4}$. During training, we use a conditioning dropout rate of 0.2 but didn't find the need to amplify the strength of the conditioning using any classifier-free guidance scale during sampling.
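For readers unfamiliar with conditioning dropout, the sketch below shows one common way to implement it with the quoted rate of 0.2. Dropping the labels of a whole scene at once and the `NULL_CLASS` placeholder are assumptions for illustration; the response does not specify these details.

```python
import random

NULL_CLASS = -1  # hypothetical placeholder id meaning "no class information"


def drop_conditioning(class_ids, drop_rate=0.2, rng=random):
    """With probability `drop_rate`, hide all object classes of a scene.

    Training on both labeled and unlabeled versions lets a single model learn
    class-conditional and unconditional densities.
    """
    if rng.random() < drop_rate:
        return [NULL_CLASS] * len(class_ids)
    return list(class_ids)


# Empirical check of the dropout frequency with a seeded generator.
rng = random.Random(0)
scene = [3, 7, 7, 12]
num_dropped = sum(
    drop_conditioning(scene, rng=rng)[0] == NULL_CLASS for _ in range(10_000)
)
drop_fraction = num_dropped / 10_000
```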

it can be regarded as an extension from some point cloud diffusion models

Thank you, methods leveraging diffusion models to generate point clouds [3] or other geometric representation involving 3D coordinates [4] are a relevant addition to our manuscript's references.

We thank again reviewer Fsdh for their detailed and constructive feedback as well as for their insightful suggestions that will improve the quality of our paper.

[1] DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis, Tang et al., 2024.

[2] Exploring Diffusion Time-steps for Unsupervised Representation Learning, Yue et al., 2024.

[3] Diffusion Probabilistic Models for 3D Point Cloud Generation, Luo et al., 2021.

[4] BrepGen: A B-rep Generative Diffusion Model with Structured Latent Geometry, Xu et al., 2024.

Comment

I sincerely thank the authors for their detailed and well-organized rebuttal. All my concerns are addressed. I would like to increase my rating from 6 to 7. I would suggest the authors revise their paper according to the rebuttal.

Review
Rating: 6

This paper proposes a method for room layout generation given objects and a floor plan using score-based EDM. First, three encoders are used to encode objects, the floor plan and the noise respectively. These encoded latents are then given as input to a noise based scene encoder which decodes into the output latents. The output latents are then finally decoded to their respective categories and the decoded output is pushed to be closer to the input as measured by a proposed semantic-aware Chamfer distance loss. Once trained, the model is able to generate novel layouts and scenes in addition to being able to perform completion, rearrangement and retrieval.

Strengths

  1. Qualitatively, the layouts generated seem far more coherent and spatially aligned than prior work.

  2. Quantitative results provide further evidence that this model performs well.

  3. The paper is well written and easy to follow.

Weaknesses

  1. While using a permutation invariant Chamfer loss is intuitive, I believe it should still be ablated w.r.t just the standard chamfer loss

Questions

N/A

Limitations

Yes

Author Response

We thank reviewer xNz4 for their time and feedback. We address the reported weakness in the following response:

While using a permutation invariant Chamfer loss is intuitive, I believe it should still be ablated w.r.t just the standard chamfer loss

Evaluating the impact of our semantic-aware objective against a standard Chamfer loss is a relevant suggestion. The advantage of our novel formulation is quantitatively verified in Table 1 of the rebuttal PDF. It is also evaluated against a simple MSE objective.

Author Response

Response to all reviewers

We would like to thank reviewers for their time and insightful feedback and are pleased that they recognized our submission to propose a "novel diffusion-based generative model, which is effective in the aimed task" with an "alternative pipeline which adds more functions to the model" (Fsdh), in a way that is "well written and easy to follow" (xNz4). Our experimental evaluations have also been appreciated, as we read that "results generated are high-quality both qualitatively and quantitatively / seem far more coherent and spatially aligned than prior work" (Fsdh, xNz4).

Before addressing reviewers' common concerns and questions, we wish to briefly highlight our main contributions:

  1. a lightweight score-based model trained to learn the class-conditional and unconditional densities of 3D layouts in bounded indoor scenes using a novel 3D spatial objective.
  2. a novel Self Score Evaluation (SSE) procedure to optimally select conditioning inputs from external sources using density estimates provided by the pretrained model.
  3. a flexible sampling method to perform multiple downstream tasks from partial features (e.g., scene completion) and/or intermediate noise levels (e.g., scene re-arrangement, object retrieval).

All these contributions are key to achieving the state-of-the-art performance exhibited by our framework. Our method is also the first to unify the use of a specialized diffusion model and a separately trained LLM in the context of 3D scene synthesis.

1. Ablation study

sfxx, Fsdh, xNz4: Reviewers unanimously suggested that additional ablations of our design choices would further clarify our contributions.

Please note that in Table 2 of our main submission and as described L259, we compare a scene synthesis set-up in which the input object semantics are selected from a set of LLM-generated ones, either randomly (LLM) or by applying SSE (LLM + SSE). Additionally, Table 3 reports the impact of applying SSE on the generation time. These results quantitatively measure the individual impact of SSE in our scene synthesis pipeline.
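The selection step compared here (random choice versus SSE among LLM-generated candidates) can be pictured with the rough sketch below. It assumes that each candidate is scored by an average denoising error over several trials and that the lowest score wins; `denoising_error` is a hypothetical callable standing in for the pretrained model, and the authoritative procedure is Algorithm 1 in Section 3.4 of the paper.

```python
import random


def self_score_select(candidates, denoising_error, num_trials=100, rng=None):
    """Return the candidate conditioning with the lowest average error.

    `denoising_error(candidate, rng)` is a hypothetical stand-in for one
    trial: noise a layout at a random level and measure how well the
    pretrained model denoises it under this candidate conditioning.
    """
    rng = rng or random.Random(0)

    def score(candidate):
        errors = [denoising_error(candidate, rng) for _ in range(num_trials)]
        return sum(errors) / num_trials

    return min(candidates, key=score)


# Toy example: pretend the model is more confident on a plausible object set.
toy_bias = {"bed+nightstand+wardrobe": 0.1, "bed+bed+bed+sofa": 0.6}


def toy_error(candidate, rng):
    return toy_bias[candidate] + rng.uniform(0.0, 0.2)


best = self_score_select(list(toy_bias), toy_error)
```

The random baseline corresponds to replacing the `min` call with a uniform draw from `candidates`, which is the comparison reported as LLM versus LLM + SSE.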

We provide a study to evaluate the role of other individual components in the attached PDF:

  • Fsdh: Our 3D spatial objective v.s. simple MSE (PDF Tab 1.)
  • xNz4: Our 3D spatial objective v.s. standard Chamfer loss (PDF Tab 1.)
  • Fsdh: Use of conditioning dropout during training (PDF Tab 1.)
  • Fsdh, sfxx: Different sampling strategies (DDPM, EDM) (PDF Tab 2.)

2. Additional experimental results

We follow the reviewers' suggestions and provide new results to support the performance and versatility of our method.

  • sfxx: We include quantitative (PDF Tab 3) and qualitative (PDF Fig 1) experimental evaluation against LEGO-Net [1] on scene re-arrangement, i.e., recovering a close clean layout configuration from a messy / perturbed one. We use the authors' implementation [2], in the grad without noise setting (which is the best performing in the original paper), on the 3D-FRONT living rooms test subset and with a perturbation level of 0.25. The results highlight that our method is able to recover more realistic arrangements, while being closer to their initial configurations. This is remarkable as the LEGO-Net baseline has been specifically trained to perform this task.
  • Fsdh: Additional qualitative results on complex floor plans can be observed in PDF Fig 2, further highlighting the robustness of our method and its consideration for the conditioning input.
  • Fsdh: As suggested, we provide additional visualization of the iterative denoising process over time (PDF Fig 3).

3. Implementation details

sfxx, Fsdh: We would like to emphasize that the vast majority of our implementation details are provided in the supplementary materials of our submission, notably in Section A (training parameterization and sampling hyper parameters) and Section B (network architecture, training protocol, baselines (re)implementation and LLM prompting).

Minor details which were omitted are as follows:

  • Our Shared Object Decoder linear layers have respective output dimensions 512, 128 and 8.
  • We use a popular PyTorch implementation [3] of PointNet for our floor plan feature extractor.
  • Our linear learning rate warmup uses a start factor of 0.1 and is active for the first 50 iterations. Then, the cosine annealing schedule is set to reach a minimum learning rate value of $10^{-8}$ after 2200 epochs.
  • We implemented SSE using $T=100$ trials and with noise levels sampled as in training (L459-460).
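The learning rate schedule described in the list above can be sketched in plain Python as two separate factors. The base learning rate of 1e-4 is the value quoted from Section B.2; holding the rate at its minimum after epoch 2200 and treating warmup (in iterations) and decay (in epochs) as independent functions are assumptions of this sketch, since the response does not say how the two schedulers are combined.

```python
import math

BASE_LR = 1e-4        # learning rate quoted from Section B.2
MIN_LR = 1e-8         # minimum value reached by the cosine schedule
WARMUP_ITERS = 50     # warmup length, in iterations
START_FACTOR = 0.1    # warmup starts at 10% of the base rate
COSINE_EPOCHS = 2200  # epoch at which the minimum is reached


def warmup_factor(iteration):
    """Linear ramp from START_FACTOR to 1.0 over the first WARMUP_ITERS iterations."""
    if iteration >= WARMUP_ITERS:
        return 1.0
    return START_FACTOR + (1.0 - START_FACTOR) * iteration / WARMUP_ITERS


def cosine_lr(epoch):
    """Cosine decay from BASE_LR to MIN_LR, held at MIN_LR afterwards (assumed)."""
    t = min(epoch, COSINE_EPOCHS) / COSINE_EPOCHS
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * t))


# Values at a few reference points.
lr_start = BASE_LR * warmup_factor(0)  # first iteration, during warmup
lr_mid = cosine_lr(COSINE_EPOCHS // 2)  # halfway through the decay
lr_end = cosine_lr(3000)                # last training epoch
```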

4. Text body and typos

  • Fsdh: We find the reviewer's proposal to mention potential societal impacts of our method to be an excellent addition. We will also add that robust privacy measures should systematically be implemented alongside our method in order to avoid any unauthorized replication of personal spaces.
  • sfxx: As reported, missing references to Table 3 and Figure 6 in the text body make Section 4.4 appear empty in the submitted paper. We will clarify our manuscript by adding a concise textual introduction and analysis of these results. Notably, we can see in Table 3 that our lightweight architecture is bridging the gap with autoregressive methods in terms of inference efficiency. Figure 6 shows the denoising process of predicted bounding boxes being progressively unshadowed (left). It also shows the variety of predicted layouts by plotting intermediate and final positions (colored and black dots respectively) of objects that are added to a scene over multiple trials (right).
  • sfxx: We fixed the mentioned typo along with other ones identified during post-submission readings of our manuscript.

[1] Lego-net: Learning regular rearrangements of objects in rooms, Wei et al., 2023

[2] https://github.com/QiuhongAnnaWei/LEGO-Net

[3] https://github.com/fxia22/pointnet.pytorch

Final Decision

The original ratings for this paper were borderline accept (xNz4), weak accept (Fsdh), and strong reject (sfxx). All reviewers appreciated the evaluation results. The common concerns were the lack of ablation studies, additional experimental results, and implementation details. Additionally, reviewer sfxx uniquely raised concerns about the novelty and readability, which were contrary to the opinions of reviewers Fsdh and xNz4. After the rebuttal, all the previous concerns were well addressed. All reviewers increased their ratings to weak accept (xNz4), accept (Fsdh), and reject (sfxx). Reviewer sfxx once again uniquely raised additional concerns that were not mentioned in their initial review or prompted by the authors' rebuttal. The authors still adequately addressed these additional concerns in their response, but reviewer sfxx did not have the opportunity to increase their rating. For the reasons above, the AC recommends accepting this paper on the condition that all revisions from the first rebuttal are included in the camera-ready version.