Thank you for your valuable comments and suggestions. We will revise the paper accordingly and truly appreciate your recognition and encouragement.

Question 1: PolyRoom better on SceneCAD and cross-dataset generalization precision?

Answer:

PolyRoom builds on RoomFormer and leverages a segmentation module (Mask2Former) to generate initial query contours, making it a non-end-to-end method. Its strong performance on SceneCAD, a single-room dataset, largely stems from the segmentation module, which already provides high-quality priors. However, this reliance limits its ability to encode multi-room layouts or perform global reasoning, often leading to room-level conflicts. While PolyRoom shows higher precision due to its dependence on segmentation quality, our method achieves better results on key metrics (Room IoU, Corner Recall, Angle Recall, and Angle F1), highlighting stronger generalization and structural robustness.

Question 2: How sensitive is the method to distance threshold selection for Type IV and others?

Answer:

The edge-to-polygon conversion is empirically stable with respect to the threshold. Type I intersections account for 97.06% of cases, while Types II–IV cover 1.17%, 0.47%, and 1.29%, respectively. As shown in the following table, varying the threshold from 0.0001× to 0.2× the edge length leaves the room metrics unchanged and shifts the corner and angle metrics by at most 0.19 points, confirming robustness to the epsilon value.

eps	Room Prec.	Room Rec.	Room F1	Corner Prec.	Corner Rec.	Corner F1	Angle Prec.	Angle Rec.	Angle F1
0.2	99.59	98.69	99.14	95.04	88.57	91.70	92.57	86.40	89.39
0.1	99.59	98.69	99.14	94.99	88.55	91.66	92.51	86.37	89.34
0.05	99.59	98.69	99.14	94.96	88.55	91.65	92.46	86.34	89.30
0.01	99.59	98.69	99.14	94.95	88.58	91.66	92.40	86.34	89.27
0.0001	99.59	98.69	99.14	94.94	88.60	91.67	92.38	86.34	89.26
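
For readers who want to check the sensitivity directly, the following minimal sketch computes the spread (maximum minus minimum) of each metric across the five thresholds. The numbers are copied from the table above; the names `METRICS`, `ROWS`, and `metric_spread` are ours for illustration and are not part of the CAGE code.

```python
# Metric names in the same column order as the table above.
METRICS = [
    "Room Prec.", "Room Rec.", "Room F1",
    "Corner Prec.", "Corner Rec.", "Corner F1",
    "Angle Prec.", "Angle Rec.", "Angle F1",
]

# eps -> metric values, copied verbatim from the table.
ROWS = {
    0.2:    [99.59, 98.69, 99.14, 95.04, 88.57, 91.70, 92.57, 86.40, 89.39],
    0.1:    [99.59, 98.69, 99.14, 94.99, 88.55, 91.66, 92.51, 86.37, 89.34],
    0.05:   [99.59, 98.69, 99.14, 94.96, 88.55, 91.65, 92.46, 86.34, 89.30],
    0.01:   [99.59, 98.69, 99.14, 94.95, 88.58, 91.66, 92.40, 86.34, 89.27],
    0.0001: [99.59, 98.69, 99.14, 94.94, 88.60, 91.67, 92.38, 86.34, 89.26],
}


def metric_spread(rows, names):
    """Return {metric name: max - min across all thresholds}, rounded to 2 decimals."""
    spreads = {}
    for i, name in enumerate(names):
        column = [values[i] for values in rows.values()]
        spreads[name] = round(max(column) - min(column), 2)
    return spreads


if __name__ == "__main__":
    for name, spread in metric_spread(ROWS, METRICS).items():
        print(f"{name:13s} spread = {spread:.2f} points")
```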

Question 3: Any intuition as to why SwinTransformer degrades in performance when used in isolation?

Answer:

In our experiments, SwinTransformer captures fine details but is more sensitive to noise, often resulting in irregular or self-intersecting polygons when used directly in the RoomFormer framework. In contrast, our method better leverages Swin’s global reasoning through the edge-based representation and dual-query strategy, enhancing robustness and yielding more stable and accurate layouts.

Question 4: Extremely limited runtime analysis – does CAGE have the same speed as RoomFormer? What does "lightweight architecture and no need for additional refinement" mean? [L261-262]

Answer:

CAGE has 40.96M parameters with a ResNet-50 backbone (vs. RoomFormer’s 40.60M), converges stably (650 epochs on Structured3D, 400 on SceneCAD), and achieves fast inference (~0.01s/image), matching RoomFormer and outperforming other baselines in speed. We will include a detailed runtime analysis in the revised version.
The phrase “lightweight architecture and no need for additional refinement” refers to our advantage over all other methods except RoomFormer. CAGE only performs a simple edge-to-polygon conversion, making it faster than models requiring complex refinement or optimization.

Question 5: Since the edge-based representation models room layout as fixed-length sequence, how does this scale with complexity of room layout? And, does this affect performance, both in terms of speed and accuracy?

Answer:

We use a configuration of 800 tokens (40 polygons × 20 edges), which is sufficient for most residential layouts. For example, Structured3D typically contains fewer than 10 rooms per floor, and SceneCAD includes only one room per case. As demonstrated in case 3410 (Fig. 10, supplementary), our model handles complex multi-room layouts effectively.
Our method is inherently scalable and built on a general framework. The primary constraint is token length, which is bounded by GPU memory. Scaling is feasible by increasing the token count, though it comes with quadratic complexity (O(n²) in both runtime and memory), as is typical for transformer-based models.
For very large scenes, our approach can be naturally extended by tiling the input into smaller subregions and merging the outputs, a widely used and effective strategy in applications such as cadastral mapping and topographic reconstruction.
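
To make the tiling idea concrete, here is a minimal sketch of how overlapping tile windows could be generated for a large input. It is an illustration under our own assumptions, not the implementation we would necessarily ship: the function names, the tile size of 512, and the overlap of 64 are hypothetical, and the merging of per-tile predictions is not shown.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of windows of size `tile` that cover [0, length).

    Consecutive windows share at least `overlap` pixels; the last window is
    placed flush with the border so that nothing is left uncovered.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]  # a single window already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def tile_boxes(width, height, tile=512, overlap=64):
    """Return (x0, y0, x1, y1) boxes in row-major order covering the input."""
    boxes = []
    for y0 in tile_starts(height, tile, overlap):
        for x0 in tile_starts(width, tile, overlap):
            boxes.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return boxes


if __name__ == "__main__":
    for box in tile_boxes(1000, 600):
        print(box)
```

Each box would be processed independently by the model, and polygons falling in the overlap regions would then need to be reconciled in a separate merging step.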

Number of Tokens	Params (M)	FLOPs (G)	Inference time (ms/image)	10*GPU (1 Batch)
1200	41.1	26.4	4.7	12.6
1400	41.1	29.7	5.5	15.9
1600	41.2	33.2	6.1	19.8
1800	41.2	37.0	7.2	24.1
2000	41.3	41.1	8.1	28.8
2200	41.3	45.4	9.1	34.1
2400	41.4	50.0	10.2	39.8
2600	41.4	54.9	11.5	46.0
2800	41.5	60.0	12.8	52.6
3000	41.5	65.4	14.2	59.8
3200	41.6	71.0	15.6	67.4

Question 6: Why is there extreme variability in different checkpoints within the same training run? Is this due to convergence related issues of the model?

Answer:

We observed some variability across checkpoints, likely due to stochastic elements such as dropout. However, these fluctuations are minor, and overall convergence is stable. We repeatedly verified the experiments, and the results reported in the paper are consistent and robust across runs.

Question 7: How does multi-room floor-plan estimation work? Some visual examples would be helpful.

Answer:

We adopt a two-level token structure in which tokens 0 to m−1 represent the first polygon, tokens m to 2m−1 the second, and so on, where m is the number of edge tokens per polygon. This allows a structured and efficient multi-room representation. We will include additional visual examples to better illustrate this in the revised version.
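
As a small illustration of this layout, the sketch below maps a flat token index to a (polygon, edge) slot and back. It assumes zero-based indexing and that each polygon occupies a contiguous block of exactly m tokens, using the 40 polygons × 20 edges configuration mentioned in our answer to Question 5; the function names are hypothetical and chosen for this example only.

```python
NUM_POLYGONS = 40        # polygon slots per floor plan (configuration from Question 5)
EDGES_PER_POLYGON = 20   # m: edge tokens reserved for each polygon


def token_to_slot(token_index, edges_per_polygon=EDGES_PER_POLYGON):
    """Map a flat token index to (polygon index, edge index within that polygon)."""
    return divmod(token_index, edges_per_polygon)


def polygon_token_range(polygon_index, edges_per_polygon=EDGES_PER_POLYGON):
    """Return the contiguous range of flat token indices owned by one polygon."""
    start = polygon_index * edges_per_polygon
    return range(start, start + edges_per_polygon)


if __name__ == "__main__":
    total_tokens = NUM_POLYGONS * EDGES_PER_POLYGON
    print("total tokens:", total_tokens)
    print("token 20 ->", token_to_slot(20))
    print("polygon 1 owns tokens", list(polygon_token_range(1)))
```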

Question 8: Suggestion: the authors can incorporate training on synthetic datasets such as Aria Synthetic Environments (ASE) to improve generalization precision.

Answer:

Thank you for the suggestion. ASE’s large-scale synthetic dataset (100K scenes, ~28× larger than Structured3D) is indeed valuable. We are actively collecting diverse real and synthetic scenes, including ASE, to further enhance generalization in future extensions.

Question 9: Typos and Mis-links:

Answer:

Thank you for pointing these out. We will correct all typos and fix the broken references in the revised version.