Context and Geometry Aware Voxel Transformer for Semantic Scene Completion
Abstract
Reviews and Discussion
This paper proposes CGFormer for semantic scene completion. It generates distinct voxel queries for different input images instead of simply predefining a set of trainable parameters. Deformable cross-attention is extended to 3D pixel space, avoiding sampling the same features for different projected points. The method further enhances the 3D volume from both local and global perspectives, and incorporates stereo depth to improve the accuracy of the depth probability via a depth refinement strategy.
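For orientation, a minimal runnable PyTorch-style sketch of the data flow summarized above is given below. All class, module, and tensor names (`CGFormerSketch`, `query_gen`, `local_branch`, `global_branch`, the bin counts and grid sizes) are illustrative placeholders with heavily simplified stand-in layers, not the authors' actual implementation.

```python
# Illustrative sketch only: real CGFormer uses a 2D backbone + FPN, a context-aware
# query generator, 3D deformable attention, and a TPV branch; here each is a cheap stand-in.
import torch
import torch.nn as nn

class CGFormerSketch(nn.Module):
    def __init__(self, c=32, vox=(16, 16, 2)):
        super().__init__()
        self.vox = vox
        self.img_encoder = nn.Conv2d(3, c, 3, stride=4, padding=1)   # stands in for the 2D backbone + FPN
        self.depth_head = nn.Conv2d(c, 8, 1)                         # per-pixel depth distribution (8 bins here)
        self.query_gen = nn.Linear(c, c)                             # stands in for the context-aware query generator
        self.local_branch = nn.Conv3d(c, c, 3, padding=1)            # voxel branch (local detail)
        self.global_branch = nn.Linear(c, c)                         # TPV branch (global context), heavily simplified
        self.head = nn.Conv3d(c, 20, 1)                              # per-voxel semantic logits

    def forward(self, image):
        feat = self.img_encoder(image)                        # (B, C, H, W) 2D context features
        depth_prob = self.depth_head(feat).softmax(dim=1)     # depth distribution used by the geometry-aware parts
        b, c, _, _ = feat.shape
        x, y, z = self.vox
        # Context-aware voxel queries: derived from the image instead of a fixed nn.Parameter.
        pooled = feat.flatten(2).mean(-1)                     # crude global pooling as a placeholder for view transform
        queries = self.query_gen(pooled)[:, :, None, None, None].expand(b, c, x, y, z)
        vol = self.local_branch(queries.contiguous())         # local refinement of the 3D volume
        vol = vol + self.global_branch(vol.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        return self.head(vol), depth_prob

logits, depth_prob = CGFormerSketch()(torch.randn(1, 3, 64, 256))
print(logits.shape, depth_prob.shape)
```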
Strengths
- The code is submitted in the supplementary material, and its guideline is detailed and easy to follow.
- The proposed blocks bring notable performance gains, backed by elaborate experiments.
- CGFormer surpasses previous methods on both the SemanticKITTI and SSCBench-KITTI360 benchmarks.
- The paper is well-organized and the motivation is clear.
Weaknesses
- EfficientNetB7 contains many more parameters than the ResNet50 employed in previous coarse-to-fine methods (e.g., VoxFormer, Symphonize). Comparing performance under the same setting would be fairer.
- There is a text error in Figure 1.
- Although the paper presents superior results and visualizations, it would be better to provide several failure cases.
Questions
Please see the weaknesses above.
Limitations
In this paper, the limitations have been well discussed. The accuracy on most of the categories is unsatisfactory. Furthermore, there is a need to explore designing depth estimation networks under multi-view scenarios to extend the geometry-aware view transformation to these scenes. The method is worth further exploration.
Q1. Parameters and performance comparison between CGFormer and Symphonize.
To compare the performance of CGFormer with Symphonize under a comparable number of parameters, we revisited the design of CGFormer and replaced the EfficientNetB7 with ResNet50 and the Swin blocks in the TPV branch with more lightweight residual blocks. The results on the SemanticKITTI validation set are summarized in the table below (a small parameter-counting sketch for the two image backbones follows the table). As shown, CGFormer is robust across different backbone networks and still outperforms Symphonize with a comparable number of parameters, highlighting its potential and robustness.
| Model | IoU↑ | mIoU↑ | Parameters (M)↓ | Training Memory (M)↓ |
|---|---|---|---|---|
| CGFormer (EfficientNetB7, Swin Block) | 45.99 | 16.87 | 122.42 | 19330 |
| CGFormer (ResNet50, Swin Block) | 45.99 | 16.79 | 80.46 | 19558 |
| CGFormer (ResNet50, ResBlock) | 45.86 | 16.85 | 54.8 | 18726 |
| Symphonize | 41.92 | 14.89 | 59.31 | 17757 |
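For readers who want an independent sense of the raw parameter gap between the two image backbones compared above, the snippet below counts parameters for stock torchvision implementations (assuming torchvision ≥ 0.13). It is only a rough sanity check, not the paper's exact backbone configuration, since the totals reported in the table also include the rest of the network.

```python
# Count parameters of the two candidate image backbones using stock torchvision models.
import torchvision.models as tvm

def n_params_m(model):
    # Total number of parameters, in millions.
    return sum(p.numel() for p in model.parameters()) / 1e6

print(f"EfficientNet-B7: {n_params_m(tvm.efficientnet_b7(weights=None)):.1f}M")
print(f"ResNet-50:       {n_params_m(tvm.resnet50(weights=None)):.1f}M")
```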
Q2. Text error in Figure 1.
Thanks for pointing out this mistake. We will correct it in the revised manuscript.
Q3. Failure cases.
Figure 2 in the uploaded PDF shows two examples of failure cases, where it is difficult to distinguish between adjacent cars. As seen in the RGB images, these objects are located in distant regions, making them challenging to differentiate. In the future, we plan to explore ways to improve the performance of our method in these distant regions.
Thanks for the detailed rebuttal, which has addressed my concerns.
The parameter and performance comparison is convincing and can be added in the revised version.
As a result, I am willing to raise the score from Weak Accept to Accept.
Dear Reviewer Lm1y,
Thank you for raising your score. Your suggestions are valuable to improve the quality of our paper. We will update our manuscript in the revision.
Authors of Paper ID 3532
This manuscript studies the problem of road-scene semantic scene completion from RGB images. The architecture and benchmarking frameworks follow widely accepted practice in the literature. The innovations proposed here are better queries that are informed by the geometry and semantics of the input scene, a cross-attention variant that leverages rich information from different depth planes in the cost volume, and better fusion of information from different representations. The whole architecture achieves state-of-the-art performance on public benchmarks.
Strengths
(1+) The paper aims to improve the query quality and solve the depth ambiguity and context-independence problems, which are important and well-motivated for the SSC task.
(2+) The paper proposes a novel method, CGFormer, which introduces a context-aware query generator to produce context-dependent queries and a novel Depth Net utilizing stereo and monocular depth for effective refinement.
(3+) Experiments on SemanticKITTI (Table 1) and SSCBench-KITTI360 (Table 2) show that CGFormer outperforms prior methods. The motivation and the proposed components are validated by detailed ablation studies (Tables 3, 4, and 5).
(4+) The paper is mostly clearly written and the model architecture is easy to follow.
Weaknesses
(1-) The paper notes that the context-dependent queries tend to aggregate information from points within the region of interest. An ablation on the number of cross-attention and self-attention layers should be provided.
(2-) In alignment with previous methods, it would be better to demonstrate the performance when using only a monocular image as input (see Table 3 in VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion).
(3-) The visualizations (Figures 4 and 5) show some common scenes (it would be even better if the corresponding image views could be added). Could you show more performance improvements for objects like the bicycle in Figure 3 (due to capturing the regions of interest)?
Questions
My questions are incorporated into the weaknesses section.
Limitations
The manuscript is limited in scholarship and can be improved by incorporating semantic scene completion references like [A] and [B].
[A] Efficient semantic scene completion network with spatial group convolution, ECCV 2018
[B] Lode: Locally conditioned eikonal implicit scene completion from sparse lidar, ICRA 2023
W1. Ablation on the number of the cross-attention and self-attention layers.
We present the results of different configurations of cross-attention and self-attention layers on the SemanticKITTI validation set in the table below. As shown in the table, the performance improves gradually as the number of attention layers increases; however, once the number exceeds a certain threshold, the performance tends to stabilize. In alignment with previous methods, we set the number of cross-attention layers to 3 and the number of self-attention layers to 2 in the manuscript. The table also includes the results of the model without the CAQG module. Notably, using just one self-attention and one cross-attention layer with the CAQG module yields better performance than using two self-attention and three cross-attention layers without the CAQG module. This demonstrates the effectiveness of our proposed context and geometry-aware voxel transformer.
| Model | IoU↑ | mIoU↑ | Training Memory (M)↓ |
|---|---|---|---|
| 1 self, 1 cross | 44.95 | 16.24 | 16602 |
| 1 self, 2 cross | 45.12 | 16.26 | 16836 |
| 1 self, 3 cross | 45.22 | 16.40 | 16920 |
| 2 self, 3 cross | 45.99 | 16.87 | 19930 |
| 2 self, 4 cross | 45.86 | 16.74 | 19601 |
| 3 self, 4 cross | 45.76 | 16.77 | 22364 |
| w/o CAQG | 44.88 | 15.84 | 18959 |
W2. Performance using only monocular image.
Following VoxFormer and Symphonize, we replace the depth estimation network with AdaBins and present the results on the SemanticKITTI validation set in the table below. To better evaluate the performance of our CGFormer, we also include the results of VoxFormer, Symphonize, and OccFormer. Compared to the stereo-based methods when using only a monocular image (VoxFormer, Symphonize), CGFormer achieves superior performance in terms of both IoU and mIoU. Furthermore, our method also surpasses OccFormer, the state-of-the-art monocular method.
| Model | IoU↑ | mIoU↑ |
|---|---|---|
| CGFormer (AdaBins) | 41.82 | 14.06 |
| VoxFormer-S (AdaBins) | 38.68 | 10.67 |
| VoxFormer-T (AdaBins) | 38.08 | 11.27 |
| Symphonize (AdaBins) | 38.37 | 12.20 |
| OccFormer | 36.50 | 13.46 |
W3. More visualizations.
Figure R1 in the uploaded PDF displays two visualization examples for objects with finer details. As shown in the RGB images, the sampling points of the context-dependent queries are typically situated within the region of interest. This allows CGFormer to capture much clearer details than other methods, highlighting the effectiveness of our proposed module. We will also provide the corresponding image views in the revised manuscript, as done in this figure.
W4. The manuscript is limited in scholarship and can be improved by incorporating semantic scene completion references.
Thanks for your suggestion. We will incorporate these semantic scene completion references in the revised manuscript.
I'd like to thank the authors for the thoughtful rebuttal. I am glad to see the performance using only a monocular image, which demonstrates that CGFormer still performs best with a monocular image as input. It dispelled my doubts that there might be an unfair comparison with other methods. Furthermore, Fig. R1 shows two convincing visualization examples of objects with finer details. Based on the authors' responses and the comments of other reviewers, I am willing to change my rating to weak accept.
Dear Reviewer wDiN,
Thank you again for your review. We are glad that our response has addressed the questions you raised.
Authors of Paper ID 3532
The authors present Context and Geometry Aware Voxel Transformer (CGFormer) for the semantic scene completion task. Their method extends the baseline VoxFormer with a Context-Aware Query Generator (CAQG), 3D deformable attention layers, a depth refinement block, and a dynamic fusion of voxel and TPV features. Experiments on two datasets demonstrated that CGFormer can achieve state-of-the-art mIoU performance.
Strengths
- Solid experiments: two datasets are evaluated, and the results tables are presented clearly alongside many baseline methods.
- Detailed ablation studies for every designed module.
- The figures are well-organized and easy to follow, which is greatly appreciated.
Weaknesses
- The biggest concern to the reviewer is that the empirical results are not convincing enough to support the main claims of the proposed method. Specifically,
a) The authors claim that context-aware queries are a major novelty that helps the model perform better in its regions of interest. However, the empirical results show that the performance is superior mainly in categories with larger areas, such as roads and sidewalks, while not as good in categories with finer details, such as trucks, persons, and bicycles. The fact that CGFormer is comparable or worse than the baselines with context-independent queries on 50% of the categories makes the claims less compelling.
b) Besides a), the qualitative difference in Figure 3 is not illustrative. According to the first two columns, the context-dependent query should sample more at locations with details such as cars and bicycles; therefore, in the 3rd column, shouldn't the context-aware queries be focused more on the white car? It is not clear why the points in (b) are irrelevant.
c) The model is named CGFormer; while how the context-aware part contributes has been discussed in the paper, how the performance reflects the geometry-aware part is barely touched on.
- Some minor issues with the writing. The proposed CAQG seems to be a key component of the model. However, its details are glossed over in Section 3.2, lines 137--146. Please consider elaborating on this part to emphasize the novelty.
Questions
- Details of the CAQG should be put in Section 3 since it is a key component of the proposed method.
- Table 3 shows that directly adding TPV slightly increases mIoU while hurting the IoU. This is interesting; can the authors provide their thoughts and analysis?
- In Table 3, LB-only, DF, and adding each of the TPV planes are ablated. Is it possible to also ablate the TPV branch only? Conceptually this is the only missing setting.
Limitations
The authors have discussed the limitations in their appendix.
W1 (a). Performance gain of the context-aware queries.
Compared to methods that do not use temporal inputs or rely on much larger image backbone networks, CGFormer achieves the highest performance in 12 out of 19 categories on both the test and validation sets, as shown in Tables R2 and R3 of the uploaded PDF, encompassing both large-area and fine-detail objects, while the other methods excel in at most 3 to 4 categories. Although CGFormer shows limitations in several categories such as person and truck, it excels in most categories, including car, traffic sign, trunk, and pole. Indeed, CGFormer attains the top performance on the bicycle category under the same conditions.
To further validate the effectiveness of the proposed context-aware queries, we remove the CAQG module while keeping all other components unchanged, and compare the performance of the two models across various categories. The results are presented in Table R3 of the uploaded PDF, with categories where performance improves after integrating the CAQG module highlighted in red. The CAQG module enhances performance in most categories (11 out of 19), including both large-area classes (sidewalk, terrain) and small objects (truck, bicycle, person), while having only minor effects on the others. These results demonstrate the efficacy of the proposed module.
W1 (b). More explanation of Figure 3 in the manuscript.
Thanks for your suggestion to help illustrate our motivation. We will replace Figure 3 with clearer examples in the revised manuscript, as shown in Figure R1 of the uploaded PDF. Here is a more detailed explanation of the figure. The yellow points represent the projected locations of the visible voxel queries used in the deformable cross-attention, mapped from 3D space onto the 2D image plane using the camera's intrinsic and extrinsic matrices; this projection is where they aggregate information from the image features. For a voxel projected onto the image plane, the query should tend to aggregate information relevant to the semantics of the position where it lands. As shown in the left and middle columns, the yellow reference points fall onto the car and the bicycle, and the sampling points of the context-aware query are mainly distributed within the regions of these two objects. In the right column, the query point projects onto the building, and the context-aware query points are primarily located on the building, whereas the context-independent query points are scattered in the region of the car.
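As a companion to this explanation, here is a minimal, hedged sketch (not the authors' code; the coordinate conventions, frame names, and shapes are assumptions) of how 3D voxel query centers can be projected onto the 2D image plane with the camera intrinsics and extrinsics before aggregating image features.

```python
# Project 3D voxel centers into pixel coordinates with a pinhole model: X_cam = R X + t, x = K X_cam.
import torch

def project_voxels_to_image(voxel_xyz, K, R, t):
    """voxel_xyz: (N, 3) voxel centers in the world/LiDAR frame; K: (3, 3); R: (3, 3); t: (3,)."""
    cam = voxel_xyz @ R.T + t                 # world -> camera coordinates
    pix = cam @ K.T                           # camera -> homogeneous pixel coordinates
    depth = pix[:, 2:3]
    uv = pix[:, :2] / depth.clamp(min=1e-6)   # perspective division gives (u, v)
    visible = depth.squeeze(1) > 0            # only voxels in front of the camera can query image features
    return uv, depth.squeeze(1), visible

# Toy call with identity intrinsics/extrinsics, just to exercise the shapes.
uv, depth, visible = project_voxels_to_image(
    torch.rand(8, 3) * 10, torch.eye(3), torch.eye(3), torch.zeros(3))
print(uv.shape, visible.sum().item())
```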
W1 (c). More explanation for "geometry-aware".
The modules derived from depth information are referred to as "geometry-aware" in the title. We provide a more detailed discussion as follows.
(1) As mentioned in the manuscript (lines 49-51), visible queries are projected onto the image plane to aggregate information via deformable cross-attention. However, when projecting the 3D points onto the image plane, many points may end up at the same 2D position with similar sampling points on the 2D feature map, causing a crucial depth ambiguity problem. To address this issue, we extend deformable cross-attention from 2D to 3D pixel space, which makes it possible to differentiate points with similar image coordinates by their depth coordinates, as illustrated in Figure 1 of the manuscript (see also the toy sketch after this list).
(2) Additionally, we introduce depth refinement to improve the accuracy of the estimated depth probability.
Model (a) in Table 3 provides ablation results for the 3D deformable cross-attention, and Table 5 includes ablation results for depth refinement.
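To make the depth-ambiguity point in (1) concrete, the following toy snippet (illustrative only; the feature tensors, shapes, and normalized coordinates are made up and this is not the paper's attention module) shows that two voxels projecting to the same pixel receive identical features under 2D sampling, but distinct features once a depth coordinate is added and a depth-aware 3D feature volume is sampled.

```python
# Two voxels on the same camera ray: identical under 2D sampling, separable under 3D sampling.
import torch
import torch.nn.functional as F

feat_2d = torch.randn(1, 16, 24, 80)       # (B, C, H, W) image features
feat_3d = torch.randn(1, 16, 8, 24, 80)    # (B, C, D, H, W) depth-aware features (e.g., built from the depth distribution)

uv = torch.tensor([[0.25, -0.5], [0.25, -0.5]])   # same normalized (u, v) in [-1, 1] for both voxels
d = torch.tensor([[-0.8], [0.6]])                 # but different normalized depths

grid_2d = uv.view(1, 2, 1, 2)                              # (B, N, 1, 2) grid for 2D grid_sample
grid_3d = torch.cat([uv, d], dim=1).view(1, 2, 1, 1, 3)    # (B, N, 1, 1, 3) = (x, y, z) grid for 3D grid_sample

s2 = F.grid_sample(feat_2d, grid_2d, align_corners=False)  # 2D samples
s3 = F.grid_sample(feat_3d, grid_3d, align_corners=False)  # 3D samples
print(torch.allclose(s2[..., 0, 0], s2[..., 1, 0]))         # True: 2D sampling cannot tell the voxels apart
print(torch.allclose(s3[..., 0, 0, 0], s3[..., 1, 0, 0]))   # False (almost surely): depth coordinate separates them
```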
W2 & Q1. More explanation for the CAQG module.
Thanks for pointing this out; here is a revised version of lines 141-144 of the manuscript.
To elaborate, the context feature and the depth probability are first derived from the 2D image feature. Taking these two as inputs, the query generator maps them from 2D image space to 3D space to generate the context-aware voxel queries, whose spatial resolution matches that of the 3D volume. The query generator can be any explicit view transformation approach (e.g., voxel pooling, FLoSP, CaDDN); Table 4 provides ablation experiments for the different choices.
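As an illustration of one possible instantiation of such an explicit view transformation, the sketch below follows an LSS-style lift-and-splat scheme with made-up shapes and a dummy voxel index map; it is not the exact design used in the paper, which may use any of the alternatives listed above.

```python
# LSS-style "lift": weight the per-pixel context feature by its depth distribution, then
# "splat" the resulting frustum into the voxel grid to obtain context-aware voxel queries.
import torch

B, C, D, H, W = 1, 32, 8, 24, 80
context = torch.randn(B, C, H, W)                       # 2D context features
depth_prob = torch.randn(B, D, H, W).softmax(dim=1)     # per-pixel depth distribution

# Lift: outer product over the depth dimension -> frustum of shape (B, C, D, H, W).
frustum = context.unsqueeze(2) * depth_prob.unsqueeze(1)

# Splat (placeholder): a real implementation computes each frustum cell's voxel index from
# calibrated ray geometry (voxel pooling); here a random flat index only shows the bookkeeping.
X, Y, Z = 16, 16, 2
flat_idx = torch.randint(0, X * Y * Z, (B, D, H, W))    # stand-in for the computed voxel indices
queries = torch.zeros(B, C, X * Y * Z)
queries.scatter_add_(2, flat_idx.view(B, 1, -1).expand(B, C, -1), frustum.view(B, C, -1))
queries = queries.view(B, C, X, Y, Z)                   # context-aware voxel queries
print(queries.shape)
```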
Q2. More analyses for the TPV branch.
Thanks for your constructive question, which led us to additional insights regarding the TPV branch. We reanalyze the ablation experiments (models d, e, f, g in Table 4 of the manuscript) and find that the performances are quite similar. To determine whether these performance gains are statistically significant across multiple training seeds, we conduct further experiments. As shown in Table R5 of the uploaded PDF, combining the outputs from both branches does enhance performance, but with some variability. We speculate that the TPV branch primarily focuses on global information, making it challenging to capture fine-grained voxel details; in contrast, the local branch enhances the fine-grained structural details of the 3D voxels. Simple addition treats all features as equally important, which may introduce side effects. Instead, weighting the more important features from each branch can enhance overall performance by leveraging their distinct strengths.
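A minimal sketch of the kind of weighted fusion argued for here is given below: a small gating network predicts per-voxel weights for the local (voxel) and global (TPV-derived) features instead of adding them with equal weight. The module name, shapes, and gate design are illustrative assumptions, not the paper's exact fusion block.

```python
# Per-voxel convex combination of the two branches, with weights predicted from both features.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, c):
        super().__init__()
        # 1x1x1 conv predicts two weights per voxel; softmax makes them sum to 1.
        self.gate = nn.Sequential(nn.Conv3d(2 * c, 2, kernel_size=1), nn.Softmax(dim=1))

    def forward(self, local_feat, global_feat):
        w = self.gate(torch.cat([local_feat, global_feat], dim=1))   # (B, 2, X, Y, Z)
        return w[:, :1] * local_feat + w[:, 1:] * global_feat        # weighted fusion instead of plain addition

fused = GatedFusion(32)(torch.randn(1, 32, 16, 16, 2), torch.randn(1, 32, 16, 16, 2))
print(fused.shape)
```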
Q3. Ablation experiments for TPV branch.
Thanks for your suggestion to complete our experiments. Taking model (b) of Table 3 in the manuscript as the baseline, we present the ablation results for the TPV branch. As shown in Table R4 of the uploaded PDF, enhancing the features with the TPV branch boosts the performance in terms of IoU, with only a minor improvement in mIoU, consistent with the assumption that the TPV branch focuses more on global information.
Thank you for the detailed rebuttal.
I appreciate the explanations to my questions, the added experiments tables, the added ablation study, and the analysis on the TPV branch. Those have lifted most of my concerns.
Thank you for updating Figure 3 with a more detailed analysis. I find it improved over the previous version though the examples have been changed, understandably for better illustration.
Thank you for highlighting the performance gain in Table R3. While there are improvements in many categories, I still find the performance drop in others a bit concerning. But given the improved average IoU and mIoU, I think it should be OK. However, in future revisions, I encourage the authors to provide some failure cases in those categories to give more insights into the advantages and disadvantages of the proposed module.
I have also read the other reviews and separate rebuttals. All in all, I am willing to raise my rating to borderline accept.
Dear Reviewer ryRH,
Thank you for raising your score. Your suggestions are valuable for improving the quality of our paper. We will update the overall manuscript and provide more failure cases, following the examples provided in the uploaded PDF.
Authors of Paper ID 3532
Dear Reviewer ryRH,
Thanks for your comments, which are valuable for improving the overall quality of this manuscript.
To address your major concerns, we compared the performance of CGFormer with other methods using consistent inputs and similar image backbones. We conducted ablation experiments to evaluate the performance gains of context-aware queries across different classes. The uploaded PDF includes a more detailed analysis of the figures and additional illustrative examples. We have also provided a thorough explanation of the origin of the term 'geometry aware' as mentioned in the title. Additionally, we have included ablation studies and analyses for the TPV branch.
Could we kindly ask whether our responses have addressed your concerns and whether you have any new questions? Thanks for your time and effort.
Authors of Paper ID 3532
This paper proposes a state-of-the-art Semantic Scene Completion method called CGFormer. It introduces a Context and Geometry Aware Voxel Transformer that dynamically generates queries tailored to individual input images, addressing depth ambiguity through a 3D deformable cross-attention mechanism. The network leverages a multi-scale 3D representation by combining voxel and tri-perspective view to enhance both local semantic and global geometric information. State-of-the-art results are achieved in key benchmark tests.
Strengths
- This paper addresses the limitations of existing methods by utilizing a combination of voxel and tri-perspective view representations to capture both local details and global structures.
- The idea that different input images have unique contextual features is very interesting.
- The experiments are adequate with complete qualitative analysis, ablation experiments and visualization results.
- In the SemanticKITTI test, the proposed method improves most of the metrics.
Weaknesses
- On the SSCBench-KITTI360 test set, CGFormer does not improve much over Symphonize: Symphonize scores higher on almost half of the metrics and appears to have fewer parameters.
- The Context and Geometry Aware Voxel Transformer seems to be a redesign of Symphonize that adds Deformable Self-Attention after Deformable Cross-Attention; more explanation of why this is done should be added.
- The context net is not described in the main text or in the supplementary material.
- The table font is too small to read.
Questions
Refer to the weaknesses.
Limitations
The limitations of this paper are primarily in the accuracy of certain categories (e.g., pedestrians and bicyclists), which suggests that there is room for improvement in these areas, as outlined by the authors in Section 5.
W1. Parameter and performance comparison with Symphonize.
| Model | IoU↑ | mIoU↑ | Parameters (M) ↓ | Training Memory (M) ↓ |
|---|---|---|---|---|
| EfficientNetB7, Swin Block | 45.99 | 16.87 | 122.42 | 19330 |
| ResNet50, Swin Block | 45.99 | 16.79 | 80.46 | 19558 |
| ResNet50, ResBlock | 45.86 | 16.85 | 54.8 | 18726 |
| Symphonize | 41.92 | 14.89 | 59.31 | 17757 |
Thanks for your valuable suggestion. We revisit the architecture of CGFormer and discover some interesting new results. To compare its performance with Symphonize under a comparable number of parameters, we analyze the components of CGFormer and find that replacing EfficientNetB7, used as the image backbone, and the Swin blocks, used in the TPV branch backbone, with the more lightweight ResNet50 and residual blocks, respectively, significantly reduces the number of parameters of our network. The results on the SemanticKITTI validation set are presented in the table above. Compared to the original architecture, CGFormer maintains stable performance regardless of the backbone networks used for the image encoder and TPV branch encoder, underscoring its effectiveness, robustness, and potential.
Compared to Symphonize, lightweight CGFormer achieves an IoU of 45.86 and mIoU of 16.85, significantly surpassing Symphonize's IoU of 41.92 and mIoU of 14.89 on the SemanticKITTI validation set with a comparable number of parameters.
We retrain this lightweight model on the KITTI-360 dataset, with the detailed results of each class in Table R1 of the uploaded PDF. The lightweight version of CGFormer achieves an IoU of 47.78 and mIoU of 20.03, demonstrating a substantial improvement of 1.45 mIoU and 3.5 IoU over Symphonize. For specific categories, the original CGFormer architecture outperforms Symphonize in 9 out of 18 classes, and with the lighter backbones, it surpasses Symphonize in 10 out of 18 classes. These results further highlight the superiority of our approach.
W2. More explanation of context and geometry aware voxel transformer.
We apologize for any misunderstanding. The context and geometry aware voxel transformer is not a redesign of Symphonize. We provide a more detailed explanation here. Existing coarse-to-fine (sparse-to-dense) methods, such as VoxFormer and MonoOcc, generally follow a pipeline that first aggregates 3D information for visible voxels using depth-based queries. These queries are defined as a set of learnable parameters, which are the same for all input images. Subsequently, these methods complete the 3D information for non-visible regions using the reconstructed visible areas as starting points. The aggregation of information for visible voxels is accomplished through deformable cross-attention, while the completion of information in non-visible regions is handled by deformable self-attention, similar to MAE [1].
For the proposed context and geometry-aware voxel transformer, we take into account the context of different images and introduce context-dependent queries. Instead of solely predefining a set of learnable parameters that primarily capture the overall distribution of the dataset, the context-dependent queries are related to the image content, allowing them to aggregate information from points within contextually relevant regions. Additionally, we extend deformable cross-attention from the 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates.
In contrast, Symphonize employs context-independent voxel queries for all input images. It first condenses the image into a set of tokens with higher-level instance semantics; the voxel queries then serve as the queries of the deformable cross-attention while the set of tokens serves as the keys and values. The original manuscript of Symphonize referred to this set of tokens as "queries", which may have led to some misunderstanding.
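To make the contrast concrete, here is a toy, hedged illustration (the shapes and the tiny linear generator are placeholders, not either paper's implementation): context-independent queries are a fixed learnable tensor shared by every input, while context-dependent queries are produced from each image's own features.

```python
# Context-independent vs. context-dependent voxel queries, in miniature.
import torch
import torch.nn as nn

N_VOX, C = 512, 32

# (a) Context-independent: one learned set of queries, identical for every input image.
fixed_queries = nn.Parameter(torch.randn(N_VOX, C))
batch_fixed = fixed_queries.unsqueeze(0).expand(2, -1, -1)      # broadcast to a batch of two images

# (b) Context-dependent: queries derived from each sample's image features.
query_generator = nn.Linear(C, C)                               # stand-in for the 2D->3D query generator
img_feat = torch.randn(2, N_VOX, C)                             # two different images -> different features
contextual_queries = query_generator(img_feat)                  # (2, N_VOX, C): differs per image

print(torch.equal(batch_fixed[0], batch_fixed[1]))                    # True: same queries for every image
print(torch.allclose(contextual_queries[0], contextual_queries[1]))   # False: queries follow the image content
```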
W3. Details of context net.
Sorry for the oversight of this module. The image encoder consists of a backbone network for extracting multi-scale features and a feature pyramid network to fuse them; we employ the SECONDFPN [2] network here. It fuses the multi-scale image features into a single feature map, and the context net consists of several convolution blocks that take this fused feature map as input and reduce the number of channels to produce the context feature. We will include these details, including the exact feature resolutions and channel sizes, in the revised manuscript.
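For concreteness, a hedged sketch of such a context net is given below; the channel sizes, block count, and normalization choice are placeholders rather than the released configuration.

```python
# A small stack of convolution blocks that reduces the channel width of the fused FPN feature map.
import torch
import torch.nn as nn

class ContextNet(nn.Module):
    def __init__(self, in_channels=256, out_channels=128):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, 1),   # channel reduction to the context feature width
        )

    def forward(self, fused_fpn_feat):
        return self.blocks(fused_fpn_feat)             # context feature at the FPN output resolution

ctx = ContextNet()(torch.randn(1, 256, 47, 153))
print(ctx.shape)
```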
W4. The table font is too small to read.
Thanks for your suggestion. We will fix it in the revised manuscript. Examples of tables with larger font are provided in the uploaded PDF.
[1] Masked autoencoders are scalable vision learners, CVPR 2022
[2] SECOND: Sparsely Embedded Convolutional Detection, Sensors 2018
Dear Reviewer jg8M,
Thanks for your comments, which are valuable for improving the overall quality of this manuscript.
We've provided additional architecture analysis on our proposed method and compared it with Symphonize under a comparable number of parameters. Besides, we provided more detailed explanations for the components of the network and examples of tables with larger fonts in the uploaded PDF.
Could we kindly ask whether our responses have addressed your concerns and whether you have any new questions? Thanks for your time and effort.
Authors of Paper ID 3532
We appreciate the valuable comments of the reviewers, which have greatly contributed to enhancing the quality of our paper. We are glad that the reviewers recognized various strengths of our work, including the clear motivation [wDiN, Lm1y], interesting idea [jg8M], comprehensive experiments [ryRH, jg8M, Lm1y], good performance [jg8M, wDiN, Lm1y], clear writing [wDiN, Lm1y], and easy-to-follow figures [ryRH, wDiN]. Additionally, Reviewer [Lm1y] highlighted that the provided guideline for the submitted code is detailed and easy to follow. We answer the specific questions from each reviewer below and have uploaded a PDF file with the figures and tables referenced in our rebuttal.
Dear reviewers,
We sincerely appreciate the dedication and constructive suggestions you’ve provided for our manuscript.
We have made efforts to address each of your comments, including conducting additional experiments and providing further explanations of our proposed method. Given the importance of the discussion between authors and reviewers, we kindly ask for your timely review of our response. We genuinely appreciate your feedback and look forward to any further discussions or suggestions you may have.
Best,
Authors of Paper ID 3532
The reviewers unanimously agree to accept the paper, despite different degrees of excitement. The AC agrees that the merits warrant acceptance, and encourages the authors to revise the paper according to the review comments for the camera-ready version.