PaperHub
Rating: 5.0/10 (Poster; 4 reviewers; min 5, max 5, std 0.0)
Individual ratings: 5, 5, 5, 5
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

FFAM: Feature Factorization Activation Map for Explanation of 3D Detectors

Submitted: 2024-05-13 · Updated: 2024-12-20

Abstract

Keywords
Explainable artificial intelligence; visual explanation; 3D object detection

Reviews and Discussion

Review
Rating: 5

This paper proposes a feature factorization activation map to explain 3D detectors. It uses non-negative matrix factorization (NMF) to obtain a global concept activation map and then refines it with feature gradients of an object-specific loss. A voxel upsampling strategy is further proposed to upsample sparse voxels, aligning the granularity of the activation map with that of the input point cloud. Both quantitative and qualitative results validate the efficacy of the proposed method.

Strengths

  • The basic idea is easy to follow, and the comparison with previous works (both in 2D and 3D) is clear.
  • The key motivation and contributions are clearly presented.
  • The methodology and implementation details are well organized.
  • The visualization results are impressive. Both the quantitative and qualitative results convincingly support the efficacy of the proposed explanation method.

Weaknesses

The only concern is about the application, or the value in applications, of this proposed method. I understand this is a good adaptation and attempt to apply such explanation methods on LiDAR-based 3D detectors, but it is still unclear how it can benefit the procedure of improving 3D detectors or downstream applications. I would be more interested in how it produces a stronger 3D detector or a 3D detector with better controllability and safety, especially for safety-critical scenarios like autonomous driving.

In addition, there are related areas and applications, such as 3D detection in indoor scenes (ScanNet, SUN RGB-D) and camera-based/multi-modality 3D detection. I am also curious about the performance of such explanation methods when applied to those problems. How about the performance, and are there any new challenges?

Questions

None.

Limitations

None.

Author Response

We greatly appreciate your insightful feedback. Below please find our clarifications in response to your comments.

Q1. The only concern is about the application, or the value in applications, of this proposed method. I understand this is a good adaptation and attempt to apply such explanation methods on LiDAR-based 3D detectors, but it is still unclear how it can benefit the procedure of improving 3D detectors or downstream applications. I would be more interested in how it produces a stronger 3D detector or a 3D detector with better controllability and safety, especially for safety-critical scenarios like autonomous driving.

Authors’ Reply: We fully agree with your viewpoint. Our FFAM can be used to find the regions of the point cloud input that a detector focuses on. It can also provide visual explanations for different object attributes at a fine-grained level. In Section 4.3, we utilize FFAM to reveal the detection mode of false positive predictions generated by detectors. There are three main observations in the experiment. First, we observe that the average saliency maps of false positives exhibit similarities to those of true positives: the detector predicts a false positive because it detects a pattern similar to that of a true positive. Second, false positives tend to be surrounded by more noise points, with a point density of approximately one-third that of true positives. We believe noise and sparse point density may be significant factors contributing to the occurrence of false positives. Lastly, the ratio of car, pedestrian, and cyclist objects in true positives is approximately 36:5:2, while in false positives it is 13:8:2. This suggests car objects are less prone to false positives than pedestrian and cyclist objects. We think these observations can help researchers design more efficient and reliable 3D object detectors. We believe there are more applications of FFAM that can be explored. In our future work, we will use FFAM to improve the performance (including accuracy and speed) of 3D detectors.

Q2. In addition, there are related areas and applications, such as 3D detection in indoor scenes (ScanNet, SUN RGB-D) and camera-based/multi-modality 3D detection. I am also curious about the performance of such explanation methods when applied to those problems. How about the performance, and are there any new challenges?

Authors’ Reply: In this paper, our main focus is on explaining LiDAR-based 3D detectors in outdoor scenes. For indoor scenes or other input modalities, as long as the intermediate features of the network can be directly linked to the input, our FFAM can be applied effectively. We believe the most challenging task would be explaining 3D detectors based on multi-view cameras, because the correspondence between the intermediate features of such detectors and the input is relatively intricate: many of them employ learning-based methods to lift two-dimensional images into three-dimensional space.

Comment

Thanks for the authors' efforts in addressing my questions. Given the response, I can better understand the application and the adaptation to other scenarios, but I think there are still limited new insights regarding how the proposed method can help improve current algorithms. The connection between the analysis and the proposed method is a little weak. Hence, I cannot raise my rating and will keep the original "borderline acceptance" recommendation.

Comment

Thank you for your comments. The interpretability research on 3D object detection is still in its early stages of development. There is very little existing work in this area currently. The method we proposed has made significant progress compared to previous approaches. Our paper focuses more on the theoretical aspect, with relatively weaker applications. However, it still holds positive significance for advancing the interpretability of 3D object detection models.

Review
Rating: 5

The paper proposes a method called Feature Factorization Activation Map (FFAM) to provide visual explanations for 3D object detectors based on LiDAR data. This method addresses the interpretability issue in 3D detectors by using non-negative matrix factorization to generate concept activation maps and refining these maps using object-specific gradients. The approach is designed to handle the unique challenges of 3D point cloud data, such as sparsity and the need for object-specific saliency maps. The paper evaluates FFAM against existing methods and demonstrates its effectiveness through qualitative and quantitative experiments.

Strengths

  1. Using non-negative matrix factorization to generate concept activation maps is novel and well-justified for the application.
  2. The method is evaluated on multiple datasets and compared against state-of-the-art methods, demonstrating its superiority in producing high-quality visual explanations.
  3. The paper provides a clear and detailed description of the methodology, including the feature factorization, gradient weighting, and voxel upsampling processes.
  4. The proposed method has practical implications for improving the interpretability of 3D object detectors, which is crucial for applications in autonomous driving and robotics.

Weaknesses

  1. The method involves several computationally intensive steps, such as non-negative matrix factorization and voxel upsampling, which may limit its applicability in real-time systems.
  2. The paper focuses primarily on LiDAR-based 3D detectors. Discussing how the method could be adapted or extended to other types of 3D data or detection systems would be beneficial.
  3. While the evaluation is comprehensive, it primarily focuses on two datasets (KITTI and Waymo Open). Additional datasets and scenarios could further validate the robustness and generalizability of the method.

Questions

  1. Real-Time Applicability: How does the computational overhead of FFAM compare to existing methods in real-time applications, especially in autonomous driving scenarios?
  2. How sensitive is FFAM to the choice of detector? We noticed that two kinds of detectors were used in this paper. More comparison and analysis are necessary.
  3. Can FFAM be extended to other 3D detectors or modalities, such as RGB-D sensors or radar data? If so, what modifications would be necessary?

Limitations

  1. FFAM requires access to the feature maps within 3D detectors, which may not always be possible, especially for proprietary or closed-source systems.
  2. The method's scalability to large-scale or real-time applications is not fully addressed. The computational requirements may be prohibitive for some practical applications.
  3. The method is tailored to LiDAR data and may not directly translate to other types of 3D data without significant modifications.

Author Response

We greatly appreciate your insightful feedback. Below please find our clarifications in response to your comments.

Q1. Real-Time Applicability: How does the computational overhead of FFAM compare to existing methods in real-time applications, especially in autonomous driving scenarios?

Authors’ Reply: Compared to the existing method OccAM, our approach has a significant speed advantage. NMF and voxel upsampling are implemented as CUDA operators, which use the GPU to accelerate the process. The main overhead of our FFAM lies in gradient backpropagation, so the latency of FFAM depends on the size of the model being explained. Currently, the backpropagation latency of most 3D object detectors is around 100 ms. In contrast, OccAM is a perturbation-based explanation method that requires extensive sampling of inputs, which are then processed one by one; as a result, it is very slow, and obtaining a visual explanation with OccAM typically takes several minutes. However, visual explanation methods (including our FFAM and OccAM) are post hoc explanations of the model; they are not used in real-time autonomous-driving applications. Therefore, we did not compare the speed of these methods in the paper.

Q2. How sensitive is FFAM to the choice of detector? We noticed that two kinds of detectors were used in this paper. More comparison and analysis are necessary.

Authors’ Reply: FFAM is not specific to detector architecture. It can be widely applied to existing LiDAR-based 3D detectors. In Appendix A.2, we apply FFAM to some other state-of-the-art detectors, such as DCDet, PV-RCNN, and Voxel R-CNN.

Q3. Can FFAM be extended to other 3D detectors or modalities, such as RGB-D sensors or radar data? If so, what modifications would be necessary?

Authors’ Reply: FFAM can be extended to a wide range of 3D detectors. In this paper, our main focus is on point clouds as the input format; FFAM has not yet been applied to other modalities. However, we believe RGB-D and radar data can be converted into point cloud format, so these modalities should also be feasible for FFAM to process.

Q4. FFAM requires access to the feature maps within 3D detectors, which may not always be possible, especially for proprietary or closed-source systems.

Authors’ Reply: We concur with your observation regarding this limitation of FFAM; we also point it out in the conclusion section. We believe that FFAM can primarily be used to provide researchers with insights to improve detectors. In addition, it can be used to reveal the internal working mechanism of detectors to users, thereby improving their understanding of and trust in detectors.

Comment

Dear reviewer v6wE, is there anything more you would like to ask of the authors, before the author-reviewer discussion period ends (tomorrow)?

Comment

Thank you for the detailed responses. The rebuttal has addressed most of my previous concerns; I will keep my initial rating of borderline accept.

Comment

Thank you for reviewing our paper and providing valuable feedback. Do you have any other unresolved issues? If possible, could you consider adjusting your rating to more accurately reflect your views on our work? We would greatly appreciate your support and suggestions.

Review
Rating: 5

This paper addresses the challenge of explanation and interpretability in 3D detection methods. It introduces a Feature Factorization Activation Map (FFAM), which utilizes non-negative matrix factorization (NMF) and object-specific gradient weighting to generate global and object-specific activation maps at the voxel level. An up-sampling method is subsequently employed to produce per-point activation maps. Extensive quantitative and qualitative experiments demonstrate the effectiveness of the proposed FFAM in generating saliency maps for various points over the previous method.

Strengths

  • The topic is interesting and warrants investigation. As discussed in Section 4.3, the research can significantly enhance 3D object detectors, particularly in identifying false positive modes.
  • The quantitative results are convincing and relevant to the topic.
  • Overall, the paper is well-written and easy to comprehend.
  • The code is made available, promoting transparency and reproducibility.
  • The paper discusses the limitation of the proposed method, specifically the necessity of accessing the feature map.

Weaknesses

  • The rationale for using NMF is unclear. It appears to be a learning-based, PCA-like method to extract saliency from voxel features. How does it compare to the proposed method on the global activation map?
  • There is a lack of ablation studies, such as those examining the parameter 𝛾.
  • The gradient weighting method is intriguing and seems sensible. Is this method original to your work?
  • It would be beneficial to use 𝑊 as a weight in Equation 2. The current version is a little confusing.
  • In Section 4.1, it is claimed that the saliency map generated by FFAM is superior to OccAM because it is more focused on the object. Please provide further validation on why a clearer activation map is considered better.

Questions

Please check weaknesses.

Limitations

It would strengthen the paper if an example could be provided demonstrating how FFAM can help identify some non-trivial error modes and offer insights for improving the detector.

Author Response

We greatly appreciate your insightful feedback. Below please find our clarifications in response to your comments.

Q1. The rationale for using NMF is unclear. It appears to be a learning-based, PCA-like method to extract saliency from voxel features. How does it compare to the proposed method on the global activation map?

Authors’ Reply: Non-negative matrix factorization (NMF) is a matrix factorization technique used to discover latent concepts in features. NMF approximately decomposes a non-negative matrix V into the product of two non-negative matrices H and W, i.e., V ≈ HW. Because both W and H are non-negative, NMF is more interpretable in many fields. For example, in facial recognition, the basis vectors in the W matrix usually represent specific concepts such as the nose, eyes, and mouth [1]. In our work, we utilize NMF to uncover latent concepts within the voxel features of 3D detectors. Typically, voxel features that carry effective detection clues contain richer semantic concepts. Therefore, we sum the activation coefficients of the different concepts in the H matrix to obtain a global activation map. The basis vectors obtained by PCA do not correspond to clear semantic concepts, so we did not compare our method with PCA-like methods.
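
For concreteness, a minimal sketch of this global-map construction (not the authors' released code; `voxel_feats` as an (N, d) non-negative feature matrix, e.g. taken after a ReLU, and the concept count `r` are assumptions):

```python
# Build a global concept activation map from non-negative voxel features via NMF.
import numpy as np
from sklearn.decomposition import NMF

def global_activation_map(voxel_feats: np.ndarray, r: int = 3) -> np.ndarray:
    nmf = NMF(n_components=r, init="nndsvda", max_iter=300)
    H = nmf.fit_transform(voxel_feats)   # (N, r) per-voxel concept coefficients
    # nmf.components_ holds the (r, d) concept basis W; summing H over the
    # concept axis yields one saliency value per voxel.
    act = H.sum(axis=1)
    return act / (act.max() + 1e-8)      # normalize to [0, 1]
```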

Q2. There is a lack of ablation studies, such as those examining the parameter 𝛾.

Authors’ Reply: Due to page limitations in the main body of the paper, we have included the hyperparameter analysis and ablation experiments in the appendix. Please refer to Appendix A.1 and A.4 for details.

Q3. The gradient weighting method is intriguing and seems sensible. Is this method original to your work?

Authors’ Reply: To the best of our knowledge, we are the first to use gradients to refine a global activation map. The previous method closest to ours is ODAM, which also utilizes backward gradients to generate saliency maps. However, there are two main differences between our method and ODAM. First, the usage of gradients is different: ODAM multiplies the gradient map with intermediate feature maps and then sums the values along the channel dimension to obtain the final saliency maps, while our FFAM utilizes the gradients to generate a weighting term as follows:

\omega = \sum_{k=1}^{d} \left| G_{\cdot k} \right|,

where G_{\cdot k} refers to the k-th channel of the gradient map G, and d denotes the number of channels. Second, ODAM is designed to generate visual explanations for image detectors, while our FFAM is used to explain LiDAR-based 3D detectors.
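
A small sketch of this weighting term and of how it can refine the global map (the element-wise multiplication follows the description in this discussion; the tensor names are illustrative, not the authors' implementation):

```python
# Assumes `global_act` is the (N,) global concept activation map and `grad_map`
# is the (N, d) gradient of an object-specific loss w.r.t. the same voxel features.
import torch

def object_specific_map(global_act: torch.Tensor, grad_map: torch.Tensor) -> torch.Tensor:
    omega = grad_map.abs().sum(dim=1)    # (N,): sum_k |G_{.k}|
    refined = global_act * omega         # element-wise re-weighting per voxel
    return refined / (refined.max() + 1e-8)
```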

Q4. It would be beneficial to use W as a weight in Equation 2. The current version is a little confusing.

Authors’ Reply: Equation 2 represents the optimization problem solved by NMF. Since an exact non-negative factorization of matrix A is difficult to obtain, only an approximate solution is computed: we seek a matrix Â that closely approximates A while being the product of two non-negative matrices H and W.

Q5. In Section 4.1, it is claimed that the saliency map generated by FFAM is superior to OccAM because it is more focused on the object. Please provide further validation on why a clearer activation map is considered better.

Authors’ Reply: In the context of visual explanation for 3D detectors, a clearer activation map is considered superior because it better identifies the region of interest within the point cloud. We have conducted quantitative experiments comparing our method with previous explanation methods. As shown in Table 1 and Table 3, our FFAM achieves the best results on the VEA, PG, and enPG metrics, which reflect how focused an explanation method is on an object. Furthermore, our FFAM performs best on the Deletion and Insertion metrics, which are widely used to evaluate explanation methods. As shown in Figure 5, our method shows the fastest performance drop for Deletion and the largest increase for Insertion, indicating that points highlighted in our saliency maps have a greater effect on detector predictions than those of the other methods.
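
For reference, a deletion-style curve can be sketched as below (assumption: `score_fn` returns the detector's confidence for the explained object on a given point subset; this is illustrative, not the paper's exact evaluation code):

```python
# Remove the most salient points first and track how the detector's confidence drops.
import numpy as np

def deletion_curve(points: np.ndarray, saliency: np.ndarray, score_fn, steps: int = 10):
    order = np.argsort(-saliency)                   # most salient points first
    scores = []
    for i in range(steps + 1):
        keep = order[int(len(order) * i / steps):]  # drop the top i/steps fraction
        scores.append(score_fn(points[keep]))
    return scores   # a faster drop indicates a more faithful saliency map
```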

Q6. It would strengthen the paper if an example could be provided demonstrating how FFAM can help identify some non-trivial error modes and offer insights for improving the detector.

Authors’ Reply: In Section 4.3, we utilize FFAM to find the modes of false positives generated by a detector. There are three observations in the experiments, and we believe they provide some insights for researchers to improve 3D detectors. First, we observe that the average saliency maps of false positives exhibit similarities to those of true positives: the detector predicts a false positive because it detects a pattern similar to that of a true positive. Second, false positives tend to be surrounded by more noise points, with a point density of approximately one-third that of true positives. We believe noise and sparse point density may be significant factors contributing to the occurrence of false positives. Lastly, the ratio of car, pedestrian, and cyclist objects in true positives is approximately 36:5:2, while in false positives it is 13:8:2. This suggests car objects are less prone to false positives than pedestrian and cyclist objects.

[1] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999.

Comment

Thanks for your responses. They resolve most of my concerns. However, considering the technical soundness of the work, I cannot raise my rating and will keep my previous rating of borderline acceptance.

Comment

Thank you very much for recognizing our work and providing valuable feedback. We are glad to hear that our response has resolved most of your questions. May we ask what specific concerns you have regarding the technical soundness? We are willing to discuss further and are eager to make improvements based on your valuable feedback.

Comment

Dear reviewer bytz, is there anything more you would like to ask of the authors, before the author-reviewer discussion period ends (tomorrow)?

Review
Rating: 5

The paper proposes “FFAM” for visual explanation of 3D detectors. It introduces non-negative matrix factorization (NMF) to decompose 3D features into the product of two non-negative matrices. Besides, an object-specific loss is utilized to generate object-specific saliency maps. Finally, voxel upsampling is used to recover the resolution of the activation maps.

Strengths

  1. This work introduces NMF in explaining point cloud detectors and utilizes feature gradients of an object-specific loss to generate object-specific saliency maps.
  2. A voxel upsampling strategy is proposed to upsample sparse voxels.

Weaknesses

  1. The description of Non-negative Matrix Factorization (NMF) is unclear. When the point cloud is large in scale, NMF may become unstable and exhibit a long convergence time.
  2. The core innovation of this paper lies in feature factorization. The authors directly chose NMF but did not provide a reason for this choice. For instance, Principal Component Analysis (PCA) can also achieve feature factorization.
  3. The object-specific gradient is element-wise multiplied with the global concept activation map to obtain the specific activation map of the object. In other words, the object-specific gradient inhibits the activations that do not belong to the current object. Therefore, it seems reasonable to directly use the weighting of the object-specific gradient as the activation map.
  4. Since the gradient map G can already represent the current object, why not perform NMF on the gradient map and then use a method similar to obtaining a global activation map to derive the object activation map?
  5. Object-Specific Gradient Weighting is a general module. Can it be applied to other methods of generating activation maps, such as OccAM?
  6. Regarding voxel upsampling, the authors provide a limited description. I am curious about why they chose the Gaussian kernel and how it compares to other upsampling methods such as trilinear interpolation and transpose convolution.

Questions

Please see the weaknesses.

Limitations

The authors mentioned the limitations and societal impact in the checklist.

Author Response

We greatly appreciate your insightful feedback. Below please find our clarifications in response to your comments.

Q1. The description of Non-negative Matrix Factorization (NMF) is unclear. When the point cloud is large in scale, NMF may become unstable and exhibit a long convergence time.

Authors’ Reply: In our experiments, non-negative matrix factorization (NMF) maintains excellent stability and rapid convergence across point clouds of different scales, including those from the KITTI and Waymo Open datasets. The implementation of NMF is facilitated by well-established libraries, which can be seamlessly integrated with PyTorch code.

Q2. The core innovation of this paper lies in feature factorization. The authors directly chose NMF but did not provide a reason for this choice. For instance, Principal Component Analysis (PCA) can also achieve feature factorization.

Authors’ Reply: We agree that principal component analysis (PCA) and non-negative matrix factorization (NMF) are both effective tools for feature decomposition and dimensionality reduction. However, the basis vectors derived from PCA cannot represent a clear semantic concept and are thus primarily used for dimensionality reduction. In contrast, the basis vectors of NMF exhibit non-negativity and additivity, often representing specific concepts such as car doors and wheels. DFF [1] was the first work to employ NMF to localize semantic concepts within images. Inspired by DFF, we utilize NMF to reveal the latent semantic concepts embedded in the intermediate voxel features of 3D object detectors. Typically, voxel features that provide effective cues for detection possess richer semantic concepts, which can be used to create saliency maps.

Q3. The object-specific gradient is element-wise multiplied with the global concept activation map to obtain the specific activation map of the object. In other words, the object-specific gradient inhibits the activations that do not belong to the current object. Therefore, it seems reasonable to directly use the weighting of the object-specific gradient as the activation map.

Authors’ Reply: We agree that the object-specific gradient can serve as an activation map. However, it only highlights the object-specific region in a point cloud and struggles to differentiate the importance of points within that region. NMF helps uncover the recognition pattern within the object-specific region. As shown in Figure 4, NMF assists in identifying the detector's distinct recognition patterns for various categories and object attributes. The quantitative results in Table 9 emphasize the enhanced performance achieved by integrating the object-specific gradient with NMF, confirming the significant role of NMF in obtaining more nuanced and detailed visual explanations.

Q4. Since the gradient map G can already represent the current object, why not perform NMF on the gradient map and then use a method similar to obtaining a global activation map to derive the object activation map?

Authors’ Reply: The raw point features contain substantial semantic information, which is beneficial for NMF to uncover latent concepts within these features. However, the gradient map G only contains backward gradients for a specific object. Consequently, employing NMF to extract concepts from gradient maps lacks a clear conceptual basis.

Q5. Object-Specific Gradient Weighting is a general module. Can it be applied to other methods of generating activation maps, such as OccAM?

Authors’ Reply: We believe that incorporating object-specific gradient weighting into other methods is feasible, but we have not yet tried combining the weighting module with them, because other methods usually have their own approaches for obtaining object-level visual explanations. For example, OccAM is a perturbation-based method that randomly masks the input point cloud to assess performance changes in the output. As it inherently serves as an object-level explanation method, there is no immediate need to integrate it with object-specific gradient weighting. Consequently, we have not attempted to combine these techniques.

Q6. Regarding voxel upsampling, the authors provide a limited description. I am curious about why they chose the Gaussian kernel and how it compares to other upsampling methods such as trilinear interpolation and transpose convolution.

Authors’ Reply: Given that voxels are sparsely scattered throughout 3D space, identifying all neighboring voxels for a given point to be interpolated is challenging. Consequently, trilinear interpolation is not well suited to sparse voxels. Transpose convolution, being a learning-based technique, is also not an ideal choice within our framework. In contrast, our voxel upsampling method first searches for neighbors within a specified range and then applies a Gaussian kernel to weigh them. This method is more adaptable for our FFAM.
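
A minimal sketch of such Gaussian-kernel upsampling (the radius, sigma, and weighted-average aggregation are assumptions; the paper's CUDA implementation may differ):

```python
# Each raw point gathers the activations of nearby voxels and averages them
# with Gaussian distance weights.
import numpy as np
from scipy.spatial import cKDTree

def upsample_to_points(points, voxel_centers, voxel_act, radius=1.0, sigma=0.5):
    tree = cKDTree(voxel_centers)
    point_act = np.zeros(len(points))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, r=radius)          # neighboring voxels
        if not idx:
            continue
        d = np.linalg.norm(voxel_centers[idx] - p, axis=1)
        w = np.exp(-d**2 / (2 * sigma**2))                # Gaussian kernel weights
        point_act[i] = np.dot(w, voxel_act[idx]) / w.sum()
    return point_act
```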

[1] Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. In ECCV, 2018.

Comment

The authors have addressed my concerns. I raise my rating to borderline accept.

Comment

Dear reviewer 93Lo, is there anything more you would like to ask of the authors, before the author-reviewer discussion period ends (tomorrow)?

Comment

Dear reviewer 93Lo, do you have any further questions regarding our response?

Final Decision

This paper presents a method for analyzing LiDAR-based 3D detectors. The main idea is to generate point-cloud-based and voxel-based saliency maps, revealing which points contribute most to a given 3D detection. These saliencies can also be broken down into contributions toward specific attributes, such as the individual parameters of a box. Before and after the rebuttal phase, reviews were borderline, but a solid “borderline accept” consensus emerged (with all reviewers giving this rating). Reviewer BgKE is unsure how the analysis will be used to improve detection algorithms, but also notes that the saliency visualizations are “impressive”. Reviewers 93Lo and bytz raised some technical concerns, which appear to have been diligently resolved by the authors. Reviewer v6wE pointed out some limitations of the work, which the authors might clarify in the text, but these are largely clear already. Given the consensus positive reviews, the good results (quantitative and qualitative), and the improvements and clarifications provided in rebuttal, the AC recommends acceptance. The authors are encouraged to carefully revise the paper for the camera-ready version, especially to include the technical clarifications worked out during the rebuttal phase.