Accelerating Non-Maximum Suppression: A Graph Theory Perspective
Summary
Reviews and Discussion
The article "Accelerating Non-Maximum Suppression: A Graph Theory Perspective" explores a novel way to optimize Non-Maximum Suppression (NMS) in object detection using graph theory. The authors present two new methods, QSI-NMS and BOE-NMS, which greatly enhance the efficiency of NMS while maintaining mean Average Precision (mAP). They also introduce NMS-Bench, a benchmarking framework designed to evaluate different NMS methods effectively.
Strengths
- This article brings an interesting and fresh perspective to NMS by applying graph theory.
- The proposed QSI-NMS and BOE-NMS algorithms show impressive improvements in computational efficiency, with QSI-NMS achieving up to 10.7× speedup and BOE-NMS achieving 5.1× speedup, all without sacrificing mean Average Precision (mAP).
- One of the standout contributions is NMS-Bench, a benchmarking framework that standardizes the evaluation of NMS algorithms, potentially driving further advancements in the field.
- Additionally, the paper provides a detailed analysis of the intrinsic structure of NMS through the lens of graph theory, offering valuable insights into its computational bottlenecks and optimization strategies.
Weaknesses
- In Section 5.2 Results, we can see that QSI-NMS operates with some performance decrease, and the paper doesn't thoroughly discuss which aspects of the algorithm might be causing this decline (perhaps some data is missed, but exactly what is missed is unclear).
- The connection between the proposed methods and graph theory appears weak. QSI-NMS is essentially a quicksort-based algorithm, and the size of the weakly connected components (WCC) or their independence doesn't seem to significantly impact its performance. BOE-NMS, on the other hand, introduces a heuristic for the search space, which doesn't strongly tie into graph theory principles either.
- Regarding eQSI-NMS, there are questions about the clarity and effectiveness of the methods. The pseudo-code in Appendix E.2 is difficult to understand, and there might be an issue with the line "for s ∈ S do s ← −s;" in eQSI-NMS. It’s unclear how this part of the code functions and whether it introduces any problems.
- The evaluation is primarily limited to the YOLOv8-N model on the MS COCO 2017 dataset. To gain a more comprehensive understanding of the algorithms' generalizability, broader evaluations across various models and datasets would be beneficial.
Questions
See the weaknesses part for details.
Limitations
The authors delineate the constraints of their proposed method in the paper.
We sincerely appreciate Reviewer HR1f's thoughtful review of our work and the insightful questions you have raised. Your feedback is invaluable in helping us improve and refine our research. We will address each of your concerns in detail.
Q1: The connection between the proposed methods and graph theory appears weak.
A1: In fact, our methods are closely linked to graph theory for the following two reasons:
The first reason is that the goal of the algorithms is to construct the graph $G$ or an approximation of it more quickly and efficiently (see Section 4). However, due to the careful design of our algorithms, there is no need to explicitly construct the graph in the implementation.
We explained in Section 4.1 (Lines 174-177) that QSI-NMS does not require explicit construction of the graph $G$. If we used different pivot selection methods or partitioning criteria, graph construction would become unavoidable. The situation is similar in BOE-NMS. If we do not sort by confidence scores, we can still obtain $G$, and the result after dynamic programming will be the same; however, this approach would incur additional space and time overhead. We conducted an experiment to support our analysis, in which we did not sort by confidence but performed dynamic programming at the end, keeping everything else unchanged. The results tested on YOLOv8-N are as follows:
| method | latency (μs) | AP 0.5:0.95 (%) |
|---|---|---|
| Original NMS | 936.9 | 37.2 |
| BOE-NMS | 201.8 | 37.2 |
| BOE-NMS (with graph construction) | 558.0 | 37.2 |
The second reason is that the algorithms' design is based on the analysis of WCCs in the graph $G$, which ensures the algorithms' effectiveness. In the analysis in Section 3 (Lines 126-135 and Figure 2), we identified two crucial characteristics of $G$: the number of WCCs is large, and their sizes are generally small. The first characteristic inspired the design of QSI-NMS, and the second inspired BOE-NMS (Lines 131-135).
Specifically, QSI-NMS is based on the divide-and-conquer strategy, where subproblems do not interfere with each other. If there are few WCCs, the divide-and-conquer process will repeatedly split the same WCC, leading to accuracy issues. BOE-NMS must traverse each WCC, and the smaller the WCCs, the faster BOE-NMS will be. Assuming there are $k$ WCCs, the size of the $i$-th WCC is $s_i$, and the total number of boxes is $n = \sum_{i=1}^{k} s_i$, we can perform a formal analysis using the following inequality:

$$\frac{n^2}{k} \le \sum_{i=1}^{k} s_i^2 \le n^2$$

The right half of the inequality shows that BOE-NMS is never worse than Original NMS, while the left half gives the lower bound of BOE-NMS's cost. Therefore, the more WCCs there are, the greater the optimization space for BOE-NMS, and when each $s_i$ is as small as possible (i.e., the component sizes are balanced), the cost approaches this lower bound.
In summary, both the number and the sizes of the WCCs affect our two methods: the more WCCs there are, the better the accuracy of QSI-NMS; the smaller the WCCs, the higher the efficiency of BOE-NMS.
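The role of the WCC sizes can be checked empirically. Below is a minimal sketch (our own illustration, with made-up boxes and a plain IoU threshold, not the paper's implementation) that finds the WCCs of the suppression graph with a union-find and verifies $n^2/k \le \sum_i s_i^2 \le n^2$:

```python
from collections import Counter

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def wcc_sizes(boxes, thr=0.5):
    """Sizes of the weakly connected components of the suppression graph:
    an edge joins any pair of boxes with IoU above thr (direction ignored)."""
    parent = list(range(len(boxes)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > thr:
                parent[find(i)] = find(j)
    return sorted(Counter(find(i) for i in range(len(boxes))).values())

# Two overlapping boxes plus two isolated ones -> WCC sizes [1, 1, 2].
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (90, 0, 100, 10)]
sizes = wcc_sizes(boxes)
n, k = sum(sizes), len(sizes)
assert n**2 / k <= sum(s**2 for s in sizes) <= n**2
```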
Q2: In Section 5.2 Results, we can find that QSI-NMS operates with some performance decrease, and the paper doesn't thoroughly discuss what aspects of the algorithm might be causing this decline (maybe missing some data, but exactly what is missed).
A2: Your question is very insightful, and other reviewers are also interested in this issue. Therefore, we have included the response to this question in the global rebuttal. Please refer to Q1 of the global rebuttal for the details.
Q3: Regarding eQSI-NMS, there are questions about the clarity and effectiveness of the methods. The pseudo-code in Appendix E.2 is difficult to understand, and there might be an issue with the line "for s ∈ S do s ← −s;" in eQSI-NMS. It's unclear how this part of the code functions and whether it introduces any problems.
A3: There was indeed a mistake here, and we appreciate your feedback. We should first sort the confidence scores in ascending order. Then, by maintaining a stack, we can find for each element the position of the last element before it that is greater than its confidence score. Additionally, we need to find the first element after it that is greater than its confidence score, which can be done by reversing the order and applying the same stack-based procedure. The pseudo-code expressed this step in a way that is indeed prone to misunderstanding, and we will rewrite this part in our revised version: we will first sort in ascending order, then process the sequence once from front to back and once from back to front. Our implementation itself is correct; we suggest you refer to eQSINMS.hpp in the supplementary material, which is easy to follow.
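For clarity, the stack technique we refer to can be sketched generically (this is our own illustration in Python, not the pseudo-code from Appendix E.2): a monotonically decreasing stack yields, for every element, the last earlier element that is strictly greater, in $O(n)$ total time, and running the same routine on the reversed sequence gives the first later greater element.

```python
def prev_greater(scores):
    """out[i] = index of the last element before i with a strictly greater
    value, or -1 if none; O(n) total via a monotonically decreasing stack."""
    stack, out = [], []
    for i, s in enumerate(scores):
        # discard candidates that can never be a "previous greater" again
        while stack and scores[stack[-1]] <= s:
            stack.pop()
        out.append(stack[-1] if stack else -1)
        stack.append(i)
    return out

scores = [3, 1, 4, 2]
assert prev_greater(scores) == [-1, 0, -1, 2]

# First greater *after* i: apply the same routine to the reversed list.
rev = prev_greater(scores[::-1])
nxt = [len(scores) - 1 - j if j >= 0 else -1 for j in rev][::-1]
assert nxt == [2, 2, -1, -1]
```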
Q4: The evaluation is primarily limited to the YOLOv8-N model on the MS COCO 2017 dataset. To gain a more comprehensive understanding of the algorithms' generalizability, broader evaluations across various models and datasets would be beneficial.
A4: We evaluated our methods on various datasets and models. On COCO 2017, we tested YOLOv8 models (one-stage, anchor-free model), YOLOv5 models (one-stage, anchor-based model), and Faster R-CNN models (two-stage, anchor-based model), as detailed in Table 1 of Section 5.2. We also tested the YOLOv8 models on the more mature and robust Open Images V7 dataset, as shown in Table 2.
The paper, "Accelerating Non-Maximum Suppression: A Graph Theory Perspective," presents a novel approach to enhancing the efficiency of the Non-Maximum Suppression (NMS) algorithm used in object detection. By analyzing NMS through graph theory, it introduces two new optimization methods: Quicksort Induced NMS (QSI-NMS) and Boxes Outside Excluded NMS (BOE-NMS). These methods leverage the structure of weakly connected components in a graph to reduce computational complexity and speed up the process, with minimal impact on the mean Average Precision (mAP). The paper also introduces NMS-Bench, a benchmarking framework to evaluate NMS methods rapidly.
Strengths
- Innovative Approach: The paper applies graph theory to optimize NMS, a critical post-processing step in object detection, demonstrating significant improvements in computational efficiency.
- Comprehensive Evaluation: It includes a robust evaluation using the newly developed NMS-Bench, providing detailed comparisons of performance improvements over traditional methods.
- Practical Impact: The proposed methods, particularly eQSI-NMS, offer substantial speed increases with minimal loss in accuracy, which is highly beneficial for real-time object detection applications.
Weaknesses
- Complexity of Graph-Theoretical Analysis: The paper's reliance on graph theory might limit its accessibility to those without a background in this area. The proofs and theoretical explanations are dense and could be challenging to follow for non-specialists.
- Limited Discussion on Scalability: While the paper shows efficiency improvements, it does not extensively discuss the scalability of the proposed methods across different hardware or larger datasets beyond those tested.
- Dependency on Specific Conditions: The effectiveness of the proposed methods may depend heavily on the characteristics of the data and the specific architectures of the detection systems used, which may not generalize well to all types of object detection tasks.
Questions
- Details on Graph Construction (Lines 99-104): The paper mentions the construction of graph G using bounding boxes and suppression relationships. Could you elaborate on the computational overhead of this graph construction process and its impact on the overall efficiency of NMS?
- Experimental Setup (Lines 260-265): The results presented are impressive; however, could more information be provided on how the different configurations of bounding boxes were handled during the experiments, especially concerning their distribution and density?
- Proof of Theorem 1 (Lines 510-520): Could you clarify how the dynamic programming approach adapts to variations in graph structure, particularly for non-standard configurations of bounding boxes?
Limitations
Complexity for Non-Specialists: The application of graph theory to optimize the Non-Maximum Suppression (NMS) process introduces complex mathematical concepts and proofs that might not be easily accessible or understandable for practitioners or researchers without a background in graph theory. This complexity could limit the broader application and adaptation of the proposed methods in the field.
Dependence on Data Characteristics: The effectiveness of the proposed QSI-NMS and BOE-NMS methods is highly dependent on the specific characteristics of the data, such as the distribution and density of bounding boxes. This dependency might restrict the generalizability of the methods across different object detection tasks and datasets where these characteristics vary significantly.
Thank you for your detailed review and valuable feedback. We appreciate your recognition of our innovative approach, comprehensive evaluation, and practical impact. Below, we will address each of your concerns one by one.
Q1: Details on Graph Construction (Line 99-104): The paper mentions the construction of graph G using bounding boxes and suppression relationships. Could you elaborate on the computational overhead of this graph construction process and its impact on the overall efficiency of NMS?
A1: The computational overhead of constructing the graph in our algorithms is zero. Constructing the graph and topological sorting incurs additional computational overhead; although this does not affect the complexity of our methods, it can slow down their practical performance. To address this, we designed our algorithms such that the order of the nodes we iterate over follows a valid topological sort, so dynamic programming can be completed without explicitly constructing the graph. As a result, our algorithms do not incur the extra overhead of graph construction.
For QSI-NMS, Lines 174-177 explain why explicit graph construction is unnecessary. The same applies to BOE-NMS, where sorting the boxes by confidence scores from high to low already provides a valid topological sort. We can also quickly locate the mutually suppressing boxes without sorting, thereby obtaining $G$, and performing dynamic programming yields the same results.
However, in BOE-NMS, not sorting in advance would introduce the additional computational overhead of constructing the graph and topological sorting. We conducted an experiment to support our analysis: we did not sort by confidence but performed dynamic programming at the end, keeping everything else unchanged. The results on YOLOv8-N are shown in the table below:
| method | average latency (μs) | AP 0.5:0.95 (%) |
|---|---|---|
| Original NMS | 936.9 | 37.2 |
| BOE-NMS | 201.8 | 37.2 |
| BOE-NMS (with graph construction) | 558.0 | 37.2 |
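The point that confidence-sorted order is itself a valid topological order of the suppression DAG can be seen in a minimal greedy-NMS sketch (our own illustration, not the optimized BOE-NMS implementation): every keep/suppress decision consults only boxes already decided, so no graph is ever materialized.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, thr=0.5):
    """Plain O(n^2) greedy NMS. Visiting boxes in descending score order
    follows a valid topological sort of the suppression DAG, so each
    decision depends only on already-processed (kept) boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thr for k in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
assert greedy_nms(boxes, scores) == [0, 2]  # box 1 suppressed by box 0
```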
Q2: Experimental Setup (Line 260-265): The results presented are impressive; however, could more information be provided on how the different configurations of bounding boxes were handled during the experiments, especially concerning their distribution and density?
A2: Our experimental data is derived from the actual outputs of commonly used models on public datasets, as described in Section 5.1. Each bounding box is represented by its position and size, along with a confidence score, as defined in Section 2. For detailed statistics on the bounding boxes, please refer to Appendix D.
Q3: Proof of Theorem 1 (Line 510-520): Could you clarify how the dynamic programming approach adapts to variations in graph structure, particularly for non-standard configurations of bounding boxes?
A3: As proven in Theorem 1, variations in the graph structure do not affect the correctness of the results obtained through dynamic programming. However, the structure of the graph can influence the efficiency of dynamic programming, since its complexity grows with the number of nodes and edges; pre-sorting reduces this cost. Therefore, when the graph has many nodes or edges, the efficiency of dynamic programming decreases, especially in cases where all bounding boxes are very dense (which is rare in real-world data).
Q4: Complexity of Graph-Theoretical Analysis: The paper's reliance on graph theory might limit its accessibility to those without a background in this area. The proofs and theoretical explanations are dense and could be challenging to follow for non-specialists.
A4: We understand that the graph-theoretical aspects may be complex. To address this, we've included detailed explanations and supplementary materials to clarify the key points (see Appendix A). Additionally, we emphasize the practical applications to highlight the relevance of our work.
Q5: Limited Discussion on Scalability: While the paper shows efficiency improvements, it does not extensively discuss the scalability of the proposed methods across different hardware or larger datasets beyond those tested.
A5: We did not have enough time to test on different hardware environments, but our methods indeed reduce computational overhead (see Appendix D.3, Figure 6). Moreover, we implemented our methods in the torchvision library, and their performance surpasses that of highly parallel CUDA NMS (see Appendix D.3), demonstrating that the efficiency of our methods does not rely on specific hardware and can be applied to low-power edge devices.
Additionally, we tested on a larger and more robust dataset, Open Images V7. Please refer to Section 5.2, Table 2 for details.
Q6: Dependency on Specific Conditions: The effectiveness of the proposed methods may depend heavily on the characteristics of the data and the specific architectures of the detection systems used, which may not generalize well to all types of object detection tasks.
A6: We believe that our analysis of the characteristics of $G$ is widely applicable (Lines 126-135).
This paper presents a method from a new perspective to enhance the efficiency of the Non-Maximum Suppression (NMS) algorithm with an affordable accuracy decrease.
The authors introduce a novel perspective by analyzing NMS through the lens of graph theory, revealing its intrinsic structure as a directed acyclic graph.
This insight leads to the development of two NMS optimization methods: QSI-NMS and BOE-NMS.
Furthermore, the authors also introduce NMS-Bench, an end-to-end benchmark that facilitates rapid and comprehensive validation of various NMS algorithms.
Strengths
(1) The paper innovatively applies graph theory to NMS, offering a detailed theoretical basis for the proposed QSI-NMS and BOE-NMS algorithms, which excel at enhancing speed while preserving accuracy. (2) The experimental results do support the efficiency of the proposed algorithms. (3) The construction of the benchmark is a solid contribution to the community. (4) The paper offers a new perspective and tools for understanding and optimizing the NMS step in object detection.
Weaknesses
(1) A case study that illustrates the overly suppressed samples would be welcome. (2) How was the average latency calculated? More information is preferred; this is not clear. (3) How does the proposed algorithm perform on YOLOv10 and Mask R-CNN? (4) Although the proposed method is tested on the proposed NMS-Bench, experimental results with benchmark detection algorithms are still preferred.
Questions
(1) How was the average latency calculated? More information is preferred. Please report the time of graph construction and the time of NMS, respectively. (2) How does the proposed algorithm perform on YOLOv10 and Mask R-CNN?
Limitations
The effectiveness of the proposed method should be evaluated on more real-world cases.
A more detailed analysis of which kinds of samples are missed by the algorithm is lacking.
Thank you for your valuable feedback and for acknowledging the value of our work. We will address your questions and concerns in the following responses.
Q1: A case study that illustrates the overly suppressed samples is welcomed.
A1: This question is fundamental and others are also interested in this issue. The proof of Theorem 4 demonstrates that BOE-NMS does not overly suppress boxes compared to Original NMS, leading to the same results. In QSI-NMS, however, in some special cases, nodes within the same WCC might be incorrectly assigned to different subproblems, resulting in retaining some boxes that should have been suppressed. To provide a comprehensive and clear explanation, we have included our response in the global rebuttal. We kindly ask you to review our response in A1 of the global rebuttal.
Q2: How was the average latency calculated? More information is preferred; this is not clear. Please report the time of graph construction and the time of NMS, respectively.
A2: The latency measurement begins when the bounding boxes are input and ends when the retained bounding boxes are output. For a given dataset, latency is measured per image, using the bounding boxes generated for that image as input, and the total latency is averaged over all images. To mitigate random error, this measurement is repeated 5 times, and the mean of these measurements is reported as the final average latency. Thus, the average latency is the average time taken to execute NMS per image.
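The measurement protocol above amounts to the following harness (a sketch under our own naming; `nms_fn` and `per_image` are placeholders for the method under test and the cached per-image detector outputs):

```python
import time

def average_latency(nms_fn, per_image, repeats=5):
    """Mean per-image NMS latency in seconds: time only the NMS call,
    average over all images, then average over `repeats` runs to damp
    random measurement error."""
    run_means = []
    for _ in range(repeats):
        total = 0.0
        for boxes, scores in per_image:
            t0 = time.perf_counter()
            nms_fn(boxes, scores)
            total += time.perf_counter() - t0
        run_means.append(total / len(per_image))
    return sum(run_means) / repeats
```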
We do not need any additional time to construct the graph. Constructing the graph and topological sorting incurs additional computational overhead; although this does not affect the complexity of our methods, it can slow down their practical performance. To address this, we designed our algorithms such that the order of the nodes we iterate over follows a valid topological sort, so dynamic programming can be completed without explicitly constructing the graph. As a result, our algorithms do not incur the extra overhead of graph construction.
For QSI-NMS, Lines 174-177 explain why explicit graph construction is unnecessary. The same applies to BOE-NMS, where sorting the boxes by confidence scores from high to low already provides a valid topological sort. We can also quickly locate the mutually suppressing boxes without sorting, thereby obtaining $G$, and performing dynamic programming yields the same results.
However, in BOE-NMS, not sorting in advance would introduce the additional computational overhead of constructing the graph and topological sorting. Please check the table below:
| method | average latency (μs) | AP 0.5:0.95 (%) |
|---|---|---|
| Original NMS | 936.9 | 37.2 |
| BOE-NMS | 201.8 | 37.2 |
| BOE-NMS (with graph construction) | 558.0 | 37.2 |
Q3: How is the performance of the proposed algorithm on the Yolo V10 and mask RCNN.
A3: At the time of conducting this research, YOLOv10 had not yet been open-sourced, and we currently do not have sufficient time to include experiments with it. However, we tested our methods on the instance segmentation task using Mask R-CNN and YOLOv8, where they also demonstrated significant superiority over other methods. Please refer to the tables below.
Table 1: Latency (μs) of various methods
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| Mask R-CNN(R50-FPN) | 48.8 | 105.7 | 222.8 | 31.9 | 30.3 | 21.6 |
| Mask R-CNN(R101-FPN) | 45.3 | 113.1 | 205.0 | 32.3 | 29.8 | 21.5 |
| Mask R-CNN(X101-FPN) | 40.3 | 105.6 | 189.2 | 26.7 | 26.6 | 19.3 |
| YOLOv8n-seg | 1265.3 | 366.4 | 859.4 | 219.5 | 153.4 | 85.4 |
| YOLOv8s-seg | 740.0 | 269.2 | 736.2 | 158.6 | 115.8 | 61.9 |
Table 2: Box mAP(%) of various methods
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| Mask R-CNN(R50-FPN) | 41.0 | 40.4 | 40.9 | 41.0 | 40.8 | 40.4 |
| Mask R-CNN(R101-FPN) | 42.9 | 42.2 | 42.9 | 42.9 | 42.7 | 42.3 |
| Mask R-CNN(X101-FPN) | 44.3 | 43.6 | 44.2 | 44.3 | 44.1 | 43.7 |
| YOLOv8n-seg | 36.7 | 36.5 | 36.7 | 36.7 | 36.6 | 36.4 |
| YOLOv8s-seg | 44.7 | 44.5 | 44.7 | 44.7 | 44.6 | 44.4 |
Table 3: Mask mAP(%) of various methods
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| Mask R-CNN(R50-FPN) | 37.2 | 36.9 | 37.1 | 37.2 | 36.9 | 36.7 |
| Mask R-CNN(R101-FPN) | 38.6 | 38.4 | 38.6 | 38.6 | 38.4 | 38.2 |
| Mask R-CNN(X101-FPN) | 39.5 | 39.3 | 39.5 | 39.5 | 39.3 | 39.1 |
| YOLOv8n-seg | 30.4 | 30.4 | 30.4 | 30.4 | 30.3 | 30.2 |
| YOLOv8s-seg | 36.7 | 36.6 | 36.7 | 36.7 | 36.5 | 36.5 |
Q4: Although the proposed method is tested on the proposed NMS-Bench, experimental results with benchmark detection algorithms are still preferred.
A4: To demonstrate the practical value of our methods, we not only tested them on NMS-Bench but also implemented them in torchvision. The results show that our methods are more efficient than the highly parallel CUDA NMS, which is the currently used program for NMS. For detailed information and experimental results, please refer to Appendix D.3.
Thanks for the detailed responses and additional experiments, which resolve my concerns and questions.
This work focuses on improving the latency of Non-Maximum Suppression (NMS), a crucial step for nearly all object detectors. The work analyzes NMS as a directed acyclic graph (DAG), treating bounding boxes as nodes and suppression relationships as arcs, allowing NMS solutions based on dynamic programming. Based on this graph interpretation, the work proposes two new approximate versions of NMS, named QSI-NMS and BOE-NMS, with different precision vs. latency trade-offs. Finally, the proposed NMS approaches are evaluated on a new benchmark, NMS-Bench, showing improved latency with little to no mAP loss.
Strengths
- The paper is written adequately, and offers a new graph theory perspective for non-maximum suppression, further exploring avenues for new research in the area.
- The proposed approaches show improved latency compared to other NMS approaches including original (greedy NMS), Fast NMS and Cluster NMS, while maintaining mAP. This is achieved without fine-tuning the underlying model.
Weaknesses
- The paper does not compare against (or even cite) other approximations to NMS proposed in the literature, such as MaxPoolNMS [1], PSRR-MaxpoolNMS [2], or ASAP-NMS [3]. It's unclear how the contribution and performance of the proposed approximations differ from the literature. For example, the idea behind BOE-NMS, using locality between suppression relationships, is already explored in the above works.
- Can this approach be utilized for two-stage object detectors as well, for both stages of NMS? The proposed benchmark only applies the methods to single-stage object detectors; it's unclear how the method performs on two-stage detectors.
[1] Cai, Lile, et al. "MaxpoolNMS: Getting rid of NMS bottlenecks in two-stage object detectors." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[2] Zhang, Tianyi, et al. "PSRR-MaxpoolNMS: Pyramid shifted MaxpoolNMS with relationship recovery." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[3] Tripathi, Rohun, et al. "ASAP-NMS: Accelerating non-maximum suppression using spatially aware priors." arXiv preprint arXiv:2007.09785 (2020).
Questions
- The worst case complexity of the proposed approaches is still O(n log n), while other approaches, stated above claim O(n) complexity. Does this imply the other approaches scale better?
Limitations
The work does an adequate job of discussing the limitations of the proposed work.
Thank you for your feedback and for recognizing the value of our new graph theory perspective on non-maximum suppression. We will address your concerns in the following responses.
Q1: The worst-case complexity of the proposed approaches is still $O(n \log n)$, while other approaches stated above claim $O(n)$ complexity. Does this imply the other approaches scale better?
A1: Among the papers you mentioned, only PSRR-MaxpoolNMS claims to achieve $O(n)$ complexity. However, we do not believe that PSRR-MaxpoolNMS has better scalability than our methods, for two reasons:
First, we implemented PSRR-MaxpoolNMS and conducted efficiency comparisons. To better illustrate the impact of bounding box count on runtime, we performed experiments on YOLOv5-N, which has the highest number of bounding boxes, as shown in Figure 3 (a) of the global rebuttal PDF. Original NMS has the highest time cost due to its quadratic growth. For a clearer comparison between our methods and PSRR-MaxpoolNMS, we excluded Original NMS, as shown in Figure 3 (b). As the number of boxes increases, PSRR-MaxpoolNMS is faster than BOE-NMS and QSI-NMS but consistently slower than eQSI-NMS.
Second, strictly speaking, the complexity of PSRR-MaxpoolNMS is not $O(n)$. PSRR-MaxpoolNMS requires prior knowledge of the input image size and generates confidence score maps whose size depends on the image size. This aspect is not considered in its complexity analysis (whereas our methods are designed and analyzed independently of the image size). During the Channel Recovery stage of PSRR-MaxpoolNMS, the complexity of computing the nearest distances for channel mapping is $O(n \cdot a \cdot r)$, where $a$ and $r$ represent the number of anchor box scales and ratios, respectively. These quantities vary with different datasets and detectors and increase as image size and object count increase. The remaining stages (Spatial Recovery, Pyramid MaxpoolNMS, and Shifted MaxpoolNMS) can all be completed in $O(n)$ time. Thus, the overall complexity of PSRR-MaxpoolNMS is $O(n \cdot a \cdot r)$.
Q2: The paper does not compare against (or even cite) other approximations to NMS proposed in the literature, such as MaxPoolNMS, PSRR-MaxpoolNMS, or ASAP-NMS. It's unclear how the contribution and performance of the proposed approximations differ from the literature.
A2: Our research focuses on NMS methods applicable to general cases, so we inadvertently missed these important papers. Thank you very much for pointing this out. We will cite these papers in our revised version. Next, we will illustrate how our contributions differ from these papers through the following comparisons:
MaxpoolNMS and ASAP-NMS can only be used in the first stage of two-stage detectors, while PSRR-MaxpoolNMS, although applicable to various stages, is limited to anchor-based models. Our methods, however, can be used in any stage of any detector because we address the general case (see Section 2).
In terms of complexity, none of these methods is more efficient than eQSI-NMS (the complexity of PSRR-MaxpoolNMS has been analyzed in A1).
These methods involve many manually defined hyperparameters and are complex to implement, which limits their generalization across different models and datasets. In contrast, our methods require no additional parameters beyond those in Original NMS and are easy to implement.
Q3: The idea behind BOE-NMS, using locality between suppression relationships, is already explored in the above works. What is the difference between BOE-NMS and these methods?
A3: Although these methods also consider locality, BOE-NMS employs a fundamentally different approach and analysis. Not suppressing non-overlapping boxes is a very intuitive idea, but the key is how to locate these boxes. We discuss this issue in Appendix C.3 and prove that directly determining non-overlapping boxes is a more difficult problem than checking whether the centroid of one box falls within another box (the basic idea of BOE-NMS), as shown in Appendix B.6. Therefore, BOE-NMS can be efficiently implemented. In contrast, methods in the literature use complex techniques to find approximate solutions to this problem. Additionally, BOE-NMS is rigorously proven not to cause any loss of accuracy (Theorem 4), whereas the methods in the literature are not as rigorous and exhibit some degree of accuracy degradation.
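The centroid test discussed above is a simple geometric predicate; a toy comparison of the two predicates (our own illustration, not the BOE-NMS implementation) makes the distinction concrete:

```python
def centroid_inside(a, b):
    """Is the centroid of box a strictly inside box b? Boxes are (x1, y1, x2, y2)."""
    cx, cy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    return b[0] < cx < b[2] and b[1] < cy < b[3]

def overlaps(a, b):
    """Do two boxes have positive intersection area?"""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))

a = (0, 0, 10, 10)
b = (4, 4, 20, 20)
assert centroid_inside(a, b) and overlaps(a, b)

# Boxes can overlap without containing each other's centroid, which is why
# the centroid test serves as an efficient filter rather than a drop-in
# replacement for the exact overlap predicate.
c = (8, 8, 30, 30)
assert overlaps(a, c) and not centroid_inside(a, c)
```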
Q4: Can this approach be utilized for two-stage object detectors as well, for both stages of NMS? The proposed benchmark only applies the methods to single-stage object detectors; it's unclear how the method performs on two-stage detectors.
A4: Yes, our methods are applicable to various stages of various detectors. We tested not only one-stage detectors but also two-stage detectors (Faster R-CNN). Please refer to Table 1 in Section 5.2. We have also added experiments for the first stage in Faster R-CNN. The results demonstrate that our methods still exhibit strong performance advantages, with QSI-NMS and eQSI-NMS even surpassing Original NMS in terms of accuracy. Please see the tables below.
Table 1: R50-FPN
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| mAP(%) | 40.2 | 39.6 | 40.2 | 40.2 | 40.3 | 40.3 |
| latency (μs) | 14768 | 2469 | 4501 | 2341 | 950 | 457 |
Table 2: R101-FPN
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| mAP(%) | 42.0 | 41.3 | 42.0 | 42.0 | 42.1 | 42.0 |
| latency (μs) | 13089 | 2411 | 4477 | 2325 | 995 | 464 |
Table 3: X101-FPN
| Original NMS | Fast NMS | Cluster-NMS | BOE-NMS | QSI-NMS | eQSI-NMS | |
|---|---|---|---|---|---|---|
| mAP(%) | 43.0 | 42.4 | 42.9 | 43.0 | 43.1 | 43.1 |
| latency (μs) | 12583 | 2383 | 4463 | 2265 | 984 | 457 |
I thank the authors for their detailed response. My main concern regarding comparison with prior work, especially PSRR-MaxPoolNMS has been partially addressed in the rebuttal.
The proposed work provides a new perspective for NMS using graph theory, and the best method offered, eQSI-NMS, offers improved latency, while it's still unclear how much it trades off in terms of performance compared to PSRR-MaxPoolNMS. I've improved my rating, as my perception of the paper has improved.
I would still like to see a detailed mAP comparison on NMS-Bench against PSRR-MaxPoolNMS. If the authors have implemented the algorithm, I don't see why mAP performance has not been reported in the rebuttal for PSRR-MaxPoolNMS?
We sincerely appreciate the time and effort you have invested in reviewing our rebuttal, and we are pleased that our response has addressed your main concern.
We are also more than happy to provide a comparison of the mAP performance between our methods and PSRR-MaxpoolNMS. We apologize for being unable to fully present this in the rebuttal, as we had reached the character limit. We tested the anchor-based models (Faster R-CNN, YOLOv5) to which PSRR-MaxpoolNMS is applicable, using the MS COCO dataset. Please see the results in the tables below.
Table 1: Faster R-CNN R50-FPN (average #bounding boxes: 251)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 39.8 | 37.5 | 39.8 | 39.5 | 39.3 |
| latency() | 53.0 | 89.0 | 43.1 | 34.3 | 24.6 |

Table 2: Faster R-CNN R101-FPN (average #bounding boxes: 236)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 41.8 | 39.5 | 41.8 | 41.5 | 41.4 |
| latency() | 45.5 | 86.4 | 38.6 | 31.9 | 23.0 |

Table 3: Faster R-CNN X101-FPN (average #bounding boxes: 214)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 43.0 | 40.5 | 43.0 | 42.7 | 42.5 |
| latency() | 37.0 | 86.9 | 33.4 | 28.6 | 20.6 |

Table 4: YOLOv5-N (average #bounding boxes: 2898)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 27.8 | 26.5 | 27.8 | 27.5 | 27.4 |
| latency() | 8568.5 | 599.6 | 906.2 | 628.0 | 325.6 |

Table 5: YOLOv5-S (average #bounding boxes: 1974)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 37.2 | 35.6 | 37.2 | 36.9 | 36.6 |
| latency() | 3858.2 | 409.4 | 547.7 | 408.2 | 217.5 |

Table 6: YOLOv5-M (average #bounding boxes: 1810)

| | Original NMS | PSRR-MaxpoolNMS | BOE-NMS | QSI-NMS | eQSI-NMS |
|---|---|---|---|---|---|
| mAP(%) | 45.1 | 43.1 | 45.1 | 44.9 | 44.5 |
| latency() | 2918.1 | 380.5 | 424.7 | 371.3 | 197.4 |
As you can see, eQSI-NMS achieves the lowest latency while maintaining a favorable trade-off with mAP. In contrast, PSRR-MaxpoolNMS suffers an mAP loss of 1.3 to 2.5 points across the Faster R-CNN and YOLOv5 models. This is likely due to the following reasons:
- Some hyperparameters need adjustment when using the MS COCO dataset, as MS COCO is more complex compared to PASCAL VOC, with a richer variety of categories and broader scenes. Therefore, the number of scales and ratios may need to be increased to suit the MS COCO dataset. In contrast, our proposed methods do not require additional parameters beyond those used in Original NMS, making them more applicable to general cases.
- The PSRR-MaxpoolNMS paper does not specify how the input image size and the related hyperparameter settings of PSRR-MaxpoolNMS should be determined. In our implementation, we chose a fixed input size large enough to accommodate all images in the MS COCO dataset. This might affect accuracy, though we believe the impact is minimal. This also indicates that our method offers better generalizability.
In the case of Faster R-CNN, the latency of PSRR-MaxpoolNMS is not competitive. This is because PSRR-MaxpoolNMS requires 8 maxpooling operations, which, although they do not affect the algorithm's asymptotic complexity, introduce a large constant factor that hampers efficiency when the number of bounding boxes is small (e.g., the average number of bounding boxes in the three Faster R-CNN models is under 300). It performs well only when the number of bounding boxes is large (e.g., YOLOv5-N has an average of 2898 bounding boxes). This demonstrates that the speedup of PSRR-MaxpoolNMS is highly dependent on the degree of parallelism, whereas our method directly reduces computational overhead (see Figure 6 in Appendix D.3), making it hardware-agnostic and suitable for resource-constrained edge devices.
In summary, the limitations of PSRR-MaxpoolNMS in improving efficiency while balancing accuracy are significant. In contrast, our method, which uses the same inputs and parameters as Original NMS, is a plug-and-play algorithm that can directly replace Original NMS. To demonstrate this, we implemented our method in the torchvision library and compared it with the highly parallel CUDA NMS [1], where our method exhibited significant superiority (see Table 7 in Appendix D.3).
Thanks again for the time you invested in writing your comments. We hope this response can thoroughly address your concern.
[1] TorchVision maintainers and contributors. TorchVision: PyTorch’s Computer Vision library, November 2016.
Thank you for the detailed response addressing my concerns. I've increased my rating accordingly.
We appreciate the reviewers' thoughtful feedback on our work. We are grateful for the reviewers' recognition of the following strengths:
- Innovative Application of Graph Theory:
  - Reviewer pEVt highlighted that our paper introduces a new graph theory perspective for non-maximum suppression (NMS), which explores new research avenues in this area.
  - Reviewer snSe praised our innovative application of graph theory in the QSI-NMS and BOE-NMS algorithms, noting that it provides a detailed theoretical basis and offers a new perspective for understanding and optimizing NMS.
- Significant Performance Improvements:
  - Reviewer HR1f noted the impressive speedups achieved by our algorithms, with QSI-NMS providing up to a 10.7× speedup and BOE-NMS achieving 5.1×, all while maintaining mean Average Precision (mAP).
  - Reviewer kDQX recognized the substantial speed increases of eQSI-NMS with minimal loss in accuracy, which is highly beneficial for real-time object detection applications.
  - Reviewer pEVt also acknowledged the improved latency of our proposed approaches compared to traditional NMS methods, including greedy NMS, Fast NMS, and Cluster NMS, without the need for fine-tuning the underlying model.
- Comprehensive and Robust Evaluation:
  - Reviewer kDQX appreciated our comprehensive evaluation using the newly developed NMS-Bench, which offers detailed comparisons of performance improvements over traditional methods and enhances the robustness of our results.
  - Reviewer snSe commended the construction of the benchmark, highlighting its solid contribution to the community by standardizing the evaluation of NMS algorithms and supporting further research.
- Valuable Theoretical Insights and Practical Contributions:
  - Reviewer HR1f emphasized the detailed analysis of the intrinsic structure of NMS through graph theory, which offers valuable insights into its computational bottlenecks and optimization strategies.
  - Reviewer snSe acknowledged that our work provides a new perspective and tool for understanding and optimizing the NMS step in object detection, which contributes both theoretically and practically to the field.
  - Reviewer kDQX highlighted the practical impact of our proposed methods, particularly eQSI-NMS, in offering substantial speed increases with minimal loss in accuracy, which is crucial for real-time applications.
We would also like to express our gratitude to the reviewers for their thoughtful questions and valuable suggestions. These have been instrumental in guiding the refinement of our work. We will address the reviewers' concerns by clarifying concepts, providing additional experiments, and offering further explanations within the scope of the feedback provided.
Both Reviewer snSe and Reviewer HR1f have expressed concerns about the slight decrease in accuracy with QSI-NMS, so we address this common concern here.
Q1: Why does QSI-NMS have a negligible mAP loss? What conditions can lead to accuracy loss?
A1: In QSI-NMS, we use a divide-and-conquer strategy, which means that bounding boxes in different subproblems do not affect each other. In some special cases, QSI-NMS may assign nodes from the same Weakly Connected Component (WCC) to different subproblems, potentially causing some nodes that should have been suppressed to be retained.
We provide a case study in the PDF with results from the YOLOv8-M model on the COCO dataset. In Figure 1 (a), the blue boxes represent the outputs of Original NMS/BOE-NMS, while Figure 1 (b) shows the outputs of QSI-NMS, with red boxes indicating additional boxes retained by QSI-NMS. It can be seen that QSI-NMS retains four additional boxes.
For example, consider the box/node numbered 188. The WCC containing this node is shown in Figure 2 (a). All other nodes in the WCC would suppress node 188, but under our partitioning criterion, node 188 ends up in a different subproblem from the other nodes, as shown in Figure 2 (b). The figure shows a partial structure of the QSI-tree: solid lines indicate parent-child relationships, and dashed lines indicate ancestor-descendant relationships. The red nodes are nodes from the WCC, while the black node 148 is the Lowest Common Ancestor (LCA) of nodes 188 and 201; node 156 is the LCA of nodes 194 and 193. Since each node in this WCC can only be suppressed by its red ancestor nodes (Lines 201-208), node 188 is not suppressed. However, node 194 is still suppressed because node 201 is its ancestor.
This example highlights the core of QSI-NMS design: the pivot selection and the partitioning criterion. If we choose these two appropriately, the accuracy loss of QSI-NMS can be negligible. In our algorithm design: pivot selection chooses the most representative nodes (with the highest confidence scores), so node 194 is correctly suppressed by node 201 even after being placed in a different subproblem from node 193. The partitioning criterion aims to assign nodes from the same WCC to the same subproblem as much as possible, which helps reduce cases like node 188 being incorrectly retained. We also discuss other partitioning criteria in Appendix C.2.
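The divide-and-conquer pattern described above can be sketched as follows. This is an illustrative sketch only: the pivot choice (highest-score box) matches our design, but the left/right split on centroid x-coordinate is a hypothetical stand-in for the actual partitioning criterion (see Appendix C.2), chosen here just to show how boxes in different subproblems cannot suppress one another.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def qsi_style_nms(dets, iou_thr=0.5):
    """dets: list of (score, box) pairs; returns the kept boxes.

    Divide-and-conquer sketch: the highest-score box is the pivot;
    it suppresses overlapping boxes, and the survivors are split into
    two spatial subproblems that are solved independently, so boxes
    in different subproblems never suppress each other (the source of
    the rare extra retained boxes discussed above).
    """
    if not dets:
        return []
    pivot = max(dets, key=lambda d: d[0])
    # Keep only boxes the pivot does not suppress.
    rest = [d for d in dets
            if d is not pivot and iou(d[1], pivot[1]) <= iou_thr]
    # Illustrative partitioning criterion: centroid x vs pivot centroid x.
    px = (pivot[1][0] + pivot[1][2]) / 2.0
    left = [d for d in rest if (d[1][0] + d[1][2]) / 2.0 < px]
    right = [d for d in rest if (d[1][0] + d[1][2]) / 2.0 >= px]
    return [pivot[1]] + qsi_style_nms(left, iou_thr) + qsi_style_nms(right, iou_thr)
```

With three detections, two heavily overlapping and one far away, the sketch keeps the top-scoring box of the overlapping pair plus the distant box, mirroring the behavior of greedy NMS on a well-separated WCC.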
The reviewers unanimously recommend acceptance of the paper with varying degrees of strength. I agree with their assessment and am happy to accept this paper for publication at the NeurIPS conference.
The reviews and the rebuttal have given rise to several interesting points and results that I encourage the authors to include in the camera-ready version. In particular, the complexity and computation time in practice of your methods and their baselines were much discussed and should be clarified in the paper.