Progressive Exploration-Conformal Learning for Sparsely Annotated Object Detection in Aerial Images
We address the sparsely annotated aerial object detection task with a Progressive Exploration-Conformal Learning (PECL) framework.
Abstract
Reviews and Discussion
Summary: The paper addresses the challenge of sparsely annotated object detection (SAOD) in aerial images, a critical task for real-world aerial intelligence systems where annotations are limited. Acknowledging the difficulty posed by the imbalanced probabilities and confidences in predicted aerial objects, the paper proposes a novel Progressive Exploration-Conformal Learning (PECL) framework. This framework adaptively selects high-quality pseudo-labels to enhance detection performance. It comprises a conformal pseudo-label explorer and a multi-clue selection evaluator, which together form a decision-making paradigm for pseudo-label exploration. The paper also demonstrates that their method outperforms existing SOTA methods on the DOTA and HRSC2016 datasets. Contributions: The paper makes significant contributions to the field of semi-supervised aerial object detection by addressing key challenges related to sparse annotations and proposing a robust framework that improves detection performance through adaptive and progressive pseudo-label exploration.
Strengths
(1) The overall paper is technically sound. (2) For originality and significance, the proposed PECL integrates a conformal pseudo-label explorer with a multi-clue selection evaluator to adaptively select high-quality pseudo-labels, offering a new perspective on handling sparse annotations in object detection. (3) For quality and clarity, the methodological framework is well-structured, detailing the iterative process between pseudo-label exploration and detector updating.
Weaknesses
(1) While the PECL framework is innovative, its complexity might pose challenges for practical implementation. The multi-layer perceptron for the conformal pseudo-label explorer and the iterative training process could be computationally intensive. The impact of these modules on the overall computational cost and inference time should be discussed. (2) Can the authors provide more details on the computational efficiency of the PECL framework? Specifically, how does the iterative process impact training time, and are there any optimizations that can be applied to improve efficiency? (3) How does the PECL framework specifically address the detection of small and occluded objects in aerial images? Are there any additional strategies or modifications that could further enhance performance for these challenging cases? (4) The writing is somewhat unsatisfactory, especially in the first section, where the rationale for the proposed method is not well explained, making it difficult for readers to develop interest in the approach. (5) In Lines 71-73, it is difficult for readers to see that experimental performance is one of the main contributions without convincing data. (6) In Line 142, “which can can adaptively” is an error. Moreover, the capitalization of subsection and subsubsection titles should be unified in Section The Proposed Method.
Questions
See the weaknesses above.
Limitations
(1) While the paper claims state-of-the-art performance, it lacks comparative experiments with SOTA methods from January to May 2024. Adding several comparative experiments with SOTA methods, such as those from CVPR/TGRS 2024, can be considered. (2) For Section Related Work, it would be better to add some of the latest works published in 2024. For Section The Proposed Method, it would arouse the readers’ interest to add an illustration of the proposed SCIR framework.
Thank you for your constructive suggestions. Here is our detailed response.
(1) Our proposed conformal pseudo-label explorer and multi-clue selection evaluator are meticulously encapsulated classes that can be directly invoked. In practical implementation, they exhibit excellent user-friendliness, with no challenges to overcome.
(2) Compared with baselines, during the training process, our PECL has a higher time cost due to an iterative detector updating procedure and an extra conformal pseudo-label exploration procedure; however, in the testing process, our method takes the same time as baselines with the same detection network.
(3) Compared with other semi-supervised/sparsely annotated methods, our PECL is more efficient in the learning process. Specifically, the training times of our PECL, Unbiased Teacher[1], and Co-mining[2] are 13.35 h, 16.74 h, and 15.42 h, respectively, on the DOTA dataset at a 5% label rate. We conduct all experiments on two NVIDIA 2080Ti GPUs.
(4) Additionally, we further consider accelerating the convergence speed of the explorer and evaluator by reducing the complexity of the exploratory state and action spaces, as well as the frequency of updating the target network, thereby enhancing overall computational efficiency.
(5) For small and occluded objects in aerial images, our proposed conformal pseudo-label explorer designs exploratory characteristics based on the non-conformity score, which considers the imbalanced probabilities between small and large objects. In future work, we will incorporate the size of objects into the characteristic design to further enhance the performance and robustness of our algorithm.
(6) To make it clear that the experimental performance is one of our main contributions, we will revise Lines 71-73 to: "... demonstrate the effectiveness of our PECL, which outperforms the baselines and state-of-the-art methods by at least 5.63% and 1.35%, respectively ...".
(7) According to our investigation, there are only two semi-/sparsely supervised methods for object detection in aerial images in 2024, Pseudo-Siamese Teacher[3] (TGRS 2024) and S2O-Det[4] (TII 2024), and neither of them is open source. We will add them to the related work and track the comparative experiments in the future.
(8) For Section The Proposed Method, it would arouse the readers’ interest to add illustration chart of the proposed SCIR framework.
(9) According to your suggestion, we have presented a diagram of our proposed PECL framework in Figure 4 of the overall author rebuttal PDF file.
(10) For minor writing/formatting errors, we will polish this paper in the revision.
Reference
[1] Unbiased teacher for semi-supervised object detection.
[2] Co-mining: Self-supervised learning for sparsely annotated object detection.
[3] Pseudo-Siamese Teacher for Semi-Supervised Oriented Object Detection.
[4] S2O-Det: A Semisupervised Oriented Object Detection Network for Remote Sensing Images.
Thanks for your detailed responses to each of the weaknesses and limitations I raised. Your responses address certain aspects of my concerns regarding this work, and clear plans are provided for improving the manuscript. I keep my rating of Borderline Accept because several concerns remain unresolved. They are described below, together with suggestions:
- Some corresponding experimental analyses are necessary to show that the encapsulated classes of the conformal pseudo-label explorer and multi-clue selection evaluator are more user-friendly than other methods.
- It would be better to provide more specific evidence or detailed examples to substantiate that PECL has the same inference time as baselines with the same detection network. How does it behave with other recent detection networks, such as Mamba-based or DETR-based ones?
- The overall author rebuttal PDF file exceeds the limitation of one page.
(1) The conformal pseudo-label explorer takes the feature maps and predicted logits as input, computes the exploratory characteristics of the current pseudo-label, and outputs a two-dimensional selection probability distribution. The exploratory characteristics consist of the non-conformity score, the prediction probability, and the similarity score.
i) The non-conformity score is obtained through conformal prediction and represents the uncertainty of the current pseudo-label, to address the problem of imbalanced prediction probabilities between categories.
ii) The prediction probability is the predicted classification score by the detector at the current stage. It indicates the classification information of the current algorithm for pseudo-labels.
iii) The similarity score is a measure of the consistency between the current pseudo-label features and the maintained prototypes.
In addition, the above three exploratory characteristics can be obtained from both one-stage and two-stage detectors. The multi-clue selection evaluator takes the current characteristics and the output of the explorer as input and outputs the cumulative reward. We only need to input the necessary feature maps and predicted logits to complete the calculation in the encapsulated class, so it is user-friendly.
(2) While our PECL will inevitably extend the training time, it remains faster to train than semi-supervised methods, because a semi-supervised method needs to run two complete models at the same time, namely the teacher model and the student model, whereas our PECL only needs to add a few lightweight MLP layers. We conduct the experiments on two NVIDIA 2080Ti GPUs. Specifically, the training times of ReDet and S2ANet are 8.833 h and 7.265 h, while the training times of ReDet w/ PECL and S2ANet w/ PECL are 13.351 h and 10.772 h on the DOTA dataset at a 5% label rate. In contrast, SOOD (a semi-supervised method) based on ReDet needs 18.511 h under the same setting. For inference time, measured in Frames Per Second (FPS) on a 2080Ti, the baseline detector S2ANet and S2ANet w/ PECL both achieve 13.1 FPS, demonstrating that our PECL does not affect the inference time.
(3) As for DETR, due to the complex nature of aerial images (such as densely arranged objects), the DETR series is rarely used in aerial images. Besides, we encountered some difficulties when applying our PECL to DETR-based detectors, e.g. ARS-DETR. Our PECL is effective in pseudo-label exploration on dense object detectors, because a sufficient number of proposals provides enough samples for conformal prediction. However, for DETR-based detectors, which have sparse proposals, our PECL cannot perform well. As for Mamba-based detectors, according to our research, Mamba is currently used as a backbone for object detection and does not affect the application of PECL.
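The three exploratory characteristics listed in item (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `exploratory_characteristics`, the use of one minus the top class probability as the non-conformity score, and cosine similarity to a per-class prototype are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def exploratory_characteristics(logits, feature, prototypes):
    """Return (predicted class, [non-conformity, probability, similarity]).

    Assumed definitions, for illustration only:
      non-conformity = 1 - probability of the predicted class
      similarity     = cosine(feature, prototype of the predicted class)
    """
    probs = softmax(logits)
    cls = max(range(len(probs)), key=probs.__getitem__)
    prediction_probability = probs[cls]
    nonconformity = 1.0 - prediction_probability
    similarity = cosine(feature, prototypes[cls])
    return cls, [nonconformity, prediction_probability, similarity]


# Hypothetical candidate: 3 classes, a 2-D feature, one prototype per class.
cls, state = exploratory_characteristics(
    logits=[2.0, 0.0, 0.0],
    feature=[1.0, 0.0],
    prototypes=[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
)
print(cls, state)
```

In the framework described above, a state vector of this kind would be passed to the explorer's MLP, which outputs the two-dimensional select/reject distribution. That network is omitted here.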
This paper proposes a Progressive Exploration-Conformal Learning (PECL) framework to address the sparsely annotated object detection task, which can adaptively perform the selection of high-quality pseudo-labels. The pseudo-label exploration is formulated as a decision-making paradigm by adopting a conformal pseudo-label explorer and a multi-clue selection evaluator. Evaluations on two public datasets demonstrate the superiority of PECL.
Strengths
- The paper is well written.
- It seems interesting to simulate and learn the pseudo-label exploration process using reinforcement learning, i.e. the conformal pseudo-label explorer and the multi-clue selection evaluator.
- The proposed method seems to work well.
- The comparative experiments and ablation experiments are quite substantial.
Weaknesses
- Table 4 lacks comparisons for weakly supervised oriented object detection, which are supported in MMRotate. Specifically, HBox-supervised (H2RBox, H2RBox-v2), Point-supervised (Point2RBox, PointOBB).
- The authors mainly verified the method on aerial images, but the proposed technique does not seem to be strongly tied to this specific scenario, so it lacks verification in more convincing natural scenarios, e.g. COCO.
- More object detection methods for aerial images need to be covered in related work, such as OBB/HBB and full-/weakly-/semi-supervised methods.
- After the introduction of reinforcement learning, the visualization of the changes of pseudo labels during the training process needs to be presented. The desired result is a significant improvement in quality.
- The authors verified the method on a CNN-based detector; it would be better if it could also be verified on a DETR-based detector, e.g. ARS-DETR.
- The left fields of Table 1 (e.g. 1%) and Table 4 (e.g. Sparse-annotated) do not seem to be centered. You can use \multirow to adjust them.
Questions
- I thought of an interesting setting, namely Sparsely Annotated Weakly-Supervised Oriented Object Detection, such as sparsely annotated HBox. Is it possible to add a related experiment based on H2RBox-v2 and verify it in combination with PECL techniques? Of course, this is not necessary; it is just a sudden thought. If the authors can provide it, I am willing to further improve the rating.
- I am not a professional reinforcement learning researcher, so I am worried whether the training of the entire detector will be slower after the introduction of reinforcement learning. I hope the authors can clarify this.
Limitations
N/A
Thank you for your constructive suggestions. Here is our detailed response.
(1) H2RBox[1] learns object center localization from horizontal box annotations in the weak supervision branch, and utilizes scale and spatial constraints to learn object width, height, and rotation angle information in the self-supervision branch. H2RBox-v2[2] utilizes the principle of symmetry to learn flip-rotation consistency, thereby predicting the rotation angle of the object. PointOBB[3] and Point2RBox[4] use Multiple Instance Learning and Knowledge Combination, respectively, to learn rotated box regression from single point supervision. We will add more weakly-/semi-supervised aerial object detection methods to the related work in the revision, all of which aim to reduce the time and labor cost of rotated-box annotation.
(2) To demonstrate the generality of our PECL, we conducted experiments on a general detection dataset, COCO. Following the setting of BRL[7], four sparsely annotated training sets are generated. In Table 1 of the overall author rebuttal PDF file, we compare different methods with our method trained under three annotation sets. It can be seen that under all sparse conditions, our PECL outperforms Co-mining[8] by 0.2%, 0.6%, and 0.8%, respectively, indicating that our proposed method can mine supervised signals effectively.
(3) Thank you for the reminder; we will add more OBB/HBB, full-/weakly-/semi-supervised aerial object detection methods in the related work of the revision.
(4) The visualization of the changes of pseudo-labels during the training process has been presented in Figure 3 of the overall author rebuttal PDF file. The results indicate a significant improvement in the quality of pseudo-labels after introducing reinforcement learning.
(5) We encountered some difficulties when applying our PECL to DETR-based detectors, e.g. ARS-DETR[9]. Our PECL is effective in pseudo-label exploration on dense object detectors, because a sufficient number of proposals provides enough samples for statistics-based conformal prediction. However, for DETR-based detectors, which have sparse proposals, our PECL cannot perform well. This idea is worth further exploration in future work.
(6) For minor writing/formatting errors, we will polish this paper in the revision.
(7) Sparsely Annotated Weakly-Supervised Oriented Object Detection is indeed an interesting and valuable research setting. However, we encountered difficulties when integrating our PECL with H2RBox-v2. Specifically, H2RBox-v2 is a weakly supervised method based on the FCOS[5] detector, whose label assignment strategy requires all points on the feature map to participate, classifying each candidate point as a positive or negative sample. Therefore, this strategy requires very high-quality ground truth boxes. We need to modify the FCOS label assignment strategy to allow for the assignment of ignored samples (similar to Faster R-CNN[6]), in order to reduce dependence on the quality of ground truth boxes. This involves a lot of work, and we have so far only proposed a preliminary idea that has not yet been implemented. We will implement this in future research work.
(8) Compared with baselines, during the training process, our PECL has a higher time cost due to an iterative detector updating procedure and an extra conformal pseudo-label exploration procedure; however, in the testing process, our method takes the same time as baselines with the same detection network.
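The remark in item (5), that conformal prediction needs enough proposals, can be illustrated with the standard split-conformal quantile rule. This is a generic sketch rather than PECL's own procedure; the function name `conformal_threshold`, the miscoverage level `alpha`, and the toy scores are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold on non-conformity scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    When n is too small for the requested level, that rank exceeds n and
    the threshold is infinite, so nothing can be rejected.
    """
    scores = sorted(calibration_scores)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf
    return scores[rank - 1]


# Toy non-conformity scores (made up): a handful of proposals versus many.
sparse_scores = [0.2, 0.5, 0.1, 0.7, 0.4]
dense_scores = [i / 100 for i in range(1, 101)]

print(conformal_threshold(sparse_scores))  # rank 6 > 5 scores: infinite
print(conformal_threshold(dense_scores))   # 91st smallest of 100 scores
```

With only five calibration scores and `alpha=0.1`, the required rank is larger than the sample, so the threshold carries no information. With one hundred scores it is a usable order statistic. This is one way to read the difficulty reported for sparse-proposal detectors.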
Reference
[1] H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection.
[2] H2RBox-v2: Incorporating Symmetry for Boosting Horizontal Box Supervised Oriented Object Detection.
[3] PointOBB: Learning Oriented Object Detection via Single Point Supervision.
[4] P2RBox: A Single Point is All You Need for Oriented Object Detection.
[5] FCOS: Fully Convolutional One-Stage Object Detection.
[6] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
[7] Solving missing-annotation object detection with background recalibration loss.
[8] Co-mining: Self-supervised learning for sparsely annotated object detection.
[9] ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection.
Thanks to the authors for their detailed responses. Some of my concerns have been well addressed. I still have one question and some suggestions.
- Q1: The authors mention that training cost will increase, but how much? The authors need to clarify.
- S1: The authors mentioned that the current PECL will have new issues when combined with DETR detectors, and I suggest that the authors include this in the discussion of limitations.
- S2: As the authors replied to Reviewer hP3V, many semi/sparse-supervised methods are not open source. We hope that the authors can open source the code after the manuscript is accepted to promote the healthy development of the community.
(1) We conduct all experiments on two NVIDIA 2080Ti GPUs. Specifically, the training times of ReDet and S2ANet are 8.833 h and 7.265 h, while the training times of ReDet w/ PECL and S2ANet w/ PECL are 13.351 h and 10.772 h on the DOTA dataset at a 5% label rate.
(2) We will outline the application limitations of PECL in the limitations section of the original paper to serve as a reference for future work.
(3) We will make the code publicly available once the work is accepted.
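To put the figures in item (1) in relative terms, the short calculation below converts the reported training hours into an overhead percentage. The helper name `relative_overhead` is hypothetical; the hour values are the ones quoted in the response.

```python
def relative_overhead(baseline_hours, pecl_hours):
    """Extra training time of the PECL run as a percentage of the baseline."""
    return 100.0 * (pecl_hours - baseline_hours) / baseline_hours


# Training hours quoted in the response (DOTA, 5% label rate, two 2080Ti GPUs).
reported = {
    "ReDet": (8.833, 13.351),
    "S2ANet": (7.265, 10.772),
}

for name, (baseline, with_pecl) in reported.items():
    extra = with_pecl - baseline
    percent = relative_overhead(baseline, with_pecl)
    print(f"{name}: +{extra:.3f} h ({percent:.1f}% longer)")
```

On these numbers, adding PECL lengthens training by roughly half of the baseline time for both detectors.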
This paper proposes a new learning framework, PECL, for sparsely annotated object detection (SAOD) in aerial images. The framework introduces a conformal pseudo-label explorer and a multi-clue selection evaluator, which can leverage category-specific characteristics and inter-instance contextual relationships. Comprehensive experiments demonstrate the superior performance of the proposed solution compared to prior approaches. Additionally, the paper includes numerous ablation studies and case studies that examine the various properties of the PECL framework.
Strengths
- Aerial imagery has many useful properties for sparsely annotated object detection. It is good that this paper explores those properties to develop a better SAOD framework.
- The paper includes a substantial number of ablation studies and case studies, which could unveil valuable insights into the proposed method.
Weaknesses
- The paper highlights PECL's effectiveness in addressing SAOD tasks by comparing the PECL-enabled version with a simple supervised baseline. The results are appealing, which is good. However, when comparing PECL with other SAOD solutions, the improvement is modest, at 1.35%.
- The SAOD task in aerial imagery does not appear to be more challenging than SAOD tasks for general imagery, as objects in aerial imagery typically have only one degree of freedom in rotation. This reduces the significance of PECL's contribution.
- The writing can be improved. The paper is somewhat hard to follow, for example, lines 128-131. Sometimes, even the logic is hard to follow. For instance, in line 31, the average number of objects per image seems irrelevant, as it is a dataset statistic. Neighboring aerial images could be stitched together to form a larger image with more objects.
Questions
Please check out the weaknesses section for details.
Limitations
I don't see potential negative societal impact of this work.
Thank you for your constructive suggestions. Here is our detailed response.
(1) Our proposed PECL has improved performance by at least 5.63% compared to supervised baselines. Compared to other state-of-the-art methods, we have also achieved at least 1.35% growth. Given the limited improvement of other methods, this achievement is already quite challenging in the current research field, and it also indicates that our method still has a lot of room for improvement.
(2) Compared to SAOD task in general images, SAOD task in aerial images faces greater challenges.
- Firstly, objects in aerial images have characteristics such as arbitrary orientation, dense distribution, and large aspect ratios. Figure 1 of the overall author rebuttal PDF file counts the number of objects in each image in the DOTA and COCO datasets, which reflects the relatively high density of objects in aerial images. Motivated by this, our proposed PECL utilizes rich contextual information between objects to mine high-quality pseudo-labels.
- Secondly, aerial datasets often exhibit serious long-tail problems, resulting in significantly imbalanced prediction results for different classes. Figure 2 of the overall author rebuttal PDF file shows the statistical analysis of the selected candidates of each class. As shown in Figure 2, when the threshold is 0.9, fewer positive and negative candidates are selected; when it is reduced to 0.5, both positive and negative candidates increase. This confirms that it is difficult to select pseudo-labels with a fixed threshold. In contrast, our PECL employs reinforcement learning to learn an adaptive policy to decide the confidence of selecting candidates. Therefore, SAOD in aerial images is necessary and challenging.
- The diversity of rotation angles remains another challenge in aerial object detection. The introduction of rotation angles is not merely about adding a parameter to learn. The periodicity of angles leads to the boundary discontinuity and square-like problems[1], causing issues in the regression of rotated boxes and affecting the localization quality of rotated bounding boxes. This is also an issue worth investigating in the SAOD task and can serve as the motivation for future work.
(3) For minor writing/formatting errors, we will polish this paper in the revision.
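The fixed-threshold observation in item (2) can be reproduced on toy data. The sketch below is purely illustrative: the candidate scores are made up, the helper `select` is hypothetical, and it only shows the trade-off of a single global threshold, not the learned policy used by PECL.

```python
def select(candidates, threshold):
    """Return (true positives, false positives) kept at a fixed threshold."""
    kept = [c for c in candidates if c["score"] >= threshold]
    true_positives = sum(c["is_object"] for c in kept)
    return true_positives, len(kept) - true_positives


# Toy candidates (made-up numbers): a head class with confident scores and
# a tail class whose true objects receive lower scores.
candidates = [
    {"cls": "head", "score": 0.95, "is_object": True},
    {"cls": "head", "score": 0.92, "is_object": True},
    {"cls": "head", "score": 0.60, "is_object": False},
    {"cls": "tail", "score": 0.70, "is_object": True},
    {"cls": "tail", "score": 0.55, "is_object": True},
    {"cls": "tail", "score": 0.52, "is_object": False},
]

strict = select(candidates, 0.9)  # keeps only the two head-class objects
loose = select(candidates, 0.5)   # keeps every object, plus both negatives
print(strict, loose)
```

In this toy case the strict threshold misses every tail-class object, while the loose one admits all false candidates. That is the trade-off an adaptive selection policy is meant to avoid.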
Reference
[1] Detecting Rotated Objects as Gaussian Distributions and Its 3-D Generalization
Accurate annotation is a key factor in ensuring object detection performance. However, the manual annotation process is time-consuming and labor-intensive, especially in remote sensing scenarios with densely arranged objects. In this paper, we propose the Sparsely Annotated Object Detection (SAOD) task, which aims to perform remote sensing object detection using sparse annotations (e.g., 5%, 10%). To address the SAOD task in aerial images, we propose an innovative Progressive Exploration-Conformal Learning (PECL) framework, which consists of a conformal pseudo-label explorer and a multi-clue selection evaluator. This framework takes into account the imbalance in predicted results between categories and the rich contextual information to mine confident pseudo-labels, thereby enhancing the performance of the detector. The conformal pseudo-label explorer learns an adaptive pseudo-label policy by maximizing cumulative rewards. The multi-clue selection evaluator aims to optimize the policy by providing guiding feedback. Ultimately, the explored pseudo-labels can be utilized to guide the closed-loop iterative optimization of the aerial detector. We respond to each reviewer under their comments.
The paper addresses the challenge of sparsely annotated object detection (SAOD) in aerial images, a demanding but practical task for real-world aerial intelligence systems where there is a wider variety of aerial scenes with the same number of annotated objects. The approach can adaptively perform the selection of high-quality pseudo-labels to enhance detection performance. It comprises a conformal pseudo-label explorer and a multi-clue selection evaluator, which together form a decision-making paradigm for pseudo-label exploration.
The paper also demonstrates that the method outperforms existing SOTA methods on the DOTA and HRSC2016 datasets. However, when comparing PECL with other SAOD solutions, the improvement is modest, at 1.35%. Also, more object detection methods for aerial images need to be covered in related work, such as OBB/HBB and full-/weakly-/semi-supervised methods. The authors mainly verified the method on aerial images, but the proposed technique does not seem to be strongly tied to this specific scenario, so it lacks verification in more convincing natural scenarios, e.g. COCO.