CSPCL: Category Semantic Prior Contrastive Learning for Deformable DETR-Based Prohibited Item Detectors
Abstract
Reviews and Discussion
This paper proposes a plug-and-play contrastive learning mechanism, named Category Semantic Prior Contrastive Learning (CSPCL), to address the feature coupling issue caused by object overlap in X-ray images. CSPCL ingeniously leverages category semantic prior knowledge embedded in the classification head to guide and enhance content queries, enabling Deformable-DETR-based models to achieve stronger foreground feature perception. To support this guidance, the paper introduces a novel contrastive learning loss function, CSP loss, which consists of ITA loss for positive pairs within the same class and IAR loss for negative pairs across different classes. This design decomposes the contrastive learning problem into two sub-problems—inter-class repulsion and intra-class attraction—allowing for more targeted optimization aligned with specific tasks. CSP loss demonstrates superior performance compared to classical N-pair loss and InfoNCE loss. Extensive evaluations on four prohibited item detection datasets show that CSPCL can improve the detection accuracy of Deformable DETR-based models without introducing additional computational overhead and further establishes a new state-of-the-art benchmark for prohibited item detection models.
Strengths and Weaknesses
Strengths:
1. The idea proposed in this paper is highly novel. Instead of focusing on the typical modifications of label assignment strategies or network structures in mainstream DETR-like models, the authors introduce a simple and efficient contrastive learning strategy that guides content queries to learn category semantic features, thereby improving detection accuracy. This is particularly effective for addressing the feature coupling issue between foreground and background caused by the unique overlapping phenomenon in X-ray images.
2. CSPCL is a plug-and-play method that can be broadly applied to Deformable-DETR-based models. It improves detection accuracy without increasing computational complexity, which is a significant advantage in the current landscape that favors large models and scaling laws.
3. The CSP loss is simple, intuitive, and effective. It decomposes the contrastive learning problem into two sub-problems: inter-class repulsion and intra-class attraction. This targeted design, tailored to the specific task, outperforms classical losses such as N-pair loss and InfoNCE loss.
4. CSPCL demonstrates strong generalizability, achieving effective results across different models, backbones, and datasets. This contribution is commendable, as it represents a more fundamental and insightful approach rather than a simple model-specific modification.
5. The paper is well-written, clearly structured, and easy to follow. The method is explained step by step, the experiments are comprehensive and rigorous, and the figures clearly illustrate how the proposed augmentation strategy is applied during training.
Weaknesses:
(1) The code for the paper does not appear to be fully open-sourced; instead, it only provides the implementation of the main innovations.
(2) In Tables 3 and 4, the paper should supplement experimental results for AO-DETR (ResNet-50) + CSPCL to comprehensively demonstrate whether smaller-scale models can also achieve SOTA results with the CSPCL mechanism.
(3) The Non-Maximum Suppression (NMS) post-processing step has not been cited.
Questions
1. The authors should consider adding additional experiments, for example, to investigate whether CSPCL can help smaller models, such as AO-DETR (ResNet-50), surpass the performance of SOTA models.
2. Based on the paper and code, the current design of CSP loss only works when the number of content queries is an integer multiple of the number of categories. The authors should consider how category-specific content queries should be allocated when this is not the case, such as with 100 content queries and 15 categories. If this aspect can be improved, CSPCL would be able to adapt to fine-tuning tasks based on Deformable-DETR models with any number of content queries, further enhancing the significance of this work.
3. For better readability, it would be helpful to clarify the meaning of all variables. For example, in Lines 151-160, the dimensions N and C are not explained. Although this doesn't affect understanding, it does impact reading fluency.
4. The authors should carefully check the completeness of the references to ensure that all relevant citations are included. This will help readers better understand the content of the paper and the inspirations behind it.
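One possible allocation for the non-divisible case raised in question 2 (a hypothetical round-robin scheme, not anything from the paper or its code) could look like:

```python
def allocate_queries(num_queries: int, num_classes: int) -> list[int]:
    """Assign a class id to each content query.

    When num_queries is not an integer multiple of num_classes, the first
    (num_queries % num_classes) classes each receive one extra query.
    """
    base, extra = divmod(num_queries, num_classes)
    counts = [base + (1 if c < extra else 0) for c in range(num_classes)]
    return [c for c, n in enumerate(counts) for _ in range(n)]

labels = allocate_queries(100, 15)
# 100 queries over 15 classes: classes 0-9 get 7 queries, classes 10-14 get 6
```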
Limitations
Yes. It's worth noting that, in natural-light images, densely packed objects often cause occlusion, leading to the loss of foreground object features, whereas the foreground-background feature coupling problem commonly seen in X-ray images is rare. Therefore, the fact that CSPCL's effectiveness cannot be demonstrated on the COCO dataset actually highlights the strength of the proposed method in addressing the specific challenge of overlapping objects.
Final Justification
After reading the response, I decide to maintain my original score.
Formatting Issues
None
Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.
1. The code for the paper does not appear to be fully open-sourced; instead, it only provides the implementation of the main innovations.
Thank you for your review of our work and for raising these questions.
We fully understand the importance of open-source code for the reproducibility and further development of research. The core code related to the embedding and application of CSP loss has been fully presented in the supplementary material ZIP file, which is sufficient for understanding the operational logic and execution details of CSPCL. However, due to the risk of code leakage and plagiarism during the review period, we have currently only made public the implementation code for the main innovations in the paper.
Once the paper is accepted, we will open-source the complete code and weight files to promote academic exchange and collaboration. Thank you for your understanding and support.
2. In Tables 3 and 4, the paper should supplement experimental results for AO-DETR (ResNet-50) + CSPCL to comprehensively demonstrate whether smaller-scale models can also achieve SOTA results with the CSPCL mechanism.
Thank you for your valuable suggestion!
Our original intention was to train the best prohibited item detection model, which led us to overlook the fairness of the comparison in Tab. 3 of the original paper.
To comprehensively and fairly demonstrate whether smaller-scale C-AO-DETR models can also achieve SOTA results, we trained C-AO-DETR with the same backbone network, ResNet-101, on the PIDray dataset. The comparison results are as follows:
| Method | Backbone | Image Size | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} |
|---|---|---|---|---|---|---|
| FCOS | ResNeXt-101-64x4d | 1333 × 800 | 72.9 | 63.4 | 51.1 | 62.5 |
| FSAF | ResNeXt-101-64x4d | 1333 × 800 | 71.0 | 61.9 | 50.7 | 61.2 |
| SDANet | ResNet-101 | 500 × 500 | 71.2 | 64.2 | 49.5 | 61.6 |
| ForkNet | ResNet-101 | 500 × 500 | 75.0 | 66.9 | 58.6 | 66.8 |
| C-AO-DETR (ours) | ResNet-101 | 500 × 500 | 76.1 | 69.2 | 61.5 | 68.9 |
Under the same backbone network and image size, C-AO-DETR still outperforms the latest prohibited item detectors, SDANet and ForkNet. Furthermore, general object detectors, such as FCOS and FSAF, even with more powerful backbones (ResNeXt-101-64x4d) and larger image sizes (1333 × 800), still fail to surpass C-AO-DETR in the prohibited item detection task.
Additionally, for the CLCXray dataset, our results are as follows:
| Method | AP | AP_{50} | AP_{75} | AP_{S} | AP_{M} | AP_{L} |
|---|---|---|---|---|---|---|
| DOAM | 54.3 | 68.5 | 63.5 | 31.2 | 27.3 | 59.9 |
| LAreg | 58.5 | 70.9 | 67.7 | 12.6 | 30.5 | 63.8 |
| LAcls | 59.3 | 71.8 | 68.2 | 23.0 | 32.4 | 64.5 |
| PIXDet | 60.4 | 72.4 | 69.4 | 25.3 | 31.7 | 66.0 |
| ForkNet | 62.0 | 74.0 | 71.4 | 35.6 | 41.9 | 67.7 |
| C-AO-DETR (ours) | 62.7 | 74.3 | 71.9 | 45.1 | 36.2 | 68.4 |
Overall, C-AO-DETR proposed in this paper demonstrates superior detection accuracy. More importantly, the CSPCL mechanism is not limited to AO-DETR. When more advanced Deformable DETR-based models are available, CSPCL remains applicable, enhancing the model's anti-overlap detection capability.
3. The Non-Maximum Suppression (NMS) post-processing step has not been cited.
Thank you for your meticulous review and valuable suggestions.
We will cite the Non-Maximum Suppression (NMS) post-processing step in the final version to ensure that all related work is appropriately acknowledged. Additionally, we have conducted a thorough review of the paper to ensure that there are no other citation omissions.
Thank you to the author for the thoughtful response. After carefully evaluating the opinions of other reviewers and considering the author's rebuttal, I believe the author has effectively addressed my concerns. Therefore, I maintain my original rating.
Dear Reviewer jNYc,
Thank you for your affirmation, encouragement, and valuable contribution!
If you have any concerns, please feel free to contact me, so I can further improve the quality of the paper and contribute to the community. Thank you for your valuable insights and selfless efforts.
Best Regards,
Authors.
This paper proposes a Category Semantic Prior Contrastive Learning (CSPCL) mechanism to improve prohibited item detection in Deformable DETR-based models, addressing the challenge of object overlapping in X-ray images. The core idea is to remedy the issue of content queries losing category-specific semantic information during the decoder's update process. CSPCL aligns content queries with class prototypes derived from the classifier's weights. This is achieved through a devised CSP loss, featuring two key components: an Inter-Class Adaptive Repulsion (IAR) loss that pushes dissimilar classes apart with a force proportional to their feature similarity, and an Intra-Class Truncated Attraction (ITA) loss that prevents feature homogenization by stopping attraction at a set threshold. The authors demonstrate that this plug-and-play module enhances performance across four datasets without increasing inference complexity.
Strengths and Weaknesses
Strengths
- The paper is well-motivated, clearly identifying a plausible weakness in Deformable DETR models to provide a justification for its proposed method.
- The paper proposes a practical method, as the CSPCL module is plug-and-play and adds no computational overhead during inference.
Weaknesses
- The paper's originality seems limited. This paper appears to be an incremental improvement on existing contrastive learning works, which have been widely explored in various object detection architectures, such as [1] and [2]. The idea of inter-class and intra-class contrasts has also been adopted in previous studies, such as [3] and [4]. Specifically, there also exists a prototype-based contrastive learning work in X-ray prohibited item detection [5]. The proposed adaptive loss, while thoughtful, refines this existing foundation rather than introducing a new paradigm. Therefore, it is suggested that the authors further address the above originality concerns.
- The paper lacks a direct experimental comparison with other X-ray detection methods. In Table 3, none of the anti-overlapping methods (e.g., [6][7]) mentioned in the related works are compared. Additionally, the most relevant competitor, MMCL [8], which uses contrastive learning on DETR-based models for X-ray images, is neglected. The baseline model AO-DETR is not compared, either. Therefore, it is suggested that the authors conduct head-to-head comparisons with more X-ray detectors to empirically validate the claimed benefits of the proposed CSPCL.
- The SOTA comparison in Table 3 is difficult to interpret fairly due to an inconsistent baseline. The presented C-AO-DETR uses a Swin-L backbone, which is more powerful than the ResNet/ResNeXt backbones used by most competing methods listed in the table. This makes it difficult to attribute performance gains to the CSPCL module itself rather than to the superior capacity of its baseline architecture.
- The proposed method appears to have limited applicability beyond Transformer-based architectures. Similar to techniques that are exclusively applicable to CNN-based detectors, this work seems constrained to Transformer architectures. Moreover, the results in Table 3 show that the performance of Transformer-based detectors does not surpass that of the CNN-based Mask R-CNN. Therefore, it is recommended that the authors provide further explanation of the practical value of CSPCL.
- The paper does not clearly explain how the proposed method solves the object overlapping problem. The argument relies on indirect reasoning and a single qualitative example, rather than a quantitative analysis that directly measures performance on overlapping objects.
- The paper lacks a sufficient discussion of its key design choices. This includes the selection and sensitivity analysis of new hyperparameters, as well as the justification for which decoder layer to apply the proposed loss to.
[1] Wei, Fangyun, et al. "Aligning pretraining for detection via object-level contrastive learning." Advances in Neural Information Processing Systems 34 (2021): 22682-22694.
[2]Xie, Enze, et al. "Detco: Unsupervised contrastive learning for object detection." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[3]Li, Wenbin, et al. "Modeling inter-class and intra-class constraints in novel class discovery." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
[4]Chen, Hao, Benoit Lagadec, and Francois Bremond. "Ice: Inter-instance contrastive encoding for unsupervised person re-identification." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[5]Tao, Renshuai, et al. "Few-shot x-ray prohibited item detection: A benchmark and weak-feature enhancement network." Proceedings of the 30th ACM international conference on multimedia. 2022.
[6]Y. Wei, R. Tao, Z. Wu, Y. Ma, L. Zhang, and X. Liu, “Occluded prohibited items detection: An x-ray security inspection benchmark and de-occlusion attention module,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 138–146.
[7]C. Zhao, L. Zhu, S. Dou, W. Deng, and L. Wang, “Detecting overlapped objects in x-ray security imagery by a label-aware mechanism,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 998–1009, 2022
[8] "MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection." arXiv preprint arXiv:2406.03176, 2024.
Questions
1.The proposed method defines class prototypes using the classifier's weights, which are simultaneously updated by both the classification loss and the new CSP loss. This raises a question about the stability of the prototypes. It is suggested that the authors discuss if this dual-update mechanism could cause the prototypes to become unstable, or merely reflect an average of the content queries rather than serving as a fixed semantic prior.
2.The ITA loss is designed with a gradient truncation mechanism to preserve intra-class variance, with the stated goal of handling variations in object appearance. However, the direct benefit of this preserved variance is not empirically demonstrated. It is suggested that the authors provide a more direct analysis linking this specific design choice to a measurable improvement in performance, for example, on categories with high fine-grained diversity.
3.The IAR loss utilizes an exponential function to scale the repulsion strength based on inter-class prototype similarity. The paper does not provide a justification for this specific mathematical choice. It is suggested that the authors explain why an exponential function was chosen over simpler alternatives, such as a linear scaling factor.
4. The paper is ambiguous regarding which decoder layers the CSP loss is applied to, mentioning "layer 1" in one instance but a "set of target layers" in the algorithm description. It is suggested that the authors clarify this crucial implementation detail and provide a clear justification for their choice.
5. The core idea of using classifier weights as learnable class prototypes has been explored in prior literature such as ADAM [1]. It is suggested that the authors more clearly articulate the novelty of their approach in comparison to these existing works. A more explicit discussion of how the proposed CSP loss, particularly its adaptive and truncated nature, provides a distinct advantage over other prototype-based learning schemes would help to better situate the paper's contribution.
[1]Zhou D W, Cai Z W, Ye H J, et al. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need[J]. International Journal of Computer Vision, 2025, 133(3): 1012-1032.
Limitations
Yes.
Final Justification
The authors have addressed most of my earlier concerns with thoughtful and detailed responses. However, the inter-class repulsion concept in CSPloss is relatively common, which limits the perceived novelty. For the X-ray prohibited items detection task, the loss design could be better tailored to challenges such as heavy occlusion and overlapping objects. Overall, given the improvements and clarifications provided, I lean towards a borderline accept.
Formatting Issues
No formatting issues noticed. The paper follows NeurIPS 2025 guidelines.
Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below. Please note: due to the rebuttal character limit, for questions that have already been addressed in other reviewers' comments, we will only indicate the reviewer's name and the question number; please check the relevant answer for the detailed response. Additionally, the reference numbers cited below match those in your comments. Thank you for your understanding!
1.1 This paper appears to be an incremental improvement on existing contrastive learning works, which has been widely explored in various object detection architectures, such as [1] and [2].
Both SoCo [1] and DetCo [2] design self-supervised pretext tasks to provide a strong pre-trained model for instance-level detection tasks. Essentially, they are upstream tasks for fully supervised object detection in natural-light images, whereas CSPCL focuses on the downstream task of prohibited item detection in X-ray images. They are compatible at their respective training stages; for example, one could fine-tune a model pre-trained with DetCo for prohibited item detection tasks.
Weakness: SoCo and DetCo are designed for unsupervised tasks, and their construction of object-level positive and negative sample pairs relies on selective search and local patch sets of image-level inputs. To ensure the quality of sample pairs, these models are more suitable for pre-training on "one image for one class" datasets like ImageNet. However, when facing complex multi-object overlapping X-ray image datasets such as PIXray, their supervisory information becomes unreliable.
1.2 The idea of inter-class and intra-class contrasts has also been adopted in previous studies, such as IIC loss [3] and ICE loss [4].
We greatly appreciate the valuable advice! This comparison on IIC loss [3] and ICE loss [4] has already been addressed in response to Question 2 of Reviewer eTaq. Due to the rebuttal character limit, please kindly refer to that for further details, thank you!
1.3 Specifically, there also exists a prototype-based contrastive learning work in X-ray prohibited items detection [5].
WEN[5] focuses on a different aspect, as it is designed for few-shot prohibited item detection. Specifically, first, the PR module generates a prototype library by aggregating and extracting the base features from critical regions around instances obtained by the RPN and ROI modules. Then, the FR module adaptively integrates the base features and the corresponding prototype to enhance the feature extraction ability of few-shot models.
Inter-class repulsion design: The prototype-based contrastive learning in this work is computed through orthogonalization (inter-class repulsion) after dividing ROI proposals into K groups: the loss accumulates a cosine-based distance between every pair of class prototypes with distinct class indices, over all K classes. Thus, by minimizing this loss, inter-class prototypes gradually tend toward orthogonality.
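The orthogonalization objective described above can be sketched as follows (an illustrative reconstruction: the exact formula from [5] is not reproduced in this response, so the squared-cosine form and the pair normalization are assumptions):

```python
import math

def orthogonal_repulsion(prototypes: list[list[float]]) -> float:
    # Penalize the squared cosine similarity between every pair of distinct
    # class prototypes, averaged over the K * (K - 1) ordered pairs, so that
    # minimizing the loss drives inter-class prototypes toward orthogonality.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    k = len(prototypes)
    return sum(cos(prototypes[i], prototypes[j]) ** 2
               for i in range(k) for j in range(k) if i != j) / (k * (k - 1))
```

Under this sketch, orthogonal prototypes (e.g. `[[1, 0], [0, 1]]`) yield zero loss, while identical prototypes yield the maximum value of 1.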
Weakness: This orthogonal repulsion strategy does not align with the true distribution of class prototypes among the samples.
1.4 The proposed adaptive loss, while thoughtful, refines this existing foundation rather than introducing a new paradigm. Therefore, it’s suggested that the author should further address the above originality concerns.
CSPCL is specifically designed for fully supervised prohibited item detection tasks, whose originality is reflected in two main aspects:
Firstly, our plug-and-play CSPCL mechanism utilizes class prototypes to clarify the class semantic information of content queries for the Deformable DETR-based models.
Secondly, its CSP loss, through intra-class gradient truncation and inter-class adaptive repulsion, ensures that intra-class content queries do not become homogenized (preserving the necessary feature diversity within each class), while aligning inter-class content queries with the class prototypes' feature distribution so that discriminative identifying features are learned for similar categories.
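To make the two mechanisms concrete, here is a minimal sketch (the hinge form of ITA, the threshold `tau`, and the exponential weighting in IAR are assumptions inferred from the description above, not the paper's exact equations):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ita_loss(query, own_prototype, tau=0.8):
    # Intra-Class Truncated Attraction: attract the query toward its class
    # prototype, but truncate the attraction once similarity reaches tau,
    # so intra-class queries keep some diversity instead of collapsing.
    return max(0.0, tau - cosine(query, own_prototype))

def iar_loss(query, own_prototype, other_prototypes):
    # Inter-Class Adaptive Repulsion: repel the query from other classes'
    # prototypes, weighting each term by exp(prototype-prototype similarity)
    # so that easily confused (similar) classes are pushed apart harder.
    return sum(math.exp(cosine(own_prototype, p)) * max(0.0, cosine(query, p))
               for p in other_prototypes)
```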
2.1 The paper lacks a direct experimental comparison with other X-ray detection methods. In table 3, none of the anti-overlapping methods (e.g. [6][7]) mentioned in related works are compared.
Thank you for your suggestion, and apologies for the confusion! We have made clarifications and added experiments:
Clarification: Tab. 3 in the original paper already includes results for the SOTA prohibited item detectors, SDANet and ForkNet, on the PIDray dataset. Additionally, Tab. 2 in the supplementary material contains comparisons between methods like LAreg[7], LAcls[7], DOAM[6], and our CSPCL on the CLCXray[7] dataset (Note that CLCXray was chosen because some methods like LAreg, LAcls, and MMCL[8] are not fully open-sourced, making it impossible to perform a complete comparison on the PIDray dataset).
Addition Experiment: For a fair comparison, we provide the results of C-AO-DETR (ResNet-101; 500 × 500), as follows:
| Method | Backbone | Image Size | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} |
|---|---|---|---|---|---|---|
| SDANet | ResNet-101 | 500×500 | 71.2 | 64.2 | 49.5 | 61.6 |
| ForkNet | ResNet-101 | 500×500 | 75.0 | 66.9 | 58.6 | 66.8 |
| C-AO-DETR (ours) | ResNet-101 | 500×500 | 76.1 | 69.2 | 61.5 | 68.9 |
As shown, our C-AO-DETR outperforms other prohibited item detection models under the same conditions.
2.2 Additionally, the most relevant competitors MMCL[8] that uses contrastive learning on DETR-based models for X-ray images is neglected.
Thank you for your constructive advice! The comparison experiments and analysis about MMCL[8] have already been conducted in response to Question 2 of Reviewer eTaq.
2.3 The baseline model AO-DETR is not compared, either.
Thanks for your valuable advice!
Regarding the comparison with the baseline model AO-DETR, we had validated it on both the PIXray and OPIXray datasets (as shown in Tab. 2 of the original paper). Please kindly check it.
The further comparison of the baseline model AO-DETR on PIDray has been addressed in response to Question 3 of Reviewer BMUt.
3 The SOTA comparison in Table 3 is difficult to interpret fairly due to an inconsistent baseline. The presented C-AO-DETR uses a Swin-L backbone, which is more powerful than the ResNet/ResNeXt backbones used by most competing methods listed in the table.
Thank you for your important suggestion! The fair comparison with the same ResNet-101 backbone has been conducted in response to Question 1 of Reviewer eTaq.
4.1 The proposed method appears to have limited applicability beyond Transformer-based architectures. Similar to techniques that are exclusively applicable to CNN-based detectors, this work seems constrained to Transformer architectures.
In YOLO or Faster R-CNN architectures, due to the absence of an encoder-decoder structure (and no content queries), CSPCL cannot be directly integrated to leverage content queries for learning category semantic priors in the classifier.
However, inspired by WEN[5], we can attempt to adaptively align and combine features from the backbone with classifier prototypes to supplement the foreground feature semantics and improve model performance. We will evaluate the feasibility of this approach and include the results in the revised version of the paper where possible.
4.2 Moreover, the results in Table 3 show that the performance of Transformer-based detectors does not surpass the CNN-based Mask R-CNN. Therefore, it is recommended that the authors provide further explanation of the practical value of CSPCL.
We have added an experiment demonstrating that, under the same backbone and image size, AO-DETR outperforms Mask R-CNN. The specific results are as follows:
| Method | Backbone | Image Size | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} |
|---|---|---|---|---|---|---|
| Mask R-CNN | ResNet-101 | 500×500 | 64.7 | 59.0 | 43.8 | 55.8 |
| AO-DETR | ResNet-101 | 500×500 | 75.7 | 68.5 | 60.2 | 68.1 |
5 The paper does not clearly explain how the proposed method solves the object overlapping problem. The argument relies on indirect reasoning and a single qualitative example, rather than a quantitative analysis that directly measures performance on overlapping objects.
We greatly appreciate the valuable advice!
The more in-depth theoretical analysis has already been addressed to Question 1 of Reviewer BMUt.
To address the missing quantitative analysis in the original manuscript, we supply an experiment on the PIDray dataset, which has three subsets with increasing occlusion and overlap levels: easy, hard, and hidden. The specific results are as follows:
| Method/Metric | AP_{easy} | AP_{hard} | AP_{hidden} |
|---|---|---|---|
| AO-DETR | 81.8 | 72.6 | 55.7 |
| C-AO-DETR (ours) | 82.2 | 73.3 | 56.9 |
| Absolute Increment (AI) | 0.4 | 0.7 | 1.2 |
| Increment Space (IS) | 18.2 | 27.4 | 44.3 |
| Relative Increment (RI) | 2.20(min) | 2.55 | 2.71(max) |
We design the RI metric (RI = AI / IS, expressed as a percentage), which is fairer than the raw AI metric because it normalizes each gain by the remaining improvement headroom.
Discussion: Both AI and RI values show that CSPCL significantly improves the model's overlapping detection ability, with the largest relative improvement in the hidden subset (2.71%), followed by the hard (2.55%) and easy (2.20%) subsets. This confirms that CSPCL enhances the model's anti-overlapping ability to handle more complex occlusion scenarios, particularly when objects are heavily overlapped.
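The AI/IS/RI computation above can be reproduced directly from the table values (RI in percent):

```python
ao   = {"easy": 81.8, "hard": 72.6, "hidden": 55.7}   # AO-DETR AP per subset
ours = {"easy": 82.2, "hard": 73.3, "hidden": 56.9}   # C-AO-DETR AP per subset

def relative_increment(base_ap: float, new_ap: float) -> float:
    ai = new_ap - base_ap       # Absolute Increment
    space = 100.0 - base_ap     # Increment Space: headroom left below 100 AP
    return ai / space * 100.0   # Relative Increment, in percent

ri = {k: round(relative_increment(ao[k], ours[k]), 2) for k in ao}
# RI grows with occlusion level: easy (2.2) < hard (2.55) < hidden (2.71)
```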
6. The paper lacks a sufficient discussion of its key design choices. This includes the selection and sensitivity analysis of new hyperparameters, as well as the justification for which decoder layer to apply the proposed loss.
Sorry for the confusion! Note that we had conducted these experiments and had placed the results in the supplementary materials. Please kindly check the ZIP file in the supplementary materials again. Thank you! In the revised version, we will place this content in the main text!
Thank you to the authors for the thoughtful rebuttal and the additional clarifications. However, the method still requires further advancement in terms of genuine innovation, rather than merely adapting existing ideas to a new setting. Therefore, I have decided to maintain my original score.
Dear reviewer, we appreciated your efforts when reviewing this paper. With only a couple of days remaining for discussion, could you please kindly provide your responses and further concerns (if any) to the authors' rebuttal? Thanks!
Dear reviewer,
Thank you for your careful review and feedback!
I would appreciate any suggestions on areas where the innovation of my method could be further improved.
The following points need to be clarified:
(1) If the concern is about the innovation of our contrastive loss function, then, as mentioned in our response to Question 1, our method differs from previous loss functions in that it is specifically designed for the fully supervised prohibited item detection task (multi-class and multi-sample).
(2) Specifically, our CSP loss, through the intra-class gradient truncation and inter-class adaptive repulsion mechanisms, ensures that intra-class content queries do not become homogenized, preserving the necessary feature diversity within the class, while also ensuring that inter-class content queries align with the class prototypes' feature distribution, learning discriminative identifying features for similar categories.
(3) Note that this paper is the first to demonstrate the importance of clarifying query class semantics in the decoder of Deformable DETR, representing a significant exploration for DETR-like models.
(4) Additionally, the proposed plug-and-play CSPCL mechanism is the first to utilize class prototypes from classifier weights to clarify the class semantic information of content queries for Deformable DETR-based models.
(5) Therefore, our CSPCL holds substantial potential implications for the field of prohibited item detection and even for the general object detection and segmentation field.
If you have any concerns, please feel free to contact me, so I can further improve the quality of the paper and contribute to the community. Thank you for your valuable insights and selfless efforts.
Best regards,
Authors
Dear Reviewer GPtN,
We would greatly appreciate it if you could kindly recommend a specific contrastive learning work that you believe would be more suitable, powerful, and innovative for our task! Your suggestion would help us gain a better understanding and address your concerns more effectively.
Through discussions with Reviewers BMUt (Q5), eTaq (Q2), and GPtN (Q1) regarding the innovation of the contrastive loss, CSP loss, we fully recognize that CSP loss has substantial potential value, although our previous description of its innovation was not clear and thorough enough.
Below is the addition: For the multi-class multi-sample contrastive task, intra-class queries need to prevent homogenization, while inter-class queries need to extract the discriminative identifying features for similar categories.
Once again, we would like to express our gratitude for your hard work. If you have any suggestions or opinions, please feel free to share them with me at any time.
Best Regards,
The Authors.
Dear Reviewer GPtN,
Regarding the innovation of our paper, we are pleased to have discovered a recent contrastive learning paper in CVPR2025 and have made a comparison with it. The supplementary analysis is as follows:
Object-aware Contrastive (OCA) Loss [1]: OCA loss is based on the InfoNCE loss; its intra-class aggregation comes from a term that contrasts the similarity of positive sample pairs against that of negative sample pairs, and additionally acts on positive samples whose cosine similarity falls below a threshold. This formula essentially aims to orthogonalize positive sample pairs that are already dissimilar.
Weakness: This method cannot prevent the homogenization of similar positive sample pairs within a class, while our ITA loss does. Therefore, this approach is not suitable for our multi-class and multi-sample task.
[1] Object-aware Sound Source Localization via Audio-Visual Scene Understanding, CVPR2025
This paper proposes CSPCL, a Category Semantic Prior Contrastive Learning mechanism designed to enhance Deformable DETR-based object detectors for X-ray security imagery, particularly for detecting prohibited items. The method introduces a novel CSP loss, composed of Intra-Class Truncated Attraction (ITA) and Inter-Class Adaptive Repulsion (IAR), to align content queries with class prototypes (classifier weights), thereby correcting and supplementing the semantic information degraded by overlapping features. CSPCL is plug-and-play, architecture-agnostic, and demonstrates consistent performance gains across four datasets and multiple backbone models.
Strengths and Weaknesses
Strengths:
- This paper focuses on a critical issue in the field of X-ray detection.
- The proposed plug-and-play CSPCL outperforms classic contrastive learning methods—such as N-pair and InfoNCE losses—in enhancing anti-overlapping detection capability.
- Experiments demonstrate the effectiveness of the proposed CSPCL for X-ray image detection.
Weaknesses:
- I observe that the backbone of the proposed C-AO-DETR is Swin-L, which is inherently more powerful than those used in the other baselines; moreover, the input image sizes vary across methods, making the comparison unfair.
- The paper only compares against relatively old contrastive losses (InfoNCE, N-pair), none of which were designed specifically to improve anti-overlapping detection. The authors should include more recent loss functions to further validate their approach.
- The models listed in Table 2 are quite similar; the authors should evaluate a broader range of more recent, diverse architectures to strengthen their analysis.
Questions
- Does the model trained with this loss generalize well to other downstream tasks, such as zero-shot or open-set detection?
- Could you provide additional visual results—for example, ten more examples—similar to those shown in Figure 5?
Limitations
Yes, the authors discuss limitations in supplementary material.
Final Justification
My concerns have been addressed.
Formatting Issues
NA
Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.
1. I observe that the backbone of proposed C-AO-DETR is Swin-L, which is inherently more powerful than those used in the other baselines; moreover, the input image sizes vary across methods, making the comparison unfair
Thank you for the valuable advice!
Our original intention was to train the most powerful prohibited item detector, which led us to overlook the fairness of the comparison in Tab. 3 of the original paper.
For fairness, we train C-AO-DETR with the same backbone network, ResNet-101, on the PIDray dataset:
| Method | Backbone | Image Size | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} |
|---|---|---|---|---|---|---|
| FCOS | ResNeXt-101-64x4d | 1333×800 | 72.9 | 63.4 | 51.1 | 62.5 |
| FSAF | ResNeXt-101-64x4d | 1333×800 | 71.0 | 61.9 | 50.7 | 61.2 |
| SDANet | ResNet-101 | 500×500 | 71.2 | 64.2 | 49.5 | 61.6 |
| ForkNet | ResNet-101 | 500×500 | 75.0 | 66.9 | 58.6 | 66.8 |
| C-AO-DETR (ours) | ResNet-101 | 500 × 500 | 76.1 | 69.2 | 61.5 | 68.9 |
Under the same backbone and image size, C-AO-DETR still outperforms other SOTA prohibited item detectors, SDANet and ForkNet. Furthermore, general object detectors, such as FCOS and FSAF, even with more powerful backbones (ResNeXt-101-64x4d) and larger image sizes (1333 × 800), still fail to surpass C-AO-DETR in the prohibited item detection task.
This result will be added to Table 3 in the original paper.
2. The paper only compares against relatively old contrastive losses (InfoNCE, N-pair), none of which were designed specifically to improve anti-overlapping detection. The authors should include more recent loss functions to further validate their approach
Thanks for your valuable advice! We have added an analysis and comparison of three recent works on contrastive losses, including MMCL[1], IIC loss [2], and ICE loss [3].
(1) MMCL [1]: MMCL is, to our knowledge, the only prior work that uses contrastive learning to improve the anti-overlapping detection ability of Deformable DETR-based models. MMCL groups content queries and applies orthogonal repulsion between groups, enabling content queries to learn different category semantic information. However, this orthogonal repulsion does not align with the true distribution of class prototypes among the samples. As shown in Fig. 1 of our original paper, the representations of knife and saw are closer together in feature space than those of knife and fireworks, because knife and saw share more similar color, texture, and contour information. Consequently, the inter-class orthogonal repulsion in MMCL hinders content queries from learning the true category semantics, particularly for similar categories such as knife and saw.
Analyzing MMCL with formulas: The inter-class repulsion function of MMCL is

$$L_{MIE}(Q) = -\frac{1}{K(K-1)n^2} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} 1[k_1 \neq k_2] \cdot \log(1 - sim(q_i^{k_1},q_j^{k_2})),$$

where sim() represents the cosine similarity, $q_i^{k_1}$ represents the $i$-th content query from the $k_1$-th group, and $q_j^{k_2}$ is the $j$-th content query from the $k_2$-th group.

Weakness: This method lacks the correct guidance from class prototypes; it simply applies equivalent repulsion between content queries from different groups, pushing them toward orthogonality.

**Improving the MM loss using the CSPCL strategy:** Following our CSPCL strategy, if we replace $q_i^{k_1}$ with the class prototype $p_i^{k_1}$, the inter-class repulsion formula of MMCL becomes

$$L_{MIE-CSPCL}(Q, K) = -\frac{1}{K(K-1)n^2} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \sum_{i=1}^{n} \sum_{j=1}^{n} 1[k_1 \neq k_2] \cdot \log(1 - sim(p_i^{k_1},q_j^{k_2}))$$

Weakness: This variant still lacks an adaptive repulsion mechanism and merely pushes inter-class content queries toward orthogonality.

**The advantages of our CSPCL:** The adaptive repulsion factor $R(k_1,k_2)=e^{1-\tau\cdot\big(1-\text{sim}(\mathbf{p}_i^{k_1},\mathbf{p}_j^{k_2})\big)}$ in IAR strengthens the repulsion between highly similar categories (like knife and saw) and weakens it for dissimilar categories (like knife and fireworks). This nuanced adjustment ensures that inter-class content queries align with the class prototypes' feature distribution, learning discriminative identifying features for similar categories, thereby improving the model's ability to distinguish between subtly different categories in overlapping scenes.
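The difference between MMCL's uniform orthogonal repulsion and IAR's adaptive factor can be sketched in a few lines of numpy. The prototype vectors below are made-up toy values, and the functions are illustrative re-implementations of the formulas above, not the paper's code.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmcl_repulsion(q1, q2):
    # MMCL-style term -log(1 - sim): the same repulsion strength is applied
    # to every inter-class pair, pushing all groups toward orthogonality.
    return -np.log(1.0 - cos_sim(q1, q2) + 1e-8)

def iar_factor(p1, p2, tau=1.0):
    # IAR's adaptive factor R(k1, k2) = exp(1 - tau * (1 - sim(p1, p2))):
    # similar class prototypes yield a larger factor, i.e. stronger repulsion.
    return np.exp(1.0 - tau * (1.0 - cos_sim(p1, p2)))

# Toy class prototypes (illustrative values only).
p_knife = np.array([1.0, 0.2, 0.1])
p_saw = np.array([0.9, 0.3, 0.2])        # visually similar to knife
p_fireworks = np.array([0.1, 1.0, 0.0])  # visually dissimilar to knife

# Similar categories are repelled harder than dissimilar ones.
print(iar_factor(p_knife, p_saw) > iar_factor(p_knife, p_fireworks))  # True
```

Here the MMCL term would repel both pairs with the same functional form, whereas the IAR factor scales the push by prototype similarity.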
**Analysis with Experimental Results on PIXray:**

| Method | AP | AP$_{50}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ |
|---|---|---|---|---|---|---|
| DINO | 64.3 | 86.5 | 71.0 | 19.3 | 48.9 | 73.9 |
| DINO + MM loss (MMCL) | 66.7 | 87.5 | 74.4 | 23.5 | 50.7 | 75.5 |
| DINO + MM-CSPCL loss | 67.0 | 87.6 | 74.3 | 23.3 | 51.4 | 75.9 |
| DINO + CSP loss (our CSPCL) | 67.5 (+3.2) | 88.7 | 74.6 | 23.3 | 52.3 | 76.8 |

In this table, MM-CSPCL loss refers to replacing the inter-class repulsion part of MMCL's MM loss, namely the MIE loss, with the MIE-CSPCL loss. The results show that MM-CSPCL loss outperforms the basic MM loss, proving the effectiveness of the CSPCL mechanism in guiding content queries using class prototypes. Furthermore, CSP loss outperforms both MM loss and MM-CSPCL loss, demonstrating that our CSP loss helps content queries learn more accurate category semantics, thereby enhancing the model's ability to handle overlapping detection tasks.

**(2) IIC loss [2]:** The inter-class repulsion function is based on the Kullback-Leibler divergence:

$$L_{sKLD} = -\frac{1}{2} \left( D_{KL}(p^l_i \| p^u_j) + D_{KL}(p^u_j \| p^l_i) \right)$$

where $p^l$ and $p^u$ compose negative sample pairs, and $i$ and $j$ are sample indices.

Weakness: This method is designed for comparing only two classes of samples. In a multi-class, multi-sample space, it applies the same strength of inter-class repulsion across all categories and does not ensure that inter-class content queries of similar categories maintain sufficient differences to learn discriminative identifying features.

The intra-class attraction function is

$$L=\frac{1}{N}\sum_{i=1}^N D_{KL}(p_l \| \hat{p_l}) + D_{KL}(\hat{p_l} \| p_l)$$

where $\hat{p_l}$ is the probability distribution of augmented data, which forms positive sample pairs with $p_l$. This constraint helps the model learn consistent intra-class feature representations.

Weakness: It lacks a gradient truncation mechanism like that of CSPCL (Eq. 6 in the original paper) to prevent excessive homogenization of intra-class content queries.

**(3) ICE loss [3]:** The intra-class attraction function is

$$L_{h\text{-}ins}=E\left[-\log\left(\frac{\exp\left(\langle f_i,m_k^i\rangle /\tau_{h\text{-}ins} \right)}{\sum_{j=1}^{J+1} \exp \left(\langle f_i,m_j\rangle/\tau_{h\text{-}ins} \right)}\right)\right]$$

where $m_k^i$ represents the hardest positive instance for anchor $f_i$, and $\tau_{h\text{-}ins}$ is a temperature hyperparameter. This method is essentially the InfoNCE loss with a special construction of positive sample pairs, and it only considers the intra-class sample furthest from the anchor/center.

Weakness: It leads intra-class content queries to lose necessary differences, as it lacks a gradient truncation mechanism like the ITA loss in our paper, which prevents intra-class homogenization.

The soft instance consistency function is

$$L_{s\text{-}ins} = D_{KL}(P \| Q)$$

where $P$ and $Q$ represent the distributions of inter-instance similarities before and after augmentation. It is based on the KLD loss, and its defects are the same as those of the IIC loss discussed above.

**3.1 The models listed in Table 2 are quite similar.**

CSPCL is specifically designed for the decoder structure of Deformable DETR models, and we validated its effectiveness in Tab. 2 on a series of Deformable DETR-based models whose structures and focuses differ. Specifically, RT-DETR targets real-time detection with an efficient hybrid encoder to process multi-scale features. DINO emphasizes convergence speed and stabilizes bipartite graph matching with a contrastive denoising training strategy. AO-DETR targets anti-overlap detection with the CSA strategy and LFD scheme to improve foreground classification and localization. Despite their different structures and training strategies, all follow the Deformable DETR decoder structure, so our CSPCL remains applicable, demonstrating strong generalization.
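The symmetric-KL constraints quoted in the loss comparisons above (IIC's sKLD term and the KLD-based soft term of ICE) can be sketched generically in numpy; this is an illustration of the quoted formulas, not code from either paper.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def skld_repulsion(p, q):
    # IIC-style inter-class term: the negated symmetric KL, so minimizing
    # the loss maximizes the divergence between the two distributions.
    # Note that the repulsion strength does not adapt to how similar the
    # underlying classes are -- the weakness discussed above.
    return -0.5 * (kl(p, q) + kl(q, p))

# The more separated the distributions, the lower (more negative) the loss.
print(skld_repulsion([0.9, 0.1], [0.1, 0.9]) < skld_repulsion([0.6, 0.4], [0.4, 0.6]))  # True
```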
**3.2 The authors should evaluate a broader range of more recent, diverse architectures to strengthen their analysis.**

Thank you for the valuable advice!

On the necessity of extending to more recent architectures: To the best of our knowledge, Deformable DETR-based models are currently the most advanced and accurate Transformer-based object detection architectures, outperforming CNN-based models like YOLO and Faster R-CNN.

Feasibility of extending CSPCL to other diverse architectures (CNN-based models): In YOLO or Faster R-CNN architectures, due to the lack of an encoder-decoder structure (no content queries), CSPCL cannot be directly incorporated to leverage content queries for learning category semantic priors from the classifier. However, inspired by WEN [4], we can attempt to adaptively align and combine features from the backbone with classifier prototypes to supplement foreground feature semantics and improve model performance. We will try our best to evaluate the feasibility of this approach and include the results in the revised version.

[1] MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection. arXiv, 2024.
[2] Li, Wenbin, et al. Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery. ICCV, 2023.
[3] Chen, Hao, et al. ICE: Inter-Instance Contrastive Encoding for Unsupervised Person Re-Identification. ICCV, 2021.
[4] Tao, Renshuai, et al. Few-Shot X-ray Prohibited Item Detection: A Benchmark and Weak-Feature Enhancement Network. ACM MM, 2022.

Sorry for the late reply. I recognize the efforts made by the authors in the rebuttal, and most of my concerns have been addressed.
Dear reviewer, we appreciated your efforts when reviewing this paper. With only a couple of days remaining for discussion, could you please kindly provide your responses and further concerns (if any) to the authors' rebuttal? Thanks!
Dear Reviewer eTaq,
Thank you for your affirmation, encouragement, and valuable contribution!
If you have any concerns, please feel free to contact me, so I can further improve the quality of the paper and contribute to the community. Thank you for your valuable insights and selfless efforts.
Best Regards,
Authors.
Dear Reviewer,
Regarding Question 2: "The authors should include more recent loss functions to further validate their approach." We are pleased to have discovered a recent related article and have made a comparison with it. The analysis is as follows:
Object-aware Contrastive (OCA) Loss [1]: OCA loss is based on the InfoNCE loss, and its intra-class aggregation functionality comes from a term in which one factor represents the similarity of positive sample pairs, another represents the similarity of negative sample pairs, and a selector restricts the term to positive samples whose cosine similarity falls below a threshold. This formula essentially aims to orthogonalize positive sample pairs that are already dissimilar.
Weakness: This method cannot prevent the homogenization of similar positive sample pairs within a class, while our ITA loss does. Therefore, this approach is not suitable for our multi-class, multi-sample task.
[1] Object-aware Sound Source Localization via Audio-Visual Scene Understanding, CVPR2025
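Since the exact OCA formula is not reproduced here, the numpy sketch below only illustrates the thresholding behavior described above: positives below a similarity threshold receive an InfoNCE-style term, while already-similar positives receive no term at all. The function and variable names are ours, not the paper's.

```python
import numpy as np

def infonce_term(sim_pos, sim_negs, temp=0.1):
    """Standard InfoNCE term for one positive and a set of negatives."""
    logits = np.concatenate(([sim_pos], np.asarray(sim_negs, float))) / temp
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

def oca_like_term(sim_pos, sim_negs, thresh=0.5, temp=0.1):
    # Only positives whose similarity is below the threshold contribute.
    # Positives that are already similar produce no gradient at all, so
    # nothing prevents them from homogenizing further -- the weakness that
    # ITA's gradient truncation is designed to avoid.
    if sim_pos >= thresh:
        return 0.0
    return infonce_term(sim_pos, sim_negs, temp)

print(oca_like_term(0.9, [0.1]))       # 0.0: similar positives are untouched
print(oca_like_term(0.2, [0.1]) > 0)   # True: dissimilar positives get a loss
```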
This paper focuses on improving prohibited item detection in X-ray images, where the main challenge lies in the overlapping of objects that causes foreground-background feature coupling—something general object detectors struggle with. To tackle this, the authors propose a Category Semantic Prior Contrastive Learning (CSPCL) framework, which enhances the model’s ability to focus on foreground features by aligning class prototypes with content queries during training. The method is evaluated on multiple X-ray datasets (PIXray, OPIXray, PIDray, and CLCXray), showing consistent improvements over existing state-of-the-art methods.
Strengths and Weaknesses
Strengths
- A plug-and-play module that can be integrated into Deformable DETR-based detectors without increasing inference complexity.
- Better performance than baseline methods.
Weaknesses
- Why aligning class prototypes with content queries can correct and supplement the missing semantic information is not immediately clear. The connection between the two is not direct, and the authors need to provide further explanation.
- Compared to existing DETR-based methods, what are the specific advantages of the approach proposed in this paper? In addition, the paper should also clarify the advantages of DETR-based methods over non-DETR-based methods.
- The performance of AO-DETR on PIDray in Table 3 is missing. This result is crucial for evaluating the effectiveness of the proposed method.
- The settings of C-AO-DETR (ours) and the comparison methods are different, which raises the question of whether the performance improvement comes from a better backbone or other configuration choices, rather than from the proposed method itself.
- The design of inter-class and intra-class losses is not novel, as it is conceptually very similar to approaches like triplet loss and contrastive loss. This limits the overall novelty of the paper.
Questions
Please see Weaknesses.
Limitations
Lack of sufficient (theoretical) explanation: The paper lacks a clear theoretical justification or intuitive explanation for the core claim that aligning class prototypes with content queries can correct and supplement missing semantic information. The logical connection between these elements is not fully established.
Unfair experimental comparisons: There are differences in experimental settings between C-AO-DETR and the comparison methods (e.g., DN-DETR), such as using a stronger backbone or more favorable training configurations. This makes it difficult to determine whether the performance improvements truly come from the proposed method itself.
Final Justification
Apart from the last loss being somewhat limited in terms of novelty, my other concerns have been well addressed by the authors. Overall, I have raised my score.
Formatting Issues
N/A
Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.
1 Why aligning class prototypes with content queries can correct and supplement the missing semantic information is not immediately clear. The connection between the two is not direct, and the authors need to provide further explanation.
Sorry for the confusion! This alignment serves two purposes as follows:
(1) Correcting the misguiding information caused by overlapping background features: After being trained on X-ray images with overlapping phenomena, the content queries can become ambiguous and misled by irrelevant background information. Aligning the content queries with class prototypes enables the queries to focus on the features specific to their corresponding categories, thereby correcting the noisy information in the queries caused by the overlapping background in the training phase.
(2) Supplying the missing semantic information caused by the integration of positional encoding information in the decoder phase: As shown in Sec. 3.1 of the original paper, during training, more and more positional encoding information is integrated into the content queries, which leads to the loss and catastrophic forgetting of the original class semantic information. The class prototypes are composed of classifier weights, which have learned reliable inherent class semantic priors. Therefore, the alignment ensures that the content queries do not deviate too much from the category semantic priors, supplying the missing and forgotten category semantic information.
Moreover, as shown in Fig. 4(c) of the original paper, the content queries of the original DINO are scattered randomly in the feature space and are far from the class prototypes (indicating low correlation between them). However, as shown in Fig. 4(d), after the alignment, the intra-class queries are clustered and distributed around their corresponding class prototypes (classifier weights). This also demonstrates that, through the alignment effect of CSPCL, the content queries are supplemented with effective category semantic information.
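As a schematic illustration of this alignment idea (not the paper's exact CSP loss in Eq. (5-8)), one can measure how far each content query drifts from its class prototype with a simple cosine term:

```python
import numpy as np

def alignment_loss(queries, labels, prototypes):
    """Mean (1 - cosine) between each content query and its class prototype.

    `prototypes` stands in for the classifier weight rows; this is a
    schematic sketch of the alignment idea, not CSPCL's actual loss.
    """
    total = 0.0
    for q, y in zip(queries, labels):
        p = prototypes[y]
        total += 1.0 - q @ p / (np.linalg.norm(q) * np.linalg.norm(p))
    return total / len(queries)

protos = np.array([[1.0, 0.0], [0.0, 1.0]])     # toy class prototypes
aligned = np.array([[0.9, 0.1], [0.1, 0.9]])    # queries near their prototypes
scattered = np.array([[0.1, 0.9], [0.9, 0.1]])  # queries near the wrong prototype

print(alignment_loss(aligned, [0, 1], protos) < alignment_loss(scattered, [0, 1], protos))  # True
```

Minimizing such a term during training pulls the scattered queries of Fig. 4(c) toward the clustered arrangement of Fig. 4(d).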
2.1 Compared to existing DETR-based methods, what are the specific advantages of the approach proposed in this paper?
CSPCL compensates for the core flaw of "category semantic ambiguity in content queries" in DETR-based models. Existing DETR-based methods (e.g., Deformable DETR, DINO) reduce model complexity and improve convergence speed with deformable attention, while reference point designs strengthen the spatial information of queries in the decoder (e.g., Conditional DETR, Anchor-DETR, DAB-DETR). Denoising training (e.g., DN-DETR, DINO) optimizes the stability of bipartite graph matching, and improved label assignment strategies (e.g., Stable-DINO, Group-DETR, H-DETR, Co-DETR) enhance the quality and quantity of samples for queries. However, these methods do not provide explicit class semantic guidance for content queries, leading to the loss of classification-related features during iterations.
CSPCL addresses this by using classifier weights as class prototypes, forcing content queries to align with the inherent feature distributions of their corresponding class prototypes (as shown in Fig. 1 of the original paper). This directly compensates for the core flaw of "category semantic ambiguity in content queries" in DETR, allowing the model to more effectively perceive and extract foreground information of specific classes from the overlapping features in X-ray images, as shown in Fig. 5a.
2.2 In addition, the paper should also clarify the advantages of DETR-based methods over non-DETR-based methods.
DETR-based methods use a Transformer encoder-decoder architecture, where queries in the decoder serve as prompts, incorporating various priors, such as category semantic priors (our paper) or location and size priors (Conditional DETR, Anchor DETR, and DAB DETR). In contrast, non-DETR methods like YOLO and Faster R-CNN rely only on anchor boxes to provide spatial priors, which is more limited.
DETR-based methods eliminate the need for anchor generation and NMS post-processing, simplifying the workflow and avoiding manual hyperparameter tuning for different datasets.
3 The performance of AO-DETR on PIDray in Table 3 is missing. This result is crucial for evaluating the effectiveness of the proposed method.
Thank you for your suggestion. We have now added the experimental results of AO-DETR on PIDray:
| Method | Backbone | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} | PARAMs | FLOPs |
|---|---|---|---|---|---|---|---|
| AO-DETR | Swin-L | 81.8 | 72.6 | 55.7 | 70.0 | 229.0M | 276.5G |
| C-AO-DETR (ours) | Swin-L | 82.2 | 73.3 | 56.9 | 70.8 (+0.8) | 229.0M | 276.5G |
Our CSPCL improves AO-DETR's overall AP by 0.8.
4 The settings of C-AO-DETR (ours) and the comparison methods are different, which raises the question of whether the performance improvement comes from a better backbone or other configuration choices, rather than from the proposed method itself.
Thank you for your valuable suggestion. Our original intention was to train the best prohibited item detector, which led us to overlook the fairness of the comparison in Tab. 3 of the original paper. For fairness, we train C-AO-DETR (ResNet-101) on the PIDray dataset, and the comparison results are as follows:
| Method | Backbone | Image Size | AP_{easy} | AP_{hard} | AP_{hidden} | AP_{overall} |
|---|---|---|---|---|---|---|
| SDANet | ResNet-101 | 500 × 500 | 71.2 | 64.2 | 49.5 | 61.6 |
| ForkNet | ResNet-101 | 500 × 500 | 75.0 | 66.9 | 58.6 | 66.8 |
| C-AO-DETR (ours) | ResNet-101 | 500 × 500 | 76.1 | 69.2 | 61.5 | 68.9 |
Under the same backbone network and image size, C-AO-DETR outperforms the latest prohibited item detectors SDANet and ForkNet. This result will be added to Tab. 3 in the original paper.
Please note that different types of models are suited for different optimizers and learning rates. For example, CNN-based SDANet and ForkNet use the SGD optimizer, while the DETR-based C-AO-DETR uses AdamW. Additionally, the optimal learning rate, momentum, and weight decay differ for each optimizer. Therefore, we used the default training strategy for each model's configuration. Furthermore, C-AO-DETR achieves the best performance at epoch 14, whereas SDANet was trained for 80 epochs with an early stopping strategy, and ForkNet did not specify its training strategy in the paper, nor is the code open-sourced.
5 The design of inter-class and intra-class losses is not novel, as it is conceptually very similar to approaches like triplet loss and contrastive loss. This limits the overall novelty of the paper.
Thank you for your valuable suggestion. We have conducted a comprehensive comparison and analysis of the three methods, along with experimental validation, and concluded that the CSP loss proposed in this paper is superior.
(1) Triplet Loss:

$$L_{triplet} = \max\big(0,\; d(a, p) - d(a, n) + \alpha\big)$$

where $(a, p)$ composes the positive sample pair, $(a, n)$ composes the negative sample pair, the margin $\alpha$ is a constant, and $d(\cdot,\cdot)$ is the Euclidean distance. Triplet loss only ensures that the distance between $a$ and $n$ exceeds the distance between $a$ and $p$ by the margin $\alpha$.
Weaknesses:
- Inter-class repulsion issue: Since the margin α is constant, it does not incorporate any adaptive mechanism for the similarity between categories, meaning it applies the same repulsion regardless of how similar the categories are (e.g., knife and saw).
- It is only applicable to triplet samples and cannot directly apply to multi-sample multi-class tasks.
(2) Contrastive Loss:

$$L_{contrastive} = (1 - y)\, d^2 + y \cdot \max(0,\; m - d)^2$$

where $y$ is 0 if the sample pair is from the same class and 1 otherwise, $d$ is the Euclidean distance of the sample pair, and $m$ is a margin.
Weaknesses: Contrastive loss pulls positive pairs together and pushes negative pairs apart but uses a fixed margin $m$. This approach does not adapt to the similarities between classes, which is problematic when distinguishing highly similar categories (e.g., knife and saw).
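For reference, the standard forms of the two classical losses discussed above can be written down directly. This is a numpy sketch using the usual anchor/positive/negative notation; the margin value is arbitrary.

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.5):
    # Constant-margin triplet loss: only requires d(a, n) to exceed d(a, p)
    # by the fixed margin, regardless of how similar the two classes are.
    return max(0.0, np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin)

def contrastive_loss(x1, x2, y, margin=0.5):
    # y = 0 for an intra-class pair (attract), y = 1 for an inter-class
    # pair (repel, but only up to the same fixed margin for every class).
    d = np.linalg.norm(x1 - x2)
    return (1 - y) * d**2 + y * max(0.0, margin - d)**2
```

Both losses saturate at their fixed margin, which is exactly the behavior the adaptive repulsion factor in IAR replaces with a similarity-dependent strength.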
(3) CSP Loss:
CSP loss, as shown in Eq. (5-8) of the original paper, beats triplet loss and contrastive loss through two mechanisms:

- The gradient truncation mechanism in ITA ensures that intra-class diversity is preserved once similarity exceeds a threshold, avoiding over-homogenization of content queries within the same class.
- The adaptive repulsion factor in IAR strengthens the repulsion between highly similar categories (like knife and saw) and weakens it for dissimilar categories (like knife and fireworks). This nuanced adjustment ensures that inter-class content queries align with the class prototypes' feature distribution, learning discriminative identifying features for similar categories, thereby improving the model's ability to distinguish between subtly different categories.
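The gradient-truncation idea in the first mechanism can be sketched as a stop condition on the attraction term. This is a schematic numpy illustration with an assumed threshold, not the paper's exact ITA formulation in Eq. (5-8).

```python
import numpy as np

def ita_like_attraction(query, prototype, stop_thresh=0.8):
    # Truncated attraction: once a content query is similar enough to its
    # class prototype, the pull is cut off, so intra-class queries keep the
    # diversity they need instead of collapsing onto a single point.
    sim = query @ prototype / (np.linalg.norm(query) * np.linalg.norm(prototype))
    return 0.0 if sim >= stop_thresh else 1.0 - float(sim)

proto = np.array([1.0, 0.0])
near = np.array([0.95, 0.05])  # already well aligned: no further pull
far = np.array([0.2, 1.0])     # still misaligned: attraction is active

print(ita_like_attraction(near, proto))        # 0.0
print(ita_like_attraction(far, proto) > 0.0)   # True
```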
Experiments: We also tried transforming triplet loss and contrastive loss into forms suitable for multi-sample multi-class contrastive learning. By defining inter-class pairs as negative samples and intra-class pairs as positive samples, we reduced the task to a triplet problem and tested different margin values (0.9, 0.7, 0.5) in a set of experiments on PIXray. The results show that our method outperforms these two methods:
| Method | AP | AP$_{50}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ |
|---|---|---|---|---|---|---|
| DINO | 64.3 | 86.5 | 71.0 | 19.3 | 48.9 | 73.9 |
| DINO+Triplet ($\alpha=0.9$) | 64.4 | 86.5 | 71.5 | 19.2 | 48.7 | 73.9 |
| DINO+Triplet ($\alpha=0.7$) | 64.1 | 86.3 | 70.2 | 19.0 | 48.8 | 73.5 |
| DINO+Triplet ($\alpha=0.5$) | 63.7 | 85.7 | 70.0 | 19.1 | 47.9 | 73.4 |
| DINO+Contrastive ($m=0.9$) | 64.9 | 87.1 | 71.5 | 20.0 | 48.9 | 73.8 |
| DINO+Contrastive ($m=0.7$) | 64.8 | 86.8 | 71.3 | 19.5 | 49.3 | 74.1 |
| DINO+Contrastive ($m=0.5$) | 64.4 | 86.4 | 71.2 | 19.4 | 48.8 | 73.6 |
| DINO + CSP (our) | 67.5 (+3.2) | 88.7 | 74.6 | 23.3 | 52.3 | 76.8 |
I sincerely appreciate the efforts the authors have made to address my concerns. I have read their response as well as the comments from the other reviewers. The authors' response addressed most of my concerns. My final concern is that, although the authors have provided a very detailed explanation of the novelty and supporting experimental evidence, I still believe the novelty of the CSP loss is incremental, as this idea has been used frequently. Overall, considering the authors' substantial efforts during the rebuttal period, I would consider increasing my score for the paper in the Reviewer-AC round.
Dear reviewer, we appreciated your efforts when reviewing this paper. With only a couple of days remaining for discussion, could you please kindly provide your responses and further concerns (if any) to the authors' rebuttal? Thanks!
Dear reviewer BMUt,
Thank you again for your thoughtful review and constructive suggestions toward our work! We have tried our best to address all your questions and concerns, and have provided some theoretical analysis and experimental verification.
Your suggestions have a very positive impact on the quality of our paper. We sincerely appreciate your valuable opinions and your hard work! As the discussion period is drawing to a close, we are keen on getting your feedback and approval!
If you have any concerns, please feel free to contact us, so we can further improve the quality of the paper and contribute to the community! Thank you for your valuable insights and selfless efforts!
Best regards,
The Authors
Dear reviewer, as the discussion period is approaching its end, we would appreciate it if you could post feedback at this point, especially where the authors have asked for such discussions. For comments that arrive late, the authors may have insufficient time to address them in depth.
Dear AC and Reviewer BMUt,
We are truly grateful for Reviewer BMUt's willingness, during the Reviewer–AC round, to devote time to carefully discuss and analyze the value and significance of the CSP loss in our paper. We sincerely appreciate the hard work of AC and Reviewer; your rigorous academic ethos has deeply impressed and inspired us.
To ensure you do not miss key information raised by other reviewers in the CSP loss discussions during the Reviewer-AC round, we would like to draw your attention to our responses to BMUt (Q5), eTaq (Q2), and GPtN (Q1), as well as the three discussion sections with reviewer GPtN, titled "Discussion on the Innovation of the Paper 1/2/3." Through these responses and discussions, we recognize that CSP loss has great potential value in this special multi-class, multi-sample task, and that our earlier description of its innovations may not have been sufficiently clear and thorough.
Best Regards,
The Authors.
Dear Reviewer BMUt,
We greatly appreciate your willingness to carefully read all the responses from all reviewers. Thank you for your effort and for recognizing the value of our work.
Due to time constraints, we can only briefly summarize the strengths and innovations of our CSP loss here:
Our CSP loss, through the intra-class gradient truncation and inter-class adaptive repulsion mechanisms, ensures that intra-class content queries do not become homogenized, thus preserving the necessary feature diversity within the class. At the same time, it ensures that inter-class content queries align with the feature distribution of the class prototypes, learning discriminative identifying features for similar categories.
Through extensive theoretical analysis and experimental validation, we have demonstrated that CSP loss effectively aids the CSPCL mechanism in clarifying the class semantic information of queries in Deformable DETR-based models for the task of object detection in X-ray images. It improves the model's ability to extract and perceive anti-overlapping object features.
Best Regards,
The Authors.
Reviewers confirmed the novelty, thorough experiments, and clear writing of this paper. After the rebuttal and discussion, most of the concerns raised by reviewers were resolved by the authors, and all ratings are positive. I therefore recommend accepting this submission.