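Average rating: 5.5/10

Poster, 4 reviewers

Lowest rating: 5, highest: 6, standard deviation: 0.5

Average confidence: 4.0

Correctness: 2.5, Contribution: 2.5, Presentation: 2.5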

NeurIPS 2024

DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

Jia Syuen Lim, Zhuoxiao Chen, Zhi Chen, Mahsa Baktashmotlagh, Xin Yu, Zi Huang, Yadan Luo

Submitted: 2024-05-14, updated: 2025-01-16

Keywords

class-agnostic object detection, VLM, prompting, out-of-distribution object detection

Reviews and Discussion

Review

Rating: 6, Confidence: 4, 2024-07-09

The paper proposed a prompt expansion method to produce a diverse set of text queries for class-agnostic object detection. Sequentially performing inference using each text query and collating the prediction results often achieves high recall but incurs significant computational cost. Merging all text queries into one prompt reduces the cost significantly but results in much lower performance. The paper argues that this is due to the semantic overlap amongst the queries. To address this, the paper starts with a learnable parent prompt. After training the prompt embeddings with the standard object detection losses, the parent prompt is expanded into numerous child prompts by applying random rotations. The child prompts are then further learned with additional loss terms to decrease the similarity amongst the child prompt embeddings. This was shown to increase the maximum angular coverage, measured by the largest angle between the embeddings of two child prompts. Experiments demonstrate significant improvement of average recall and average precision on the MS COCO and LVIS datasets.

Strengths

The paper aims to tackle the task of class-agnostic object detection, which is essential for various applications such as OOD detection. The task moves away from the classic closed-vocabulary object detection problem and has significant practical value, as the real world is much less constrained.
The paper proposed an interesting approach to grow the set of text queries that are used to prompt a grounding model. The approach is geometrically motivated and is rather intuitive.
The paper provided extensive experimental results on benchmark datasets like MS-COCO and LVIS. The improvements in average recall (AR) and average precision (AP) over existing methods demonstrate the great potential of the method.

Weaknesses

The semantic overlap amongst words, as the motivation behind the paper, seems more like a hypothesis and was not sufficiently investigated. Lines 53-55 of the paper pointed out that merging all queries into one text prompt results in inferior performance compared to running inference on each query sequentially and combining the predictions. The paper then directly jumped to the conclusion that this was the result of "semantic overlap" between the queries, without any investigation or reference to prior investigations. This undermines the fluidity of the paper. Even though the proposed method does improve the performance, it may not be solving the problem the paper is claiming to solve. The hypothesis should not be too hard to test. Semantically similar and dissimilar queries can be manually selected to compare their detection performance.
As I understand, the motivation behind merging queries as opposed to just merging predictions is to lower the computational cost. The paper would be more convincing if there were an inference cost comparison between the proposed method, the naive query-merging method and the prediction-merging method.
Some technical details of the paper were not stated very clearly and were somewhat hard to follow. There are also numerous typos and inconsistent use of notations. For instance, in section 3.1, subscripts of v denote the text embeddings of different words in the same prompt, but in section 3.2 and other subsequent sections, the subscripts seem to denote different text prompts.

Questions

In Figure 3 (right), the average precision of the proposed method seems to drop first before increasing again. Is this due to random noise? Overall, it appears that increasing the number of prompts to 9 yields significant improvements but the trend is not clear before it.
In line 269, the paper claims that a higher MAC correlates with a broader spectrum of vocabularies. Aside from intuitions, is the range of vocabularies measured in some way? How did the paper come to this conclusion?
How exactly is the prompt logit activation computed?

Limitations

The authors discussed some technical limitations regarding the need of self-supervised learning and hyper-parameter tuning, both of which contribute to additional computational cost during training.

Author Response

2024-08-07

We thank the reviewer for the positive comment and constructive feedback! Please find our detailed response below:

R4.1 Clarification on "semantic overlap".

We appreciate the reviewer’s detailed feedback and understand that the section on “semantic overlaps” may seem disjointed. We conducted a pilot study using UNIVERSAL-query and CLASS-WIDE query sourced from ChatGPT and WordNet, respectively, to validate the efficacy of GroundingDINO in recognizing semantically similar and dissimilar words.

These results, shown in Table 1, support our hypothesis that semantic overlap negatively impacts detection performance. Additionally, a case study provided in Appendix A.2 reveals diminished confidence in GroundingDINO when presented with semantically overlapping queries, as exemplified by the contrast between "plates . cups ." (semantically different) and "plates . dishes ." (semantically similar), as illustrated in Figure 7.

Due to page limits, we regret any confusion caused by the placement of this crucial information in the supplementary materials, which might have been overlooked. To improve the clarity, we will incorporate this case study into the main body of the revised version.

R4.2 Inference cost comparison

Great suggestion! We have provided an additional inference cost comparison between our proposed method and two handcrafted baselines. Our results, detailed in Table 2, demonstrate that while the naive query-merging method reduces the computational cost, it significantly compromises detection accuracy due to overlapping semantics. In contrast, our proposed method strikes a balance, achieving superior detection performance. Specifically, our DiPEx approach is 6.69% slower than the naive query-merging method, but it shows an impressive 89.34% reduction in inference time compared to the prediction-merging method.

R4.3 Technical details clarification

Thank you for your feedback! We apologize for the confusion regarding the notations. To clarify:

In Section 3.1, the contextual embeddings $\mathbf{v}$ are denoted as $\mathbf{v} =\lbrace\mathbf{v}_i\rbrace ^{M} _{i=1} \in\mathbb{R}^{M\times d}$ , where $M$ represents the number of $d$ -dimensional learnable tokens appended to a text query $\mathbf{c}$ . This follows the conventional way of introducing learnable context vectors [A].
In Section 3.2, we introduce a novel prompt expansion method. Here, the prompt hierarchy is denoted as $\mathbf{v}_{l,k}$ . The subscript $l$ indicates the layer in the tree or training round, while $k$ represents the number of learnable tokens in the $l$ -th layer, similar to the notation in Section 3.1.

We hope this clarification clears up any confusion regarding the notations.

[A] Kaiyang Zhou et al. Learning to Prompt for Vision-Language Models, in IJCV.

R4.4 Concerns on Potential Random Noise

Thanks! The COCOEval metric is particularly strict in evaluating detection performance, as it only matches detections with the highest Intersection over Union (IoU), which can penalize minor localization errors. Additionally, the MS-COCO dataset is not comprehensively annotated, which can even lower the average precision (AP) as correct predictions for unannotated instances are mistakenly counted as false positives. At the very initial training stage (l=1), the learnable tokens can be highly uncertain. Optimizing a small number of prompts can lead to instability, given the insufficient parameters to capture the diverse pseudo-labels. Selecting only one parent prompt with high uncertainty for expansion in the first iteration may not provide enough stability. However, as the number of prompts progressively increases, the model becomes more mature, and we observe a steady improvement in performance. This trend illustrates how increasing the number of prompts helps manage uncertainties and improves overall performance, thereby validating our approach.

R4.5 In line 269, the paper claims that a higher MAC correlates with a broader spectrum of vocabularies. Aside from intuitions, is the range of vocabularies measured in some way? How did paper come to this conclusion?

We appreciate the reviewer's thoughtful question! The concept of MAC (Maximum Angular Coverage) is inspired by the WordNet hierarchy, where a larger MAC indicates the discovery of more high-level semantics. While MAC serves as an approximation rather than a direct measure of vocabulary range, it provides a practical criterion for determining when to stop the training process. The main motivation behind MAC is to capture the breadth of concept coverage without the need for direct interpretation of the learned tokens, which is inherently complex. Although MAC does not explicitly measure the vocabulary range, its correlation with high-level semantic discovery offers a useful approximation for our purposes. We hope this clarifies our rationale.

R4.6 How exactly is the prompt logit activation computed?

Thanks for raising this question. As stated in line 285, prompt logit activation refers to the number of activated prompts based on the confidence threshold. Figure 5 illustrates the activation frequency (in log scale) of each expanded child prompt, providing a clear visualization of how different prompts contribute to the detection process.
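One way to read this description, as a minimal sketch only: given per-query confidence scores over the prompts, count for each prompt how many queries exceed the confidence threshold, then take a log for plotting. The score layout (one row per detection query, one column per prompt), the function name `prompt_activation_counts`, the threshold value, and the example scores are all assumptions for illustration and are not taken from the paper.

```python
import math

def prompt_activation_counts(scores, threshold=0.3):
    """Count, per prompt, how many detection queries score above the threshold.

    scores: list of rows, one row per query, one column per prompt.
    The layout and the default threshold are illustrative assumptions.
    """
    num_prompts = len(scores[0])
    return [
        sum(1 for row in scores if row[j] > threshold)
        for j in range(num_prompts)
    ]

# Hypothetical confidences for 3 queries over 4 prompts.
scores = [
    [0.9, 0.1, 0.4, 0.0],
    [0.2, 0.1, 0.5, 0.0],
    [0.7, 0.2, 0.6, 0.1],
]
counts = prompt_activation_counts(scores, threshold=0.3)

# log1p keeps prompts that were never activated at 0 on a log-scale plot.
log_counts = [math.log1p(c) for c in counts]
```

Accumulating such counts over a dataset would give a per-prompt activation frequency of the kind a log-scale bar plot displays.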

Thanks again for reviewing our paper! We are more than willing to have a follow-up discussion with you if you still have any further concerns!

Comment - Follow-Up on Submission Responses

2024-08-10

Thank you once again for your thoughtful feedback on our submission. As we get closer to the end of the discussion period on August 13th, we wanted to check whether there are any additional questions or comments regarding our responses that you would like to discuss further.

Apologies for reaching out over the weekend—we know it’s not the most ideal time. However, your feedback is very important to us, and we aim to address any outstanding concerns before next Wednesday. Thanks.

Authors.

2024-08-12

Thank you for the clarification. My concerns are mostly addressed. I'm leaning towards accepting the paper provided that the authors revise the paper to include the clarifications in the response.

Comment - Thank You for Reading and Consideration

2024-08-13

Dear Reviewer fTsG,

Thank you for your positive response and for raising your score! We are glad that our clarifications addressed your concerns and will ensure they are included in the final version.

Best,

Authors.

Review

Rating: 5, Confidence: 4, 2024-07-11

This paper proposes a novel Dispersing Prompt Expansion (DiPEx) approach to enhance class-agnostic object detection (OD) using vision-language models (VLMs). The authors observe that manually crafted text queries often result in undetected objects due to semantic overlap, and address this by progressively learning a set of distinct, non-overlapping hyperspherical prompts. DiPEx starts with a generic parent prompt, selects the one with the highest semantic uncertainty for further expansion, and generates child prompts that inherit semantics from the parent while capturing more fine-grained details. Dispersion losses are employed to maintain high inter-class discrepancy among child prompts while preserving parent-child consistency. The method utilizes the maximum angular coverage (MAC) of the semantic space as a criterion for early termination to prevent excessive prompt growth. Experiments on MS-COCO and LVIS datasets demonstrate that DiPEx outperforms other prompting methods by up to 20.1% in average recall (AR) and achieves a 21.3% AP improvement over SAM for out-of-distribution OD.

Strengths

The proposed DiPEx approach seems novel and innovative to the class-agnostic object detection.
The paper is well-written and structured, with clear explanations of the concepts, techniques, and experimental setup.
The proposed DiPEx method has the potential to significantly advance the state-of-the-art in class-agnostic OD and out-of-distribution detection.
Some important works about class-agnostic learning could be appended: [1] In ECCV 2022. Pose for everything: Towards category-agnostic pose estimation. [2] In CVPR 2023. Matching is not enough: A two-stage framework for category-agnostic pose estimation. [3] In CVPR 2024. Meta-Point Learning and Refining for Category-Agnostic Pose Estimation.

Weaknesses

The Abstract is not abstract enough.
The evaluation should be described in detail. One intuitive way is to cast all categories into a single category, but not all class-agnostic objects in the dataset are annotated. How are these unannotated objects handled in the evaluation?
How about the efficiency of the proposed model in training/inference?

Questions

See Weaknesses*

Limitations

See Weaknesses*

Author Response

2024-08-07

We thank the reviewer for the constructive comments and suggestions, which we address below:

R3.1 Discussion on more related works

Thanks for bringing the literature to our attention!

[1] proposed POMNet, which leverages a transformer-based Keypoint Interaction Module (KIM) to capture interactions between keypoints and the relationship between support and query images. [2] introduced CapeFormer, a two-stage framework where keypoints are matched and treated as similarity-aware position proposals in the first stage, addressing noisy matching results due to the open-set nature of the problem. [3] introduces a meta-learning approach combined with iterative point refinement techniques for class-agnostic pose estimation.

Please note these methods are specifically designed for pose estimation, which involves predicting key-points on objects. Object detection, on the other hand, focuses on identifying and localizing entire objects within an image. The granularity and nature of the tasks are different, making direct application challenging. Though we find these methods relatively out of scope to our current work, we acknowledge the importance of this area. We will add a detailed discussion accordingly in our revised version to broaden its relevance.

R3.2 The Abstract is not abstract enough.

Thanks for your constructive comment! Our intent was to provide a comprehensive summary of our work, outlining the goals and motivation clearly. While we believe it accurately reflects our research, we will endeavor to make it more concise.

R3.3 Clarification on evaluation.

Thanks for your suggestion! We confirm that our evaluation approach involves casting all existing categories from MS-COCO and LVIS into a single class, and we will provide a detailed explanation in our revised manuscript. As highlighted in our introduction, the primary goal of our paper is to enhance Average Recall (AR), which measures how comprehensively objects are captured. Through visual inspections of ground truth (GT) against our results, we observed that COCO and LVIS are not densely annotated (see Figures 9-10 in the manuscript for more visualization). This is the primary reason our AP appears low; correct predictions for unannotated instances are mistakenly counted as FP in COCOEval. To address this, we plan to develop a fairer and more comprehensive benchmark in the future.
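The single-class casting described above can be sketched as follows. This is a minimal illustration on a COCO-style annotation dictionary (the `categories`, `annotations`, and `category_id` keys follow the COCO annotation format); the function name `to_class_agnostic` and the merged category name "object" are assumptions, not the authors' evaluation code.

```python
import copy

def to_class_agnostic(coco_dict, agnostic_id=1, agnostic_name="object"):
    """Return a copy of a COCO-style dict with every category merged into one.

    The input is left unmodified. The merged id and name are illustrative.
    """
    merged = copy.deepcopy(coco_dict)
    merged["categories"] = [{"id": agnostic_id, "name": agnostic_name}]
    for ann in merged["annotations"]:
        ann["category_id"] = agnostic_id
    return merged

# Tiny hypothetical ground-truth file with two categories.
gt = {
    "images": [{"id": 1}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
        {"id": 11, "image_id": 1, "category_id": 18, "bbox": [2, 2, 4, 4]},
    ],
}
agnostic_gt = to_class_agnostic(gt)
```

Note that this only relabels existing annotations; objects that were never annotated remain absent from the ground truth, which is the AP issue discussed above.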

R3.4 How about the efficiency of the proposed model in training/inference?

Thanks for raising this important question! Regarding the efficiency of our model, we have noted this as a limitation in the conclusion section of our paper. We have provided an additional inference cost comparison between our proposed method and two handcrafted baselines in the attached PDF. Our results, detailed in Table 2, demonstrate that while the naive query-merging method reduces the computational cost, it significantly compromises detection accuracy due to overlapping semantics. In contrast, our proposed method strikes a balance, achieving superior detection performance. Specifically, our DiPEx approach is 6.69% slower than the naive query-merging method, but it shows an impressive 89.34% reduction in inference time compared to the prediction-merging method. We will incorporate the efficiency analysis into the main body of the revised version!

Thanks again for reviewing our paper! We are more than willing to have a follow-up discussion with you if you still have any further concerns!

Comment - Follow-Up on Submission Responses

2024-08-10

Authors.

Comment - Response to the author

2024-08-13

Thanks for the author's response which addresses most of my concerns.

Comment - Thank You for Reading and Consideration

2024-08-14

Dear UZCf,

Thank you for your positive response and for raising your score! We are glad that our clarifications addressed your concerns.

Best,

Authors

Review

Rating: 6, Confidence: 4, 2024-07-12

This work identifies that "semantic overlaps" may contribute to the diminished class-agnostic object detection performance of previous works utilizing VLMs, which is evidenced by the pre-experiments on hand-crafted text queries on the MS COCO dataset. Furthermore, the authors derive a self-supervised prompt learning strategy to iteratively expand the (soft) prompt set in a tree hierarchy to ensure the diversity and coverage of class-agnostic object textual descriptions. A maximum angular coverage metric is provided for expansion determination to balance the prompt diversity and number cost. Experiments on MS COCO and LVIS datasets have proved the method's effectiveness.

Strengths

The research topic is interesting and valuable. Class-agnostic may be one of the cornerstones for large foundation vision models.
The proposed prompt set expansion method is novel. The experiments have proved its efficacy.
The paper is well-presented and easy-to-follow.

Weaknesses

My major concern lies in the authors' claim that the "semantic overlap" among input words in the prompt degrades object detection by visual-language matching. In Table 1, after the query-merging operation, the encoded text features are derived through the complicated self-attention mechanism across the input words (embeddings). Therefore, I prefer to attribute the detection degradation to the disturbed attention compared to one input word in a single inference pass, rather than the so-called semantic overlap. In other words, Section 2 and Appendix 2 failed to relate semantic overlap with empirical results for me.
The overall framework is reasonable, but there are still some points to clarify:

For the children prompt initialization, it may not be sufficient to choose only one parent prompt with high uncertainty to expand, especially for the first iteration when all of the root prompts are highly abstract and therefore uncertain. In addition, Eq(2) may also be applied between the children prompts and all of the prompts in previous generations, rather than only the chosen parent prompt.
For the MAC metric for expansion determination, it is weird to use the maximum angle to evaluate the diversity/coverage of ALL learned prompts. How about using the mean angle or KNN metrics?
Adding dispersion loss as well as the MAC metric at the input prompt embeddings is a little bit confusing, where the constraints may change after the encoding process. I wonder what will happen when adding the constraints to the encoded text features.
How to ensure the quality of pseudo labels for class-agnostic object detection?

The comparison with SAM on LVIS is unfair, as DiPEx is fine-tuned on the training set (even without box annotations) while SAM is a pre-trained model. Besides, is there some comparability between DiPEx and UniDetector[1]?

[1] Z Wang et al. "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR 2023.

Questions

The detailed operations in Table 1 should be specified, e.g., does prediction-merging mean NMS? In addition, it would be better to report prediction-merging at the top line and query-merging at the bottom, which helps to identify the performance decline.
There seems to be some misuse of notations. For instance, is the definition of $P$ consistent between line 134 and line 137?

Limitations

The application of current work is limited by the training data sources, which only evaluate MS COCO and LVIS separately. Joint training with larger data sources (e.g., Objects365) will improve the application value.

Author Response

2024-08-07

We thank the reviewer for the positive comments and constructive feedback.

R2.1 I prefer to attribute the detection degradation to the disturbed attention, rather than the so-called semantic overlap.

We appreciate the insightful feedback. To clarify our "semantic overlap" hypothesis, we refer to the example in Appendix A.2. Our empirical study on "semantic overlap" is illustrated in Figure 7, where detection confidences are maintained with less similar words ("plates . cup ."), but performance drops with similar words ("plates . dishes ."). We conjecture that this is highly attributable to the dataset that was used during pre-training - there is a low chance of an image containing similar words [C], which may explain the reason why Grounding DINO was structured to favor specific categories over generic ones. Owing to this observation, we gained inspiration to hierarchically discover more fine-grained concepts from higher-level semantics.

[C] Shuai Shao et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection, in ICCV.

R2.2 Clarification & Additional Experiments

Great suggestion! We have run an additional experiment based on your suggestion, where we add additional prompts at the initialization stage. Please refer to Table 1 in the attached PDF for your reference. By increasing the number of initial prompts from $k=1$ to $k=5$ , we observe an improvement in the detection performance.

R2.3 Why MAC metric for expansion determination? How about using the mean angular or KNN metrics?

We appreciate the reviewer's suggestion regarding the MAC metric for expansion determination. Our choice of maximum angular coverage (MAC) is inspired by the observation that semantically dissimilar words have larger angular discrepancies, as hypothesized in our case study in Appendix A.2. We aim to capture semantics as comprehensively as possible to improve recall. The angular discrepancy between prompts is enforced by the dispersion loss, with which we intend to push child pairs further apart, and the maximum angular deviation represents the comprehensiveness of the vocabulary. While using mean angular metrics is plausible, the mean tends to reflect fine-grained concepts such as "apple" & "oranges" and can be easily dominated by the abundance of fine-grained words. This contrasts with our goal of promoting diversity among learned prompts from a global perspective. KNN, on the other hand, may indicate how detailed a particular pair is, and the results can be very restricted to a particular concept. Due to limited time during the rebuttal, we will add comparisons in the revised version.
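To make the difference between the two candidate criteria concrete, here is a minimal sketch that computes both the maximum and the mean pairwise angle over a set of prompt embeddings. It assumes MAC is the largest angle between any two prompt embeddings, as the first review summarizes it; the helper names and the toy 2-D vectors are illustrative, and the paper's exact definition may differ.

```python
import math
from itertools import combinations

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

def pairwise_angle_stats(prompts):
    """Return (max, mean) of the angles over all unordered prompt pairs."""
    angles = [angle_deg(u, v) for u, v in combinations(prompts, 2)]
    return max(angles), sum(angles) / len(angles)

# Toy 2-D "embeddings": two orthogonal prompts plus one between them.
prompts = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
max_angle, mean_angle = pairwise_angle_stats(prompts)
```

In this toy case, adding a prompt that lies between two existing ones lowers the mean angle but leaves the maximum unchanged, which illustrates why the two statistics can diverge as fine-grained prompts accumulate.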

R2.4 I wonder what will happen when adding the constraints to the encoded text features.

Thank you for your feedback! As we are unsure about your concern, we interpret your questions as "how dispersion loss can affect the encoded text features". Firstly, we would like to clarify that MAC is the criterion for early stopping and it is not directly used in our optimization process. Dispersion loss, on the other hand, is our training objective which ensures that child prompts are sufficiently distinct from each other while not straying too far from their parent prompts. This combination ensures that our approach effectively captures diverse and unique concepts while maintaining coherence with the broader parent prompts. The learnable prompts are directly appended to the encoded text. The encoded textual query (e.g., "generic") is only used as a guide to acquiring pseudo-labels and it is not optimized as part of the learning objective.

R2.5 How to ensure the quality of pseudo labels for class-agnostic object detection?

Thanks! To re-iterate, we iteratively refine our pseudo-labels to ensure their quality during the training process. We start with predictions from an off-the-shelf GroundingDINO model, removing low-confidence detections and excessively large bounding boxes. At each self-training stage, we update the pseudo-labels by performing inference using the learned prompts. To avoid duplication, we eliminate boxes with an IoU greater than 0.5 compared to the previous "ground-truth" boxes and apply SoftNMS.
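The de-duplication step in this update can be sketched as follows, under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, a newly predicted box is dropped when its IoU with any previous pseudo-label exceeds 0.5, and the surviving boxes are appended to the previous set. The function names are hypothetical, and the confidence filtering, large-box removal, and SoftNMS steps mentioned above are deliberately omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def update_pseudo_labels(previous, candidates, iou_thresh=0.5):
    """Keep previous boxes and add candidates that do not duplicate them."""
    updated = list(previous)
    for box in candidates:
        # A candidate overlapping a previous box above the threshold is
        # treated as a duplicate and skipped.
        if all(iou(box, old) <= iou_thresh for old in previous):
            updated.append(box)
    return updated

# Hypothetical boxes: the first candidate duplicates the previous box,
# the second one is a newly discovered object.
previous = [(0, 0, 10, 10)]
candidates = [(1, 1, 10, 10), (20, 20, 30, 30)]
updated = update_pseudo_labels(previous, candidates)
```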

R2.6 Comparability with UniDetector

Thanks for the great suggestion! It is important to clarify that UniDetector is an Open-Vocabulary detector (similar to what GroundingDINO was originally designed for), which differs from our class-agnostic settings. UniDetector requires comprehensive class labels for training, whereas our approach only relies on pseudo-box supervision. Notably, UniDetector also proposes a class-agnostic detector (CLN), which combines both the RPN and RoI head to generate proposals for universal object detection. This highlights the significance of class-agnostic object detection in Open-World scenarios, aligning with our focus on detecting every object in the scene without relying on predefined class labels.

Q2.1 Detailed operations in Table 1

Thank you for your suggestion! As described in lines 52-57, query-merging involves concatenating all queries into a single string (e.g., "objects . generic . entities ."), whereas prediction-merging entails combining the results from separate inferences using individual text prompts (e.g., "objects", "generic", "entities"). We apologize for any confusion and will clarify this distinction in the revised manuscript.
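The two strategies can be contrasted with a short sketch. Here `detect(image, prompt)` stands in for a single forward pass of a grounding detector; it is a hypothetical callable, not the GroundingDINO API, and `fake_detect` below is a stub that only records which prompts were sent. How the pooled predictions are post-processed is not specified in the response, so the sketch simply concatenates them.

```python
def query_merging(detect, image, queries):
    """One forward pass with all queries concatenated into a single prompt."""
    prompt = " . ".join(queries) + " ."
    return detect(image, prompt)

def prediction_merging(detect, image, queries):
    """One forward pass per query; the predictions are pooled afterwards."""
    pooled = []
    for query in queries:
        pooled.extend(detect(image, query))
    return pooled

# Stub detector (assumption): records each prompt and returns one dummy result.
calls = []

def fake_detect(image, prompt):
    calls.append(prompt)
    return [{"prompt": prompt}]

queries = ["objects", "generic", "entities"]
merged_out = query_merging(fake_detect, None, queries)
num_passes_query_merging = len(calls)

calls.clear()
pooled_out = prediction_merging(fake_detect, None, queries)
num_passes_prediction_merging = len(calls)
```

The sketch makes the cost asymmetry explicit: query-merging issues one detector call regardless of the number of queries, while prediction-merging issues one call per query.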

Q2.2 Misuse of Notations

We appreciate your feedback! In this paper, we have used the terms "prompt" and "word" interchangeably, which may have caused some confusion. The notation in lines 134 and 137 is accurate; however, we will revise the manuscript to ensure that $P$ consistently refers to "prompt embeddings" for clarity.

Thanks again for reviewing our paper! We are more than willing to have a follow-up discussion with you if you still have any further concerns!

Comment - Follow-Up on Submission Responses

2024-08-10

Authors.

2024-08-11

I appreciate the author's responses. I have raised my score to 6. Good luck.

Comment - Thank You for Reading and Consideration

2024-08-13

Dear Reviewer 86g8,

Thank you for your positive response and for raising your score! We are glad that our clarifications addressed your concerns.

Best,

Authors

Review

Rating: 5, Confidence: 4, 2024-07-13

This study investigates the use of visual-language models to improve class-agnostic object detection through a self-supervised prompt learning strategy. Dispersing Prompt Expansion (DiPEx) is proposed to enhance downstream task performance by learning to expand a set of diverse, non-overlapping prompts that boost recall rates of object detection.

Strengths

The authors find a new setting, class-agnostic object detection, which is practical and universal in real-world scenarios.
The proposed method achieves performance improvements compared to its baselines.
The figures are rich and vivid, and the writing is good.

Weaknesses

I'm mainly concerned about the novelty. The primary contribution of this paper lies in the prompt expansion method, which uses contrastive loss for optimization to acquire diversified prompts. However, this kind of optimization-based prompt diversification is not new art [1,2], especially as the previous work [1] presents a rather similar concept of optimizing prompts on the hypersphere.

[1] Promptstyler: Prompt-driven style generation for source-free domain generalization. ICCV, 2023.

[2] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification. ACM MM, 2023.
The proposed prompt expansion method does not seem essentially tied to the object detection task, since it may also be suitable for other vision tasks. I think it may not be proper to sell class-agnostic object detection as a major contribution of this paper.
As the authors claim that this paper aims to solve a class-agnostic task, I am highly curious about how the pseudo labels are obtained during training. Do the pseudo labels get updated during prompt-based optimization? If so, how is this done?
What does the learned prompt resemble? More visualization results, such as the attention visualization, of the learned prompt are expected.
Some presentations leave me confused. For example, in lines 117-118, the authors state that "applying query-merging to UNIVERSAL words results in a 52.46% reduction in AR compared to prediction-merging, whereas CLASS-WIDE queries (e.g., from WordNet) achieve a smaller decrease in AR of only 23.64%". However, I fail to understand how the 52.46% and 23.64% reductions are computed from the results in Table 1.

Questions

Please refer to the weakness.

Limitations

The authors have stated the limitations in the conclusion part, and I hope they can do some further validations to explore the effectiveness of the proposed method on open-vocabulary and open-world detection.

Author Response

2024-08-07

We thank the reviewer for the constructive comments and suggestions, which we address below:

R1.1 Novelty and comparisons with [1,2]

Contributions. We appreciate your feedback and would like to clarify the major contributions of our work:

(1) General Impact: Our work presents an early analysis of the bottleneck of current class-generalized detection tasks such as OOD detection. Locating every possible object remains a fundamental challenge that must be addressed before AP for classes of interest can be improved (lines 28-32), yet it has rarely been studied. Further, prior OOD works are incomparable due to inconsistent evaluation metrics (e.g., open-set benchmarks use FPR95, and unsupervised class discovery uses CorLoc). DiPEx is the first work to comprehensively benchmark existing OOD detection (open-set/open-world detection) and unsupervised object discovery against the class-agnostic OD setting, providing valuable guidance for future research.

(2) Technical Contributions: We would like to point out that, instead of using an off-the-shelf contrastive loss ([1] & [2]), our entire design is motivated by an angular perspective: it discovers unrevealed semantics by pushing child prompts away from one another while maintaining coherence with the parent prompts. The rationale is based on our observation in the pilot study (Appendix A.2 & Figure 7), where two text tokens that are semantically dissimilar exhibit large angular distances and lead to more confident box predictions. This motivates us to hierarchically decompose high-level semantics into finer-grained ones, aiming to uncover more semantics.

Compared to SOTA. We appreciate the reviewer for highlighting [1, 2]. Key differences include:

(1) Task. Both [1] and [2] focus on domain adaptation for style diversification with classes available, or in other words, diversifying the suffixes prior to the class token. Additionally, both methods are constrained to a fixed number of style prompts. On the contrary, our design aims to disentangle arbitrary numbers of fine-grained classes, followed by a MAC early stopping strategy to prevent the excessive growth of class prompts. Notably, we are the first one to introduce progressive prompt expansion, proving its effectiveness. Our task, without class vocabulary available, is much more challenging.

(2) Objectives. Both [1] & [2] use an off-the-shelf contrastive loss, which significantly differs from our dispersion loss. Contrastive loss separates inter-class samples; this contradicts our motivation of revealing classes in a tree-hierarchy fashion. Additionally, the orthogonal constraint in [1] is unsuitable; recall our ablation study (see Appendix A.2), in which semantically similar words can have a small angular discrepancy (e.g., "plates" & "dishes" have an angular distance $\theta$ of 53.73 $^\circ$ ). This means that the orthogonal ( $\theta=90^\circ$ ) constraint is too strict: it would push prompts too far apart and distort the underlying representation. Therefore, we have opted to use dispersion loss as a softer constraint to enforce separability between child prompts while maintaining child-parent coherence.

R1.2 The proposed method may also be suitable to other vision tasks.

We appreciate the reviewer's feedback. Adapting existing prompting methods (e.g., CoOp), designed primarily for classification tasks, to detection yields suboptimal performance (Table 2, 3), especially with a large number of classes (Figure 3). This is because Grounding DINO uses class and query confidence to find top-k proposals, and if class prompts are not accurately learned, many correct boxes will be missed due to low confidence. Our hierarchical expansion allows us to find accurate fine-grained semantics, specifically designed to help detection-based foundation models identify more objects and improve recall. Furthermore, detection serves as a basis for downstream applications, such as multi-object tracking, which we plan to explore in future work.

R1.3 Does the pseudo labels get updated during prompt-based optimization? If updated, how to do this?

Yes, our pseudo-labels are iteratively refined at each stage of the training process. (1) We start with inference using an off-the-shelf GroundingDINO model with the prompt "generic" and preprocess the predictions by removing low-confidence detections and excessively large bounding boxes. (2) At each self-training stage, we perform inference again with the expanded prompts to obtain updated pseudo-labels. (3) For quality control, we remove boxes that have an IoU greater than 0.5 with the previous "ground-truth" boxes and apply SoftNMS. This ensures continual improvement of box accuracy and encourages the model to discover emerging concepts.
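The IoU-based quality-control step in (3) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the helper names are hypothetical, and the SoftNMS step is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_new_boxes(new_boxes, previous_boxes, iou_threshold=0.5):
    """Drop new predictions that overlap a previous pseudo-label by more than the threshold."""
    return [
        box for box in new_boxes
        if all(iou(box, prev) <= iou_threshold for prev in previous_boxes)
    ]


# Toy boxes for illustration only.
previous = [(0, 0, 10, 10)]
new = [(0, 0, 10, 9), (20, 20, 30, 30), (5, 5, 15, 15)]
print(keep_new_boxes(new, previous))
```

In this toy case the first new box overlaps the existing pseudo-label heavily and is dropped, while the other two survive as candidate new objects.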

R1.4 What does the learned prompt resemble?

Great question! The interpretability of context tokens remains an open question [A, B]. Prior work [A] has attempted to interpret the learned vectors by finding their nearest-neighbor (NN) words among pre-trained embeddings. However, since text features operate within a continuous embedding space, they likely carry more abstract meanings that are not readily interpretable. Regardless, we have provided a visualization of the attention features from each distinct prompt in the attached PDF for your reference.

[A] Kaiyang Zhou et al. Learning to Prompt for Vision-Language Models, in IJCV.

[B] Brian Lester et al. The Power of Scale for Parameter-Efficient Prompt Tuning, in EMNLP.
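The nearest-neighbor interpretation mentioned in the answer to R1.4 can be illustrated with a small sketch. It assumes a hypothetical vocabulary mapping words to embedding vectors and ranks words by cosine similarity to a learned prompt vector; the toy 2-D vectors and the function names are placeholders, not values or APIs from any real model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(prompt_vec, vocab, k=1):
    """Return the k vocabulary words whose embeddings are most similar to prompt_vec."""
    ranked = sorted(vocab, key=lambda w: cosine(prompt_vec, vocab[w]), reverse=True)
    return ranked[:k]


# Hypothetical toy vocabulary; real token embeddings are high-dimensional.
vocab = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(nearest_words([0.9, 0.1], vocab, k=2))
```

As the response notes, the closest word is only a rough proxy, since a learned vector need not coincide with any single token.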

R1.5 However, I fail to understand how the 52.46% and 23.64% reductions are computed by the results in Table 1.

Thanks! The percentage reduction was calculated using (query_merged-prediction_merged)/query_merged. We will provide more clarification in the revised version.
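As a hedged illustration of the stated formula, the sketch below computes a percentage reduction from two scalar values. The argument names mirror the formula in the response; the numbers are arbitrary placeholders and are not taken from Table 1.

```python
def percent_reduction(query_merged, prediction_merged):
    """Relative drop from query_merged to prediction_merged, in percent."""
    if query_merged == 0:
        raise ValueError("query_merged must be non-zero")
    return 100.0 * (query_merged - prediction_merged) / query_merged


# Placeholder values for illustration only (not results from the paper).
print(percent_reduction(50.0, 25.0))
```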

Thanks again for reviewing our paper! We are more than willing to have a follow-up discussion with you if you still have any further concerns!

Comment - Follow-Up on Submission Responses

2024-08-10

Authors.

Comment - Reply to the rebuttal (JT2M)

2024-08-13

Thank you for the reply. My concerns have been mostly addressed, so I am considering raising my rating. I hope the authors will further elaborate in the revised manuscript on how their work diverges from related studies in prompt learning, such as [1, 2]. Additionally, I believe the other areas I highlighted in my review need further clarification.

[1] Promptstyler: Prompt-driven style generation for source-free domain generalization. ICCV, 2023.

[2] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification. ACM MM, 2023.

Comment - Thank You for Reading and Consideration

2024-08-14

Dear Reviewer Jt2M,

Thank you for your positive response and for raising your score! We are glad that our clarifications addressed your concerns. In our revised manuscript, we will make sure to include a discussion on the contrasts between our work and the literature you referenced [1, 2]. We will also further elaborate on the additional concerns you have raised in your review.

[1] Promptstyler: Prompt-driven style generation for source-free domain generalization. ICCV, 2023.

[2] Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification. ACM MM, 2023.

Best,

Authors.

Author Response

2024-08-07

Dear Reviewers,

We would like to extend our sincere gratitude for your thoughtful and encouraging feedback.

We are pleased to see that our exploration into class-agnostic object detection was recognized as practical and universal in real-world scenarios, with performance improvements over baseline approaches (Reviewer Jt2M, 86g8, fTsG). Your acknowledgment of the foundational potential of class-agnostic detection for large vision models and the novelty of our prompt set expansion method is greatly appreciated (Reviewer 86g8, fTsG).

We are grateful for the positive feedback on the clarity, structure, and quality of our writing and figures (Reviewer Jt2M, 86g8, UZCf). The recognition of the potential impact of our DiPEx approach on the state-of-the-art in both class-agnostic and out-of-distribution detection is particularly gratifying (Reviewer UZCf, fTsG).

The appreciation for our geometrically motivated and intuitive approach to expanding text queries for grounding models, along with the comprehensive experimental results on benchmark datasets like MS-COCO and LVIS, further encourages us (Reviewer fTsG)!

The attached PDF contains additional experiments for reference:

Visualization of the attended areas activated by the learned prompts
The impact of different lengths of prompts for initialization
Inference cost comparison in terms of time and memory consumption

Thank you once again for your valuable feedback. We will incorporate the revisions accordingly to improve the quality of this work. Please let us know if you have any additional questions, and we are more than happy to address them!

Final Decision: Accept (poster)

2024-09-25

This paper presents a method for class-agnostic object detection with an interesting prompt expansion method. The three reviewers are all positive about the novelty and technical contributions of the work. They had some comments about supplementing the paper with necessary experimental results and adding missing references. The authors may consider those comments in the revision. The AC is satisfied with the submission and recommends acceptance as a poster.