Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
We present VL-SAM, a framework that combines a vision-language model with the segment-anything model to address the open-ended object detection and segmentation task.
Abstract
Reviews and Discussion
This paper proposes a training-free open-ended object detector that leverages a VLM (CogVLM) to recognize and roughly locate objects (through the attention map) and prompts SAM with the resulting coarse points. To generate accurate point prompts, VL-SAM utilizes techniques such as head aggregation, attention flow, attention score regularization, iterative refinement, and scale/prompt ensembles. VL-SAM achieves state-of-the-art performance on open-ended LVIS and CODA. The ablation study shows that each component improves detection performance.
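To make the pipeline concrete, below is a minimal sketch of the "attention as prompts" step as I understand it from the paper: the peak of the VLM attention map for a generated object name is used as a positive point prompt for SAM. The checkpoint path is a placeholder, and the single-point prompting omits the iterative refinement and ensembling steps described above, so treat this only as an illustration.

```python
# Minimal sketch (not the authors' code): use the peak of a VLM attention map for one
# generated object name as a point prompt to SAM. Omits iterative refinement and the
# scale/prompt ensembles; the checkpoint path is a placeholder.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def segment_from_attention(image, attn_map):
    """image: HxWx3 uint8 RGB array; attn_map: HxW attention map for one object name,
    assumed to be upsampled to the image resolution."""
    predictor.set_image(image)
    y, x = np.unravel_index(np.argmax(attn_map), attn_map.shape)  # strongest activation
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]], dtype=np.float32),  # SAM expects (x, y) order
        point_labels=np.array([1]),                         # 1 = positive (foreground) point
        multimask_output=True,
    )
    return masks[np.argmax(scores)]                         # keep the highest-scoring mask
```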
Strengths
Combining already trained large models to achieve tasks that each cannot accomplish individually is a promising direction. The techniques proposed by the authors—head aggregation, attention flow, attention score regularization, iterative refinement, and scale/prompt ensemble—are all intuitively effective methods for improving zero-shot performance.
Weaknesses
- Speed. For each image, the VL-SAM approach first requires passing through CogVLM-17B ten times, with each pass covering five scales (the full image plus four sub-images). Afterward, attention maps need to be stored and aggregated to obtain the prompts. Each prompt then requires iterative refinement, with each refinement step involving multiple passes through the SAM-Huge model, and each step may also require post-processing such as NMS. Such heavy and non-parallelizable computation raises curiosity about the model's FPS. Is the performance gain really worth it?
- Unclear experimental details. Table 3 indicates that the largest improvements come from the question ensemble and multi-scale operations. First, these operations cannot be considered core contributions of VL-SAM, as they are widely used tricks. Second, examples of the question prompts are not provided. Given that the question ensemble improves performance significantly, additional analysis is required to understand why.
- SOTA performance. On the LVIS minival dataset, the authors primarily compare against GenerateU, as mentioned in line 114. DetCLIPv3 also proposed an open-ended setting and achieved higher performance, yet this is not reflected in Table 1. Additionally, it should be noted that GenerateU used different types of CLIP text encoders for training and evaluation. In my attempt to unify both for retraining, I achieved approximately 25 box AP on rare classes, which is close to the 23.4 reported in the paper (though this will not affect my scoring, as it is not an issue with VL-SAM). On the CODA dataset, VL-SAM achieves performance far exceeding other methods. However, considering that the evaluation metric is AR, a natural question arises: what would the AR be if SAM's segment-anything mode (uniform sampling) were used? This is necessary to demonstrate the need for VL-SAM. Additionally, in Table 3, VL-SAM without the two ensemble methods achieves only 14.1 on CODA, while using both methods introduces significant speed overhead (see point 1).
- Parameters. VL-SAM defaults to using some of the largest foundation models (SAM-Huge, ViT-E). Is the comparison of VL-SAM's parameter count to other methods fair? Additionally, what would VL-SAM's performance be if smaller-scale models were used?
Questions
- Typos: line 44, "as a pure visiothe n model".
- What is the question-prompt, and how does it affect the final performance?
- How much would VL-SAM's performance be affected by the model size?
- see weakness.
Limitations
Yes, the authors have adequately discussed the potential negative societal impact of their work and its limitations.
Q1: Speed and parameters.
As we discussed in the limitations, the speed and parameter issues can be gradually overcome by recent lightweight models. Besides, the main purpose of this paper is to provide a feasible solution to the open-ended perception challenge; thus, we do not focus on speed and parameter count at this stage.
Additionally, we analyze the effect of SAM's backbone size on VL-SAM, as shown in the following table. We can find that a larger model achieves better performance for VL-SAM. Moreover, when the SAM backbone's parameters are reduced from 636 M to 91 M, the performance of VL-SAM drops by only 2.2 mAR. This demonstrates the effectiveness of the proposed framework once again.
| SAM backbone | Parameters | CODA mAR |
|---|---|---|
| ViT-H | 636 M | 40.1 |
| ViT-L | 308 M | 39.3 |
| ViT-B | 91 M | 37.9 |
Q2: Details and analysis of prompt generation.
We directly ask the VLM to generate question prompts by itself, without giving it any images. Specifically, we use the question:
If we want you to list all possible objects in the given image, what questions should we ask? Please give 10 questions you prefer.
Here is the answer from CogVLM:
(1) Please analyze the image and list all the objects present.
(2) Identify and provide a comprehensive list of objects in the image.
(3) Using the image, generate a detailed inventory of all objects visible.
(4) Analyze the image and extract information about any objects present.
(5) Please describe all the objects you can identify in the image.
(6) From the image, generate a report listing all objects detected.
(7) Utilizing the image, identify and report all objects visible.
(8) Analyze the image and provide a comprehensive breakdown of all objects seen.
(9) Please process the image and generate a detailed list of all objects present.
(10) Using the image, identify and describe all objects observed.
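For clarity, a small sketch of how such a question ensemble could be applied is given below; the `ask_vlm` callable and the naive answer parsing are placeholders for illustration rather than our exact implementation.

```python
import re

def prompt_ensemble_object_names(image, questions, ask_vlm):
    """Union the object names produced by the VLM over all question prompts.

    `ask_vlm(image, question) -> str` is a placeholder for the underlying VLM call
    (e.g., a CogVLM chat query); the answer parsing below is intentionally naive.
    """
    names = set()
    for question in questions:                      # e.g., the ten questions listed above
        answer = ask_vlm(image, question)
        for token in re.split(r",|\band\b|\n", answer):
            token = token.strip().strip(".").lower()
            if token:
                names.add(token)
    return sorted(names)
```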
Moreover, as Reviewer 6VtQ mentioned, we provide the average recall (AR) of the VLM for object name generation in the following table (AR@x: a generated name is viewed as positive if its CLIP similarity score with the ground-truth name is larger than x; three thresholds are reported). We find that the prompt ensemble significantly alleviates the object-missing problem in the VLM and thus improves overall performance.
| Dataset | Prompt Ensemble | AR@x1 | AR@x2 | AR@x3 |
|---|---|---|---|---|
| LVIS | No | 0.973 | 0.932 | 0.404 |
| LVIS | Yes | 0.988 | 0.984 | 0.604 |
We will add the discussion to the paper.
Q3: AR of SAM's segment-anything mode (uniform sampling).
We use SAM's segment-anything mode to obtain box proposals and calculate the class-agnostic AR, which is 29.7. However, the boxes predicted by SAM do not contain category predictions. In contrast, VL-SAM predicts boxes together with their categories.
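For reference, the segment-anything-mode proposals can be obtained along the following lines; the checkpoint path and grid density are placeholders and not necessarily the exact settings behind the 29.7 AR.

```python
# Sketch of class-agnostic box proposals from SAM's segment-anything mode
# (uniform point sampling); no categories are predicted.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)  # assumed grid density

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)
# Each entry carries a class-agnostic box in XYWH format.
boxes = [m["bbox"] for m in masks]
```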
I have read all the rebuttals from the authors and the reviews from the other reviewers. I appreciate the authors' efforts. One of the motivations for open-ended object detection is to make object detection more applicable in real-world scenarios. While speed might not be the primary concern, other open-ended methods in the literature also use heavy decoders (compared to the MLPs in traditional detectors) to generate class names, and the operations proposed in this paper are clearly associated with significant delays. As a technical paper, it is common to consider the trade-off between speed and accuracy. I do not wish to undermine the advantages of the proposed training-free open-ended approach, but in the absence of other exceptional contributions, and given that proposed operations such as "Iterative Refine" and "Multi-scale" evidently introduce delays, a detailed discussion is essential. I will raise my score to Borderline, but with a slight inclination towards rejection.
Thanks for your valuable time and your reply. We respond to your remaining concerns below:
Q: Consider the trade-offs between speed and accuracy
A: We agree that many technical papers consider the trade-off between speed and accuracy. However, many papers focus solely on developing high-accuracy detectors, including InternImage [1], Co-DETR [2], and CBNet [3]. Besides, high-accuracy detectors (including our method) can be applied in offline scenarios, such as auto-labeling autonomous driving data.
Thus, we argue that the proposed method can be a feasible solution to address the open-ended perception challenge and should not be ignored by the community.
We sincerely thank you for your active involvement in the discussion and hope to hear more from you about your concerns and kind suggestions.
[1] Wang, Wenhai, et al. "Internimage: Exploring large-scale vision foundation models with deformable convolutions." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
[2] Zong, Zhuofan, Guanglu Song, and Yu Liu. "Detrs with collaborative hybrid assignments training." Proceedings of the IEEE/CVF international conference on computer vision. 2023.
[3] Liang, Tingting, et al. "Cbnet: A composite backbone network architecture for object detection." IEEE Transactions on Image Processing 31 (2022): 6893-6906.
This work proposes a novel approach for the so-called open-ended detection (and segmentation) problem, which is about localizing and naming objects in a given image without the user having to specify any pre-defined label space. This is relevant for scenarios where it is hard to define a complete list of object categories that are relevant for the task. For example, autonomous agents acting in the real world may encounter all sorts of objects, but they may not be detected if their category name is missing in the pre-defined label space. The proposed algorithm is interesting because it does not require any model training, but uses readily available vision-language and class-agnostic segmentation models. The experiments also demonstrate strong performance compared to prior works in this setting.
Strengths
- Although one might argue that the proposed approach is just a collection of existing models, the proposed combination works, does not require any training, and achieves good results.
- I think the proposed method can be a very useful tool for many in the research community and in industry. The paper can also be seen as a recipe on how to build similar pipelines for other tasks, without the need to train models.
- From Section 3 onward, I think the paper was easy to follow. The method is well explained.
- The ablation study of all individual components is great.
Weaknesses
- Experiments
- The numbers for "open-set" (or open-vocabulary) in Table 1 are not state-of-the-art. Looking at OWLv2 [A], even the models not using the LVIS dataset achieve 35.4, 39.0 and 40.9 AP-rare on LVIS-mini with different backbones.
- The numbers for non-rare classes in LVIS are not reported in Table 1. Assuming the proposed approach should be a competitive replacement for other object detectors (fixed set, open-set), then I would want to know the numbers on the base classes as well. I expect such a method that relies on attention maps for localization may underperform compared to other methods.
- One limitation of the proposed two-step pipeline can be that the initial VLM misses some object names in the caption. The model would not recover even if the segmentation model could segment those objects. Although I saw some Average Recall (AR) numbers in the experiments, I'd be interested in recall of the VLM alone in identifying all object category names that are present. That would have been an easy experiment that should be conducted on standard detection datasets like COCO and LVIS.
- Paper writing
- I had real trouble reading the abstract (and most parts of the introduction as well) and trying to understand what task is being solved. The exact definitions of the terms open-world, open-set, and open-ended used in the abstract may not be clear to everyone. In fact, I think these terms are used inconsistently throughout the ML/CV literature.
- The statement that "pre-defined object categories are not available in real-world scenarios" probably needs (much) more context to be a valid argument. For instance, autonomous vehicle companies have a list of a few hundred object categories that they expect on the road - hence, it's pre-defined - while unknown objects are handled differently through generic obstacle detection.
- The name "GenerateU" is used in the abstract without a reference - that's not common knowledge yet - it's a CVPR'24 paper that was recently presented.
- More details are needed in some parts of the paper:
- The reason for collapse of the aggregated attention maps and the corresponding regularization need more details.
- Section 3.6 needs more details. To me it seems like the VLM is asked to generate prompts for itself. Is that then dependent on the input image? And what's the prompt to generate more question pairs?
References:
- [A] Scaling Open-Vocabulary Object Detection. Minderer et al. NeurIPS'23
Questions
- The open-set methods described in lines 104ff are also often referred to as open-vocabulary perception models; I would say that is the more common term.
- Typo in line 153? "casual" -> "causal"? Same in line 169.
- The indexing in Eq. 3 regarding the layer index is inconsistent with the above definition.
Limitations
Limitations of the proposed method were discussed in the main paper. One potential limitation to add would be the inherent two-step approach of the proposed pipeline which cannot recover from mistakes that the initial VLM makes in the captioning output, for instance missing object names.
Q1: The performance of OWLv2.
Thanks for the reminder. We will add it to the paper.
Q2: Results for non-rare classes in LVIS.
The following table provides results for the common (AP_c) and frequent (AP_f) classes of LVIS. We can find that, though Close-set and Open-set methods achieve better results on AP_c and AP_f, they require the 1203 object names from LVIS as input. Besides, our method is training-free and achieves competitive results with GenerateU.
| Method | Type | Require Category | Training | Mask | AP_r | AP_c | AP_f |
|---|---|---|---|---|---|---|---|
| Mask R-CNN | Close-set | Yes | Yes | Yes | 26.3 | 34.0 | 33.9 |
| Deformable DETR | Close-set | Yes | Yes | No | 24.2 | 36.0 | 38.2 |
| GroundingDINO | Open-set | Yes | Yes | No | 18.1 | 23.3 | 32.7 |
| DetCLIP | Open-set | Yes | Yes | No | 26.9 | 33.9 | 36.3 |
| YOLOWorld | Open-set | Yes | Yes | No | 27.1 | 32.8 | 38.3 |
| GenerateU | Open-ended | No | Yes | No | 22.3 | 25.2 | 31.4 |
| Ours | Open-ended | No | No | Yes | 23.4 | 25.3 | 30.0 |
Q3: Average recall (AR) of VLM for object name generation.
The following table provides the AR of the VLM for object name generation. AR@x denotes the average recall in which a generated name is viewed as positive if its CLIP similarity score with the ground-truth category name is larger than x; the table reports three thresholds x1, x2, x3.
| Dataset | Prompt Ensemble | AR@x1 | AR@x2 | AR@x3 |
|---|---|---|---|---|
| LVIS | No | 97.3 | 93.2 | 40.4 |
| LVIS | Yes | 98.8 | 98.4 | 60.4 |
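A minimal sketch of how AR@x can be computed is shown below; the CLIP variant and the absence of a prompt template are illustrative assumptions rather than our exact setup.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")   # placeholder CLIP variant
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_embed(names):
    inputs = tokenizer(names, padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def ar_at_x(gt_names, generated_names, x):
    """A ground-truth category counts as recalled if some generated name has a CLIP
    text similarity larger than x with it."""
    sim = text_embed(gt_names) @ text_embed(generated_names).T      # cosine similarities
    return (sim.max(dim=1).values > x).float().mean().item()
```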
Q4: Cannot recover missing objects.
As shown in the table of Q3, we are currently trying to reduce the number of missing objects with the prompt ensemble. In the future, we will adopt techniques like Chain-of-thought to recover missing objects.
Q5: Clarifying the concepts of open-world, open-set, and open-ended perception.
As the reviewer mentioned, these three terms are used inconsistently in the ML/CV literature. In our opinion, open-world perception is the broad concept: it aims to give precise results in dynamic and unpredictable environments that contain novel objects and involve scene domain shifts. Open-set and open-ended perception are subtasks of open-world perception that try to address the novel-object problem. Specifically, open-set perception, like the grounding task, predicts object locations when given images and novel object names. In contrast, open-ended perception is similar to traditional perception in that it predicts novel object locations and their names simultaneously when given only images. We will add this discussion to the paper.
Q6: Argument of "predefined object categories are not available in real-world scenarios."
The reviewer mentioned that "autonomous vehicle companies have a list of a few hundred object categories - hence, it's predefined." Though they have predefined hundreds of categories, there are still categories they may not include, such as various rare animals. Besides, some objects cannot be represented by a simple category name, such as a human in an animal costume, which may look like an animal but is actually a human. As the reviewer said, generic obstacle detection can handle some unknown objects. However, many things do not have a significant 3D shape, such as pits or grains on the ground. Thus, predefined-category (open-set) methods combined with generic obstacle detection cannot handle all situations. We will add this discussion to the paper.
Q7: Reason for attention map collapse during aggregation.
The collapse in attention map aggregation is caused by the causal mask. For example, assume uniformly distributed attention maps ($3\times 3$ for simplicity) in every transformer layer (2 layers in total for simplicity):

$$\begin{pmatrix} \frac{1}{3}&\frac{1}{3}&\frac{1}{3} \\ \frac{1}{3}&\frac{1}{3}&\frac{1}{3} \\ \frac{1}{3}&\frac{1}{3}&\frac{1}{3} \end{pmatrix}$$

After applying the causal mask to each layer and using the attention rollout method (multiplying the two masked layers), we obtain the final attention map:

$$\begin{pmatrix} \frac{1}{9}&0&0 \\ \frac{2}{9}&\frac{1}{9}&0 \\ \frac{1}{3}&\frac{2}{9}&\frac{1}{9} \end{pmatrix}$$

We can find that, for each row, the number in a front column is greater than that in a back column, e.g., $\frac{2}{9}>\frac{1}{9}$ and $\frac{1}{3}>\frac{2}{9}>\frac{1}{9}$. Thus, simply adopting attention rollout with the causal mask will centralize the attention activation in the front patches, i.e., toward the top-left corner of the image, as shown in Figure 5 of the paper.
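A minimal numpy check of the toy example above, assuming the causal mask simply zeroes the upper triangle of the uniform attention map without re-normalizing the rows:

```python
import numpy as np

uniform = np.full((3, 3), 1.0 / 3.0)        # uniform attention in each of the 2 layers
causal_mask = np.tril(np.ones((3, 3)))      # lower-triangular causal mask
masked = uniform * causal_mask              # masked per-layer attention (no re-normalization)

rollout = masked @ masked                   # attention rollout over the 2 layers
print(np.round(rollout, 3))
# [[0.111 0.    0.   ]
#  [0.222 0.111 0.   ]
#  [0.333 0.222 0.111]]
# Every row puts more mass on earlier columns, so the aggregated attention
# collapses toward the first (top-left) image patches.
```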
Q8: Details of regularization.
As discussed in Q7, directly adopting attention rollout makes the values in the front columns greater than those in the back columns. We introduce a simple regularization method that multiplies the front columns by a small term and the back columns by a larger one; the value of the term changes linearly across the columns.
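Since the exact linear schedule is not spelled out here, the following sketch only illustrates the idea with an assumed linear weighting (smaller weights on front columns, larger weights on back columns):

```python
import numpy as np

def regularize_columns(attn, w_min=0.1):
    """Down-weight front columns and up-weight back columns with a linear schedule.
    The schedule (w_min to 1.0) is an assumption for illustration only."""
    n = attn.shape[-1]
    weights = np.linspace(w_min, 1.0, n)    # small term for front columns, larger for back
    return attn * weights                   # broadcast over rows

rollout = np.array([[1/9, 0, 0],
                    [2/9, 1/9, 0],
                    [1/3, 2/9, 1/9]])
print(np.round(regularize_columns(rollout), 3))
```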
Q9: Details of prompt generation.
We directly ask the VLM to generate question prompts by itself, without giving it any images. Specifically, we use the question:
If we want you to list all possible objects in the given image, what questions should we ask? Please give 10 questions you prefer.
Here is the answer from CogVLM:
1) Please analyze the image and list all the objects present.
2) Identify and provide a comprehensive list of objects in the image.
3) Using the image, generate a detailed inventory of all objects visible.
4) Analyze the image and extract information about any objects present.
5) Please describe all the objects you can identify in the image.
6) From the image, generate a report listing all objects detected.
7) Utilizing the image, identify and report all objects visible.
8) Analyze the image and provide a comprehensive breakdown of all objects seen.
9) Please process the image and generate a detailed list of all objects present.
10) Using the image, identify and describe all objects observed.
We will add the details in the paper.
Q10: Typos and indexing inconsistency.
Thanks. We will revise them in the paper.
I have read the rebuttal. I appreciate the authors' efforts in addressing all of my concerns - most of them with satisfactory explanations. I currently intend to keep my rating positive.
In this paper, the authors proposed to combine vision language models and segment anything model (SAM) for open-ended object detection. Attention maps generated from vision language models were used to prompt SAM. Experiments on multiple benchmark datasets demonstrated better performance over several baseline methods.
Strengths
- Combining vision language model and SAM is very interesting and could lead to lots of useful applications. In this paper, the authors proposed an efficient and effective approach to connect them.
- The proposed attention map generation from vision language models and prompt generation for SAM make sense and are technically sound to me.
- Extensive experiments and ablation studies were conducted to validate the effectiveness of the proposed approach.
Weaknesses
- Even though the authors claimed that the proposed approach is open-ended, the evaluation is still on predefined object category names and the performance is worse than some open-set baseline methods.
- The performance of the proposed approach heavily depends on the vision-language model, since the attention map is the key component. Thus, how to choose a good vision-language model is essential for the proposed framework. Even though the authors did an ablation study of the generation model in Table 5, I would expect more discussion on how to choose the best vision-language model. Are there any principles or required characteristics?
- Several typos: "visiothe n" in line 44 and "Generobtainsnd" in line 71.
Questions
Please refer to the weaknesses section for further discussion on the evaluation and the choice of vision-language model.
Limitations
The authors discussed the limitations in the draft and I have no further suggestions for improvement.
Q1: Evaluation is still on predefined object category names.
The inference of VL-SAM does not rely on predefined object category names; we only use them to calculate the evaluation metrics.
Specifically, as Lines 211-213 mentioned, we use VL-SAM to generate object categories by itself. However, the generated object categories from VL-SAM may not align with the category names specified in the LVIS dataset. For example, VL-SAM may generate "kids" or "adults," while the LVIS dataset provides a "person" label for these objects. To address this, we follow GenerateU to use the CLIP text encoder to map generated object categories from VL-SAM to the predefined class names in LVIS for mAP evaluation by calculating their similarities.
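As an illustration, the mapping could be implemented roughly as follows; the CLIP variant, the prompt template, and the three-name LVIS subset are placeholders rather than our exact configuration.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")   # placeholder CLIP variant
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(names):
    prompts = [f"a photo of a {n}" for n in names]                  # assumed prompt template
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

lvis_names = ["person", "car", "dog"]        # placeholder subset of the 1203 LVIS classes
generated = ["kids", "adults", "sedan"]      # names produced by the VLM

sim = embed(generated) @ embed(lvis_names).T                        # cosine similarity matrix
mapped = [lvis_names[i] for i in sim.argmax(dim=1).tolist()]
print(mapped)   # expected to map "kids"/"adults" to "person" and "sedan" to "car"
```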
Q2: Performance is worse than open-set methods.
Our method addresses the open-ended detection and segmentation problem, which is different from the open-set problem. As mentioned in Lines 34-39, open-set methods require predefined object categories as inputs during inference. In contrast, open-ended methods do not need to input predefined categories and can predict the object categories themselves. Therefore, it is unfair to compare the performance of our method with open-set methods directly. They are two different tasks.
Moreover, our method is training-free, while open-set methods listed in the paper need additional training.
Q3: Vision language model selection.
As shown in Table 5 of our paper, empirical results demonstrate that VLMs with more powerful multi-modal chat and reasoning capabilities perform better when integrated into VL-SAM. Thus, we suggest using a stronger VLM. Besides, our framework is general. Any VLMs that generate object names from images and provide attention maps can be incorporated.
Q4: typos.
Thanks. We will revise them accordingly.
Thanks for the rebuttal. Since some of the concerns are addressed, I will raise my rating to borderline accept.
We thank the reviewers for their constructive comments. They recognize that our work can "lead to lots of useful applications" (71UW, 6VtQ, bgFd), "achieves good results" (71UW, 6VtQ), and is "well explained" (71UW, 6VtQ) and "effective" (71UW, bgFd). We address the reviewers' concerns in the rebuttal text below.
Dear reviewers,
Thank you for the comments on our paper. We have submitted the responses to your comments. Please let us know if you have additional questions so that we can address them by the end of the discussion period (Aug 13, 11:59 pm AoE). We hope that you will consider raising the score after we address all the issues.
Sincerely, Authors of submission 9204
The paper received one Borderline Reject and two Borderline Accepts. All reviewers agree that the paper introduces a novel system that combines a VLM with SAM for open-ended object detection. The method adopts the attention map in a VLM as a localized prompt to SAM for object segmentation. 6VtQ and 71UW agree that the combination of a VLM with SAM could lead to many useful applications. The main concern of bgFd is that the proposed model, while novel, may not be suitable for real-world applications due to the latency of its iterative refinement. After considering the reviews, rebuttal, and manuscript, the AC agrees that the proposed method could inspire future work on grounding the text outputs of a VLM to localized regions in a training-free way. The novelty outweighs the practical consideration raised by bgFd about deploying the method in real-time applications. Consequently, the AC recommends accepting the paper.