PaperHub

Overall rating: 5.8/10 (Poster, 4 reviewers; min 5, max 7, std 0.8)
Individual ratings: 5, 5, 6, 7
Confidence: 3.8 | Correctness: 2.8 | Contribution: 2.5 | Presentation: 2.5
NeurIPS 2024

Recognize Any Regions

OpenReview | PDF
Submitted: 2024-05-15 | Updated: 2024-11-06


Keywords: Open Vocabulary Object Recognition; Zero-shot; Representation Learning

Reviews and Discussion

Review (Rating: 5)

The paper presents RegionSpot, a novel architecture designed for efficient open-world visual region recognition. The primary goal of RegionSpot is to leverage the strengths of powerful pretrained foundation models, specifically a localization model (SAM) and a vision-language model (CLIP), to improve the recognition of individual regions or patches within images. RegionSpot focuses on keeping both foundation models frozen and optimizing only a lightweight attention-based knowledge integration module. This results in significant computational savings and reduced training time. Extensive experiments demonstrate that RegionSpot outperforms state-of-the-art models, showing substantial gains in mean Average Precision (mAP) and especially excelling in recognizing rare and challenging categories.

Strengths

The combination of SAM and CLIP in a frozen state with a lightweight attention-based module is a unique approach that leverages pre-existing models' strengths.

The approach significantly reduces training time and computational resources, making it more practical for real-world applications.

The model demonstrates substantial improvements over previous methods, particularly in recognizing rare and challenging categories.

Weaknesses

  1. Dependency on Pretrained Models: The approach is heavily reliant on the quality and capabilities of the SAM and CLIP models. Any inherent limitations or biases in these models could impact RegionSpot's performance.

  2. Lack of Novel Algorithmic Innovations: While RegionSpot's integration of SAM and CLIP is innovative, the methodology itself does not introduce fundamentally new algorithms or theoretical advancements in the field of computer vision. The primary contribution lies in the effective use of existing models rather than developing new techniques or algorithms.

  3. Absence of New Training Paradigms: The approach focuses on combining pretrained models in a novel way but does not offer new training paradigms or optimization strategies. This could be seen as a limitation in terms of pushing the boundaries of current methodologies.

Questions

  1. This paper lacks a detailed comparison with state-of-the-art models and does not provide reference links for the methods used in the comparison.

  2. Can you provide more detailed information on the computational resources required for training and inference? How do these requirements compare to other state-of-the-art models?

  3. Have you analyzed the failure cases where RegionSpot did not perform well? What were the common reasons for these failures?

  4. How significant is the impact of keeping the foundation models frozen during training? Have you experimented with fine-tuning these models to assess any potential performance gains?

  5. Have you explored the impact of using different types of prompts or additional features in your model?

Limitations

  1. The authors should provide a more detailed discussion on the limitations of relying on specific pretrained models. They could explore scenarios where these models might fail or underperform, such as in completely novel domains or with significantly different data distributions.

  2. Identifying and discussing potential performance bottlenecks within the RegionSpot architecture would be beneficial. This includes the integration module and cross-attention mechanism, which might limit scalability or introduce latency in real-time applications.

  3. The evaluation is conducted on a few specific benchmarks. Broader evaluation across different datasets and tasks would provide a more comprehensive understanding of the model's capabilities and limitations.

  4. The method relies on external region proposals or ground truth bounding boxes. Integrating end-to-end learning for region proposal and classification could further improve efficiency.

Author Response

We thank the reviewer for their insightful comments.

Q1: Dependency on Pretrained Models.

R1: Many thanks for these great comments. Integrating rich pre-trained models to enhance a target task has become a trending research focus in AI, particularly as these foundation models grow stronger and heavier [1-3]. The motivations are manifold: for example, using pre-trained models saves computational power and time, making powerful models accessible without extensive resources. This is responsible, green, and scalable, and our work falls in this realm. Additionally, RegionSpot is not limited to CLIP and SAM in general; they are selected in our implementation for their strong performance. RegionSpot can flexibly integrate more advanced ViL and localization foundation models, such as InternVL [4] and SAM 2 [5].

[1] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.

[2] Visual Instruction Tuning. NeurIPS 2023.

[3] Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023.

[4] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. CVPR 2024.

[5] SAM 2: Segment Anything in Images and Videos. arXiv preprint.

Q2: Lack of Novel Algorithmic Innovations and Absence of New Training Paradigms

R2: Introducing fundamentally new architectures/networks and training paradigms is one form of algorithmic innovation, but not the only one. We consider our approach of leveraging existing foundation models to achieve superior region recognition to be significant, as fine-grained region understanding is an essential requirement in computer vision. It also advocates reuse of computing resources, avoiding the need to develop task-specific foundation models. We therefore argue that this work matters.

Q3: This paper lacks a detailed comparison with the state-of-the-art models, and does not provide reference links for the methods used in the comparison.

R3: We have provided detailed comparisons of training data, training time, learnable parameters, and performance in Tables 1 and 2 and the appendix. We will further include additional information, such as inference time, in the final version. Additionally, we will add reference links to every table.

Q4: Can you provide more detailed information on the computational resources required for training and inference? How do these requirements compare to other state-of-the-art models?

R4: Thanks. We have provided the training data and training time in Table 1 and the appendix.

To demonstrate the efficiency of our proposed RegionSpot, we compared the training and inference speeds of RegionSpot with GroundingDINO-T and GLIP-T on the zero-shot object detection benchmark on LVIS. We analyzed model performance, training time in GPU hours, and inference latency on the same hardware (an NVIDIA V100 GPU). Since GroundingDINO does not release its training code or LVIS evaluation code, we measured only its inference latency by adapting the provided simple inference script.

As shown in Table 1, compared to GLIP we achieve a 460x speed-up in training time. Additionally, RegionSpot achieves 6.5 FPS (0.15 s/image) on a single V100 during inference on LVIS, including all component processes such as the RPN, SAM, and CLIP. In contrast, GLIP-T and GroundingDINO-T achieve only 0.2 FPS (5 s/image) and 0.14 FPS (7.1 s/image), respectively, due to their visual-text concept alignment through sequential formulation and early fusion. Despite using two foundation models, our inference speed surpasses GLIP and GroundingDINO thanks to: (1) low-resolution inputs for CLIP, (2) parallel region-text token formulation from CLIP, (3) parallel multi-prompt processing in the SAM decoder, (4) a lightweight decoder, and (5) a faster RPN proposal generator. Note that in the paper, for a fair comparison, we used the same proposals as GLIP; in practice, any proposal generator can be used. These additions, prompted by the reviewer's feedback, will clearly improve the paper, and we will clarify them in the revision.

Table 1: Efficiency comparison on LVIS val v1.0.

| Method | Training (GPU Hours) | Inference (FPS) | AP_r |
|---|---|---|---|
| Grounding DINO-T [1] | - | 0.2 | - |
| GLIP-T [2] | 92.1K | 0.14 | 10.1 |
| RegionSpot+RPN | 0.2K | 6.5 | 14.2 |

[1] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV 2024.

[2] Grounded Language-Image Pre-training. CVPR 2022.
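
For concreteness, here is a minimal sketch of one way such per-image latency/FPS numbers could be measured on a GPU. It is an assumption about the measurement setup, not the authors' actual script; `measure_fps`, `model`, and `images` are placeholder names, and a CUDA device is assumed.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=5):
    """Rough per-image latency/FPS on GPU; model and images are placeholders."""
    for img in images[:warmup]:          # warm up kernels and caches
        model(img)
    torch.cuda.synchronize()             # make sure all queued GPU work is done
    start = time.perf_counter()
    for img in images[warmup:]:
        model(img)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / max(len(images) - warmup, 1)
    return 1.0 / latency                 # e.g. 0.15 s/image corresponds to ~6.5 FPS
```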

Q5: Have you analyzed the failure cases where RegionSpot did not perform well? What were the common reasons for these failures?

R5: Great question. As shown in Table 1, recognition accuracy drops when going from ground-truth boxes to SAM proposals or GLIP boxes; this holds for all region recognition methods, including ours. In other words, the accuracy of object localization still matters to performance, even though existing localization methods are already strong, with some room left for further improvement. We will add more visualization analysis.

Q6: How significant is the impact of keeping the foundation models frozen during training? Have you experimented with fine-tuning these models to assess any potential performance gains?

R6: Great question. We did not attempt to fine-tune these foundation models, as doing so would break SAM's zero-shot segmentation ability through catastrophic forgetting, as verified by HQ-SAM [1] and F-VLM [2], while also making the training process more resource intensive. We will clarify.

[1] Segment Anything in High Quality. NeurIPS 2023.

[2] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. ICLR 2023.

Q7: Have you explored the impact of using different types of prompts or additional features in your model?

R7: Great question. We performed a prompt engineering ablation study, as shown in Table 7(a) of the main paper. We also explored different CLIP feature styles and position-aware tokens from SAM to verify RegionSpot in Tables 6(a) and (b) of the main paper. We will further highlight these results.

Comment

I have thoroughly read the authors' responses and the comments from other reviewers. Thank you for the detailed answers to my questions. I am willing to upgrade my vote to "Borderline accept."

Comment

We thank the reviewer for recommending acceptance. We appreciate the constructive discussion and will incorporate it into our final version.

Review (Rating: 5)

To address open-world object detection, this paper proposes RegionSpot, which combines the localization capabilities of SAM with the classification strengths of CLIP. RegionSpot integrates position-aware tokens from SAM with image-level feature maps extracted from CLIP, creating region-level semantic tokens. These tokens are then aligned with text representations to enhance recognition accuracy. RegionSpot achieves state-of-the-art performance on the LVIS and ODinW benchmarks.

Strengths

  1. The approach of forming position-aware tokens from SAM and the way they interact with CLIP features are innovative. These tokens, containing localization features, should enhance detection capabilities.
  2. RegionSpot demonstrates substantial performance improvements compared to other methods and baselines across various settings.

Weaknesses

  1. While effective, the method by which RegionSpot uses position-aware tokens is somewhat implicit. It is not entirely clear how these localization features directly contribute to the performance gains.
  2. Combining CLIP and SAM for detection, although effective, is relatively straightforward. Despite some non-trivial modifications, the overall novelty of the approach may be perceived as limited.
  3. A small typo in line 166 "Zero short inference"

Questions

  1. Figure 4 illustrates that the "position-aware token aligns effectively with the semantic feature map of the entire image." How do the tokens from SAM contain more effective localization features compared to tokens from other localization models, such as Mask-RCNN or CAM from CLIP? In other words, why was SAM chosen over other localization models?
  2. Instead of using the pipeline of RegionSpot, if I obtain all masks using SAM, forming a bounding box for each instance, then use CLIP to conduct zero-shot classification for detection, how are the results? And why the design of RegionSpot could outperform such a baseline?

Limitations

Yes. The authors claim that "While our method advances open world region understanding, it still (does) not unleash potential capabilities from the fundamental models, such as the automatic localization ability from SAM, which could reduce reliance on external region proposal mechanisms for object detection and enhance versatility."

Author Response

We thank the reviewer for their insightful comments.

Q1: While effective, the method by which RegionSpot uses position-aware tokens is somewhat implicit. It is not entirely clear how these localization features directly contribute to performance gains.

R1: We summarize the reasons for the performance gain. RegionSpot uses the position-aware token to query the semantic information extracted from a ViL model: the token has already learned position-aware information about the region from the pretrained SAM, along with context from outside the individual region, which provides extra cues for recognition when aligned with CLIP features. Meanwhile, the ViL feature map contains semantic information about the whole image. Hence, only a lightweight connector is needed to bridge the region-level position-aware information and the image-level semantic information to achieve open-world region-level understanding. Additionally, our ablation verifies that output tokens taken from different locations in SAM have different effects; please see Table 6(b) in the main paper. We will state these points more clearly in the final version.
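
To make the described connector concrete, below is a minimal PyTorch sketch of a lightweight cross-attention bridge between SAM position-aware tokens and a whole-image CLIP feature map. The class name `RegionTextConnector`, the tensor shapes, and the dimensions (`d_sam=256`, `d_clip=768`) are illustrative assumptions, not the authors' implementation; only this module would be trained, with SAM and CLIP kept frozen.

```python
import torch
import torch.nn as nn

class RegionTextConnector(nn.Module):
    """Sketch: region position tokens attend into the whole-image CLIP features.

    pos_tokens: (B, R, d_sam)  position-aware tokens, one per region prompt
    clip_feat:  (B, N, d_clip) flattened whole-image CLIP feature map
    Returns region-level semantic tokens of width d_clip.
    """
    def __init__(self, d_sam=256, d_clip=768, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_sam, d_clip)          # bring SAM tokens into CLIP width
        self.attn = nn.MultiheadAttention(d_clip, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_clip)

    def forward(self, pos_tokens, clip_feat):
        q = self.proj(pos_tokens)                     # queries: region position tokens
        ctx, _ = self.attn(q, clip_feat, clip_feat)   # cross-attend into image semantics
        return self.norm(q + ctx)                     # region-level semantic tokens


# Usage sketch with made-up shapes: 2 images, 10 region prompts each.
connector = RegionTextConnector()
pos_tokens = torch.randn(2, 10, 256)   # would come from the frozen SAM decoder
clip_feat = torch.randn(2, 196, 768)   # would come from the frozen CLIP image encoder
region_tokens = connector(pos_tokens, clip_feat)   # (2, 10, 768)
```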

Q2: Combining CLIP and SAM for detection, although effective, is relatively straightforward. Despite some non-trivial modifications, the overall novelty of the approach may be perceived as limited.

R2: Apologies for this misunderstanding. Integrating rich pre-trained models to enhance performance has become a trending research focus in AI. The motivations are manifold: for example, using pre-trained models saves computational power and time, making powerful models accessible without extensive resources. This is responsible, green, and scalable, and our work falls in this realm. We note that our novelty is neither introducing new architectures (e.g., cross-attention) nor simply assembling a pipeline of SAM and CLIP, which is our baseline in Table 1 of the main paper. Instead, our key idea is to enable efficient open-world region understanding by leveraging existing foundation models. In this study, we use the ViL foundation model CLIP and the localization foundation model SAM to validate our approach. We verify the effectiveness of RegionSpot through extensive testing on various tasks and datasets in a zero-shot manner. Additionally, we conducted ablation studies to explore how to fully leverage the pretrained capabilities of these foundation models. We will further clarify these points in the final version.

Q3: Figure 4 illustrates that the "position-aware token aligns effectively with the semantic feature map of the entire image." How do the tokens from SAM contain more effective localization features compared to tokens from other localization models, such as Mask-RCNN or CAM from CLIP? In other words, why was SAM chosen over other localization models?

R3: Great question, thanks. Although Mask R-CNN or CAM from CLIP can coarsely identify objects, they are generally inferior at localizing objects in the wild. In contrast, SAM, trained on billion-scale prompt-mask pairs, is capable of segmenting and localizing a wide range of visual structures in diverse scenarios, taking prompts consisting of points or a bounding box as input. Its zero-shot segmentation ability has led to a rapid paradigm shift. Because SAM adopts a DETR-style architecture, its prompt token already carries position-aware information for querying the object; hence we chose it as the localization foundation model. Our method is not limited to SAM, however, and can also use other SAM-like models, such as HQ-SAM [1] and SAM 2 [2]. We will further clarify.

[1] HQ-SAM: Segment Anything in High Quality. NeurIPS 2023.

[2] SAM 2: Segment Anything in Images and Videos. arXiv preprint.

Q4: Instead of using the pipeline of RegionSpot, if I obtain all masks using SAM, forming a bounding box for each instance, then use CLIP to conduct zero-shot classification for detection, how are the results? And why the design of RegionSpot could outperform such a baseline?

R4: In our submission, Table 1 already includes this suggested baseline: taking the SAM output mask, cropping the corresponding region, and feeding it to CLIP. Instead of cropping individual regions from an image, RegionSpot uniquely uses the position-aware token from SAM to query the corresponding semantic features from the whole-image CLIP feature map via cross-attention. This enables RegionSpot to model both the object content within each region and the context outside the region; the latter cannot be leveraged by the SAM+CLIP cropping baseline. That explains why our method excels.

Q5: Typo mistakes.

R5: Thanks, we will fix all in the revision.

Comment

I've thoroughly reviewed the authors' responses and appreciate their thoughtful engagement. Most of my concerns have been addressed. I will stay in touch for further discussion as we approach the final rating.

Comment

Thank you for confirming that your concerns have been largely addressed. We appreciate your willingness to engage in further discussion and are here to provide any additional information or clarification you may need.

Review (Rating: 6)

The paper proposes a method for open-world object detection that utilises the Segment Anything Model (SAM) to produce region priors and the CLIP model to extract image and language features. The region priors from SAM, which are implicitly encoded in the query tokens, are used in a learnable transformer decoder to perform cross-attention with the image-level features extracted using CLIP. The decoder is trained in a contrastive manner such that the query tokens are matched with the corresponding language embeddings. The proposed method demonstrated strong performance on the challenging LVIS dataset, with less training time.
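
As a concrete illustration of the contrastive matching described above, here is a minimal sketch under assumed shapes; the function name `region_text_alignment_loss` and the InfoNCE-style cross-entropy form are illustrative assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def region_text_alignment_loss(region_tokens, text_embeds, labels, tau=0.07):
    """Sketch of a contrastive region-text objective (assumed form).

    region_tokens: (R, d) region-level tokens from the trained decoder
    text_embeds:   (C, d) frozen CLIP text embeddings, one per category name
    labels:        (R,)   category index of each region
    """
    r = F.normalize(region_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = r @ t.t() / tau            # (R, C) region-to-text similarities
    return F.cross_entropy(logits, labels)
```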

Strengths

  1. The proposed method combines the region priors from the SAM model with the feature extraction capability of the CLIP model, and eliminates the need to train a region proposal network, which is shown to speed up the training process.

  2. As opposed to cropping out image regions that contain objects and extracting region features, the proposed method exploits the implicit region priors from SAM and uses the corresponding query tokens to produce detections. This method computes the image-level features only once and eliminates the repetitive computation caused by overlapping bounding boxes, analogous to the improvement from R-CNN to Fast R-CNN.

Weaknesses

  1. The main advantage of the proposed method seems to be the low training time, which is somewhat less important compared to inference speed. The proposed model employs two foundation models, which will most likely result in very slow inference speed. Yet the paper did not include any details around this.

  2. The paper could benefit from some more insights on what kind of region priors the position-aware tokens encode. For instance, object detection models such as conditional-DETR and DAB-DETR have revealed that the interaction between the queries and the positional embeddings of the image features is key to localising the object. As such, one would expect that the position-aware tokens may have high similarity with the sinusoidal positional embeddings around the regions that contain the object.

Questions

  1. What is the training time in Table 1 measured by?

  2. What is the inference speed of the model?

  3. How would the model compare against SAM itself? As I understand, the segmentation masks could be easily converted into bounding boxes based on the boundary pixels. Furthermore, segmentation is essentially a harder task than object detection. As SAM already has the capability of detecting objects (characterised by masks instead of boxes) using different types of prompts, what is the advantage of the proposed method?

Limitations

The authors did not discuss the inference speed of the proposed model, which I suspect will be a significant issue as it employs two foundation models.

Author Response

We thank the reviewer for their insightful comments.

Q1: The main advantage of the proposed method seems to be the low training time, which is somewhat less important compared to inference speed. The proposed model employs two foundation models, which will most likely result in very slow inference speed. Yet the paper did not include any details around this.

R1: Apologies for the omission of inference speed data. To demonstrate the efficiency of our proposed RegionSpot, we compared the training and inference speeds of RegionSpot with GroundingDINO-T [1] and GLIP-T [2] on the zero-shot object detection benchmark on LVIS val v1.0. We analyzed model performance, training time in GPU hours, and inference latency on the same hardware (an NVIDIA V100 GPU). Since GroundingDINO does not release its training code or LVIS evaluation code, we measured only its inference latency by adapting the provided simple inference script.

As shown in Table 1, compared to GLIP we achieve a 460x speed-up in training time. Additionally, RegionSpot achieves 6.5 FPS (0.15 s/image) on a single V100 during inference on LVIS, including all component processes such as the RPN, SAM, and CLIP. In contrast, GLIP-T and GroundingDINO-T achieve only 0.2 FPS (5 s/image) and 0.14 FPS (7.1 s/image), respectively, due to their visual-text concept alignment through sequential formulation and early fusion. Despite using two foundation models, our inference speed surpasses GLIP and GroundingDINO thanks to: (1) low-resolution inputs for CLIP, (2) parallel region-text token formulation from CLIP, (3) parallel multi-prompt processing in the SAM decoder, (4) a lightweight decoder, and (5) a faster RPN proposal generator. Note that in the paper, for a fair comparison, we used the same proposals as GLIP; in practice, any proposal generator can be used. These additions, prompted by the reviewer's feedback, will clearly improve the paper.

Table 1: Efficiency comparison on LVIS val v1.0.

| Method | Training (GPU Hours) | Inference (FPS) | AP_r |
|---|---|---|---|
| Grounding DINO-T [1] | - | 0.2 | - |
| GLIP-T [2] | 92.1K | 0.14 | 10.1 |
| RegionSpot+RPN | 0.2K | 6.5 | 14.2 |

[1] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV 2024.

[2] GLIP: Grounded Language-Image Pre-training. CVPR 2022.

Q2: The paper could benefit from some more insights on what kind of region priors the position-aware tokens encode. For instance, object detection models such as conditional-DETR and DAB-DETR have revealed that the interaction between the queries and the positional embeddings of the image features is key to localising the object. As such, one would expect that the position-aware tokens may have high similarity with the sinusoidal positional embeddings around the regions that contain the object.

R2: Thank you for these insightful comments. We agree with the reviewer on the importance of understanding what kind of region prior information the position-aware tokens encode. To explore this, we conducted an ablation study analyzing output tokens taken from SAM at various locations. It concluded that taking the output token after the Transformer decoder yields the best performance, as it encodes not only the coordinate position information but also semantic position information. As illustrated in Figure 4 of the main paper, we also visualized the similarity between the position-aware token and the CLIP feature map; as expected, the regions containing objects in the CLIP feature map showed higher similarity. We appreciate the reviewer's suggestion and will include the above discussion in the final version.
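
As an illustration of the Figure 4-style visualization mentioned above, here is a minimal sketch of computing a cosine-similarity heatmap between one (already projected) position-aware token and the CLIP feature map; the shapes and the function name `token_similarity_map` are assumptions, not the authors' visualization code.

```python
import torch
import torch.nn.functional as F

def token_similarity_map(pos_token, clip_feat_map):
    """pos_token:     (d,)      one position-aware token projected into CLIP width
    clip_feat_map: (d, H, W) whole-image CLIP feature map
    Returns an (H, W) cosine-similarity heatmap."""
    d, H, W = clip_feat_map.shape
    feat = F.normalize(clip_feat_map.reshape(d, -1), dim=0)   # unit-norm per spatial location
    tok = F.normalize(pos_token, dim=0)
    return (tok @ feat).reshape(H, W)                         # higher where the object lies
```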

Q3: What is the training time in Table 1 measured by?

R3: We measured the training time in GPU hours using the same V100 hardware. We will further clarify this in the final version.

Q4: How would the model compare against SAM itself? As I understand, the segmentation masks could be easily converted into bounding boxes based on the boundary pixels. Furthermore, segmentation is essentially a harder task than object detection. As SAM already has the capability of detecting objects (characterised by masks instead of boxes) using different types of prompts, what is the advantage of the proposed method?

R4: Thanks for your question. We agree that segmentation is a harder task than object detection. However, SAM only supports simple visual prompts, such as points and boxes, and outputs class-agnostic masks.

To obtain class-aware masks, one way is to pass each segmented region to CLIP for zero-shot region recognition. However, using individually cropped regions loses crucial contextual information, which can hinder recognition performance. There is also often a large gap between the current task (region-level understanding) and the pretraining task (image-level understanding). In our method, instead of cropping regions from an image, we use mask tokens from SAM, which carry strong position-aware information, to find the corresponding semantic details in the ViL feature map, enhancing semantic understanding at the region level. Our method can unleash the power of the pretrained foundation models without the need for training from scratch, as required by previous works [1]. This design not only achieves superior region recognition accuracy but is also more efficient computationally and in training data collection.

[1] RegionCLIP: Region-based Language-Image Pretraining. CVPR 2022.
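
For reference, below is a minimal sketch of the crop-and-classify baseline discussed above (SAM-derived boxes cropped and classified by CLIP in a zero-shot manner). It assumes a CLIP-style interface exposing `encode_image`/`encode_text` plus a `preprocess` transform and `tokenizer`; it is not RegionSpot, and it illustrates how per-region cropping discards the context outside each box.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def crop_clip_baseline(image, boxes, class_texts, clip_model, preprocess, tokenizer):
    """Sketch of the SAM-box -> crop -> CLIP zero-shot baseline (not RegionSpot).

    image:       a PIL image
    boxes:       list of (x0, y0, x1, y1) boxes derived from SAM masks
    class_texts: list of category prompts, e.g. "a photo of a cat"
    Assumes a CLIP-style model with encode_image / encode_text.
    """
    text_emb = F.normalize(clip_model.encode_text(tokenizer(class_texts)), dim=-1)
    preds = []
    for box in boxes:
        crop = preprocess(image.crop(box)).unsqueeze(0)        # context outside the box is lost
        img_emb = F.normalize(clip_model.encode_image(crop), dim=-1)
        preds.append((img_emb @ text_emb.t()).argmax(-1).item())
    return preds
```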

Comment

Dear Reviewer,

Thank you for your time and expertise in reviewing our paper and participating in the rebuttal process. Your feedback has greatly improved our work. We hope our rebuttal has addressed your concerns. If any issues remain, we are ready to discuss them further. Given the review timeline, we would appreciate your prompt review of our revised responses. If our clarifications have resolved your concerns, we kindly request reconsideration of the initial rating. If not, we welcome further discussion. Thank you again for your thoughtful consideration.

Review (Rating: 7)

The paper introduces RegionSpot, a compute-efficient method that combines localization foundation models (such as SAM) with semantic information from a ViL model (such as CLIP). RegionSpot is demonstrated in multiple scenarios and achieves better results than baseline methods while being much faster to train than others.

Strengths

  1. Computational Efficiency: RegionSpot does not require large computational resources to train a region identification model. RegionSpot keeps both foundation models (SAM and CLIP) frozen, focusing optimization efforts solely on a lightweight attention-based knowledge integration module.

  2. Encoding region-level and image-level visual knowledge with text annotations: RegionSpot cleverly utilizes region-level knowledge from SAM together with image-level information from a ViL model (such as CLIP). This allows RegionSpot to capture more contextual information compared to other baseline methods.

  3. The paper presents diverse analysis showing the effectiveness of proposed method, RegionSpot.

Weaknesses

  1. RegionSpot uses SAM. Effectively, RegionSpot could be used for object detection by optimizing SAM properly. In its current version, RegionSpot is restricted to identifying regions given a region proposal or a bounding box.

Questions

  1. Table 1: While I understand the drop in performance when going from GT boxes to SAM proposals and GLIP boxes, I am wondering if there is a way to evaluate the results on the basis of a classification criterion rather than detection. RegionSpot is primarily about naming regions, so penalizing it under a stricter detection metric may not be the right idea.

Limitations

Not explicitly stated.

Author Response

We thank the reviewer for their insightful comments.

Q1: RegionSpot uses SAM. Effectively, RegionSpot could be used for object detection by optimizing SAM properly. RegionSpot is restricted to identifying regions given a region proposal or a bounding box.

R1: Many thanks for these great comments. We highlight the following points:

  1. Our method preserves SAM's flexible prompting capability and pretrained knowledge by keeping SAM frozen instead of fine-tuning it, which saves cost. This allows RegionSpot to maintain interactive region recognition, extract region-specific semantic features, perform open-vocabulary object detection, and effectively segment regions.
  2. Similar to our approach, recent works such as RegionCLIP [1] and DetPro [2] also use external region proposals, as recognizing regions of interest is the shared focus. As demonstrated in ViLD [3], F-VLM [4], and Groma [5], existing region proposal methods are already highly adaptable and directly applicable across different domains. However, recognizing the detected regions presents more challenges and is the bottleneck, which is why we focus on addressing this aspect. We will stress this point further.

[1] RegionCLIP: Region-based Language-Image Pretraining. CVPR 2022.

[2] DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model. CVPR 2022.

[3] ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. ICLR 2022.

[4] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. ICLR 2023.

[5] Groma: Grounded Multimodal Large Language Model with Localized Visual Tokenization. ECCV 2024.

Q2: More metrics to evaluate RegionSpot.

R2: Many thanks. As suggested, we use the classification metric, accuracy, to evaluate RegionSpot. As in the main paper, we use masks to crop the regions. As shown in Table 1, our model outperforms the CLIP baseline by a large margin. We will add this evaluation.

Table 1: Comparison under the accuracy metric; * indicates CLIP fine-tuned with an adapter.

| Proposals | Accuracy |
|---|---|
| CLIP-L↑336 w/ mask_GT | 44.7% |
| CLIP-L↑336* w/ mask_GT | 58.1% |
| RegionSpot-Pro↑336 w/ GT | 68.2% |

Comment

Dear Reviewer,

Thank you for your time and expertise in reviewing our paper and participating in the rebuttal process. Your feedback has greatly improved our work. We hope our rebuttal has addressed your concerns. If any issues remain, we are ready to discuss them further. Given the review timeline, we would appreciate your prompt review of our revised responses. If our clarifications have resolved your concerns, we kindly request reconsideration of the initial rating. If not, we welcome further discussion. Thank you again for your thoughtful consideration.

Comment

Thanks for the rebuttal. It is very helpful.

Comment

We thank the reviewer for recommending acceptance. We appreciate the valuable feedback and will incorporate it into the final version of our paper.

Final Decision

The paper proposes a novel and efficient method for open-world visual region recognition, leveraging the strengths of pre-trained foundation models like SAM and CLIP.
The reviews are positive given the paper's computational efficiency, innovative approach, and strong empirical results, particularly in recognizing rare and challenging object categories.
While some concerns were raised regarding the reliance on pre-trained models and the lack of novel algorithmic contributions, the authors' rebuttal effectively addressed these points, highlighting the practical significance of their approach and its potential for broader impact.
Considering the positive reviews and the authors' thorough responses, I recommend accepting this paper.