PaperHub
ICLR 2025 (withdrawn)

Average rating: 5.3/10 (3 reviewers; min 5, max 6, std. dev. 0.5)
Individual ratings: 6, 5, 5
Confidence: 3.3 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3

Open-Vocabulary Object Detection for Incomparable Spaces

Submitted: 2024-09-27 · Updated: 2024-12-02

Abstract

Keywords
Multimodal learning, object detection

Reviews and Discussion

Review 1 (Rating: 6)

The paper introduces an approach for open-vocabulary object detection (OVDet) that aligns relational structures across visual and textual data to improve detection, especially of unseen or rare objects. The authors propose a model called VOCAL (Vocabulary Alignment Classifier) that shifts from feature fusion to relational alignment, bridging the gap between visual and textual inputs. VOCAL leverages both text descriptions and image examples to identify objects, addressing limitations such as lexical ambiguity, lack of visual specificity, and unknown class names. The evaluation on challenging datasets shows that VOCAL outperforms existing OVDet models and even surpasses fully-supervised detectors in detecting rare objects.
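To make the summary concrete: as I read it, relational alignment matches the intra-modal similarity structures of the two modalities rather than their raw features. A minimal sketch of such a loss (illustrative only; the function name and the MSE penalty are my assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def relational_alignment_loss(text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
    """Align the relational (pairwise-similarity) structure of two modalities.

    text_emb, img_emb: (C, D) per-class embeddings from the text and image encoders.
    """
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    # Intra-modal relational graphs: C x C cosine-similarity matrices.
    rel_t = t @ t.t()
    rel_v = v @ v.t()
    # Penalize disagreement between the two relational graphs.
    return F.mse_loss(rel_v, rel_t)
```

Under this formulation the loss vanishes whenever the two modalities agree on inter-class similarities, even if the raw features differ, which is what distinguishes it from direct feature fusion.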

Strengths

  1. The paper presents a sophisticated method for OVDet that focuses on relational alignment between visual and textual data, which is a novel approach in the field.

  2. VOCAL demonstrates superior performance in detecting rare objects and outperforms existing OVDet models, which is a significant achievement.

  3. The evaluation further shows that VOCAL can surpass even fully-supervised detectors on rare objects, setting a new benchmark on challenging datasets.

Weaknesses

  1. The approach may be more complex and computationally intensive than simpler fusion methods, which could be a limitation in resource-constrained environments.

  2. The introduction of the Image and Text Encoders makes the detection process more computationally expensive, so the fairness of comparisons with other OVDet methods needs to be considered.

  3. Some related OVDet methods are missing from the comparison, e.g., "Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection" (ICCV 2023).

Questions

See Weaknesses. My major concern is that the method introduces considerably more complexity than previous approaches.

Review 2 (Rating: 5)

This paper introduces a novel open-vocabulary detection method that utilizes both textual and visual classifiers, integrating them through feature-level alignment and relational alignment. The authors conduct experiments on LVIS to demonstrate its performance on novel categories.

Strengths

  1. The use of feature-level alignment and relational alignment to fuse textual and visual classifiers is very interesting.
  2. The weighted contextual embeddings and prototype discovery improve the construction of the textual and visual classifiers, respectively.
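For context on point 2: the baseline this review later asks to compare against is simple averaging of prompt embeddings. A context-weighted alternative could look like the following sketch (function name, softmax weighting, and temperature are my illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def weighted_contextual_embedding(prompt_embs: torch.Tensor,
                                  context_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Combine multiple prompt embeddings for one class.

    prompt_embs: (P, D) embeddings of P prompt templates.
    context_emb: (D,) embedding summarizing the detection context.
    Each prompt is weighted by its similarity to the context instead of uniformly.
    """
    sims = F.normalize(prompt_embs, dim=-1) @ F.normalize(context_emb, dim=0)
    weights = torch.softmax(sims / temperature, dim=0)       # (P,)
    return (weights.unsqueeze(-1) * prompt_embs).sum(dim=0)  # (D,)
```

Compared with a plain mean, prompts that better match the context dominate the classifier embedding; when all prompts are identical, the two schemes coincide.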

Weaknesses

  1. The method’s pipeline is similar to MMOVD[1], with only minor improvements made to the construction and fusion of classifiers. Overall, the novelty might be quite limited.
  2. The method is similar to MMOVD, but lacks critical experiments comparing it with MMOVD, such as evaluations using IN-LVIS as extra data on the LVIS dataset, and MMOVD’s evaluations on cross-dataset transfer detection.
  3. There are missing experiments that prove the effectiveness of the method. 1) Lack of experiments demonstrating that weighted contextual embeddings improve the performance of a text-based classifier compared to simply averaging; 2) Lack of experiments showing that using feature-level alignment and relational alignment is more effective compared to naive fusion strategies like addition.
  4. The comparison experiments between V-CLS and V-Mean are not reasonable. V-CLS, compared to V-Mean, uses both the prototype discovery strategy and additional transformer blocks as the Visual Aggregator. This setup does not validate the effectiveness of the prototype discovery strategy. According to MMOVD[1], using a Visual Aggregator already performs better than directly averaging various visual embeddings. V-CLS should be compared with a Visual Aggregator that does not use the prototype discovery strategy.
  5. There is a lack of hyperparameter analysis for λ and α.
  6. Results of open vocabulary object detection evaluations on the COCO dataset are missing.

[1] Multi-Modal Classifiers for Open-Vocabulary Object Detection, ICML 2023

Questions

  1. Figure 2 attempts to demonstrate the model’s effectiveness in detecting rare categories, but the examples provided belong to either the frequent or common categories, which does not prove the model’s capability in detecting rare categories. For instance, ‘knife’, ‘skateboard’, ‘belt’, ‘pillow’, and ‘bicycle’ are all frequent categories, while ‘rhinoceros’, ‘goose’, ‘kiwi’, and ‘gull’ belong to common categories.
  2. Please refer to the weakness section.

Review 3 (Rating: 5)

This paper addresses the challenges of open-vocabulary object detection (OVDet), where the goal is to detect objects at inference time that were not seen during training. The authors propose an approach called VOCAL (Vocabulary Alignment Classifier), which integrates visual and textual embeddings by aligning both feature-level and relational structures across these two modalities. This method aims to bridge the gap between visual and textual data, enabling robust detection even when input from one modality is weak or ambiguous.

Strengths

  1. It combines textual descriptions and visual examples to identify objects, leveraging the strengths of both modalities to improve detection accuracy.
  2. Instead of simple feature fusion, VOCAL focuses on aligning the contextual relationships between objects in text and images, which is a novel way to handle the misalignment problem in heterogeneous data. The model can adapt to new categories or unseen objects without retraining, which is a significant advantage in dynamic environments where new objects frequently appear.
  3. The evaluation shows that the model outperforms existing OVDet models, setting new benchmarks in detecting rare objects.

Weaknesses

  1. The method involves complex alignment mechanisms that could be computationally expensive and may require substantial resources for training and inference.
  2. The performance of VOCAL heavily relies on the quality of the text and image embeddings. If the embeddings are not representative, the alignment may not be effective.
  3. While the model can adapt to new categories, the scalability to a very large number of categories or extremely rare objects is not explicitly discussed and could be a challenge.
  4. Although the paper mentions cross-dataset transfer, the generalization of the model to datasets outside of the trained domain is a potential concern that may require further validation.

Questions

Please see the weaknesses above.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.