Rating: 5.0 / 10 (Poster · 3 reviewers)
Individual ratings: 5, 6, 4 (min 4, max 6, std 0.8)
Confidence: 4.0
Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
NeurIPS 2024

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords

video instance segmentation, video object captioning, open-world video instance segmentation

Reviews and Discussion

Review (Rating: 5)

The authors propose a new task called Open-World Video Instance Segmentation and Captioning (OW-VISCap), which involves detecting, segmenting, tracking, and describing both seen and unseen objects. To tackle this problem, they introduce two networks: an object abstractor for encoding images at the object level and an object-to-text abstractor for generating captions for objects. In the object abstractor, they use evenly distributed points as input to obtain open-world object queries, in addition to closed-world object queries, to handle unseen objects. The object-to-text abstractor is composed of standard transformer blocks, but it employs masked attention in the cross-attention layer to focus on objects. The proposed method demonstrates superior performance in Open-World Video Instance Segmentation (OW-VIS) and Dense Video Object Captioning (DVOC) compared to existing approaches.

Strengths

  • The authors propose a new task and corresponding solution for video understanding.
  • They provide a thorough analysis of existing methods in the fields of OW-VIS and DVOC.
  • Overall, the paper is well-organized and easy to read.

Weaknesses

Despite this being the first method proposed for this task, the technical contribution of the approach seems weak. Moreover, the evaluation and analysis appear lacking in several areas:

  1. Evaluation on Limited Benchmarks: The proposed method has been evaluated on only one benchmark each for OW-VIS and DVOC. To verify its open-world capability, additional benchmarks are needed. Other works like OV2Seg [1] and DVIS++ [2] use a variety of datasets, including LV-VIS, YouTube-VIS, and OVIS, in addition to BURST.

  2. Experimental Evidence for Free-Form Captions: In lines 33-37, the use of discrete class labels in current OW-VIS methods is cited as an issue. However, there is no experimental evidence showing that using free-form captions improves OW-VIS performance.

  3. Performance in Table 5: The basic VIS performance appears to be lacking. Recent VIS methods [2-5] achieve over 35 AP on the OVIS benchmark with a ResNet-50 backbone. Even considering the open-world setting, the proposed method's performance gap compared to the latest works in a closed-world setting is too large.

References

[1] Wang, Haochen, et al. "Towards open-vocabulary video instance segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Zhang, Tao, et al. "DVIS++: Improved decoupled framework for universal video segmentation." arXiv preprint arXiv:2312.13305 (2023).

[3] Heo, Miran, et al. "A generalized framework for video instance segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[4] Li, Junlong, et al. "TCOVIS: Temporally consistent online video instance segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[5] Ying, Kaining, et al. "CTVIS: Consistent training for online video instance segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Questions

Please refer to the Weaknesses section.

Limitations

They discuss limitations of the proposed method in the Appendix.

Author Response

Despite being the first proposed method for this task, the technical contribution seems weak.

Please see the “Technical Novelty” section in the comment addressed to all reviewers.

1. Evaluation on Limited Benchmarks ... Other works like OV2Seg [1] and DVIS++ [2] use a variety of datasets, including LV-VIS, YouTube-VIS, and OVIS, in addition to BURST.

We would like to point out that, different from our “open-world” method, OV2Seg and DVIS++ are “open-vocabulary” methods. These methods assume a finite set of novel categories during evaluation. Each of these object categories is encoded to obtain text embeddings which are then used to obtain probability scores for open vocabulary classification. In our open-world setting, a finite set of novel categories that we can individually encode doesn’t exist. Our goal is not to classify novel objects, but to generate free-form captions. Testing our approach in the open-vocabulary setting is hence not meaningful.

We have evaluated our model on the OVIS dataset for the closed-world task (see Appendix, Tab. 5).

2. ... There is no experimental evidence showing that using free-form captions improves OW-VIS performance.

Our goal is not to use the generated captions to improve OW-VIS performance. The goal is to describe discovered objects more holistically via free-form captions, instead of just one-word labels for closed-world objects or no labels for open-world objects.

3. The basic VIS performance appears to be lacking. ... performance gap compared to the latest works in a closed-world setting is too large.

Please see point (c) in the “Experimental Performance” section of the comment addressed to all reviewers.

Comment

Dear Reviewer W8BT, as the NeurIPS AC, I would like to remind you to check and read the rebuttal provided by the authors. After that, please respond to or acknowledge the authors' rebuttal and update your rating if necessary. Many thanks.

Comment

Thank you for the responses and for pointing out my misunderstanding regarding the terms "open-world" and "open-vocabulary." Based on the feedback, I think this work is valuable as an initial study on the proposed problem. I will raise my score to a BA.

Comment

Thank you for increasing the score! Your feedback is encouraging! We appreciate your support as we think this novel task is valuable and important for our community. We hope this first step in addressing this task will spur more research in this direction by others too.

Review (Rating: 6)

This paper proposes a new task and a corresponding model: detecting, tracking, segmenting, and captioning open-vocabulary objects in a video. The authors propose a novel online framework that contains an object detector and feature extractor (object abstractor), a Mask2Former-style segmentation head, and a frozen LLM. The resulting model is able to detect objects both seen and unseen in the training data and to produce captions. Experiments on open-vocabulary video instance segmentation and dense video object captioning show that the proposed model outperforms the corresponding SOTA or baselines.

Strengths

  • This paper proposes an important and missing new task: segmenting and captioning all objects in videos. I believe this is one of the ultimate understanding tasks for video; the paper defines the task and composes SOTA components into a novel model for it.

  • I appreciate the work the authors put into composing a model for such a complex task. The design choices of using a Mask2Former architecture and a frozen LLM make sense to me.

  • Experimental results are reasonable. They show that the proposed method performs better than other models on unseen objects and produces better captions than alternative stage-wise models or a concurrent specialist model. The components are ablated in Table 2 and Table 3.

Weaknesses

  • From Figure 3, it is unclear how "video" is handled. Do you feed object queries from the previous frame to a new frame to retain their identity? If so, would the object caption differ across frames for the same identity?

  • While this paper proposes a new task, the task is evaluated separately on two sub-tasks on different datasets (OW-VIS and Dense VOC), and thus the overall task is never evaluated. I understand that curating evaluation data for this (annotation-heavy) task is expensive, but having some data, even a small amount, would help follow-up works in this direction. Note this is NOT required in the rebuttal.

  • The layout of the paper can be improved. E.g., Table 1 is on page 6 but is first referred to in the text on page 8. The paper mentions that a frozen LLM is used, but only at the very end of the supplementary do the authors reveal which LLM: OPT-2.7B. I think the choice of LLM is important model information and should be mentioned in the main text. It is also hard to find the exact training data used to train the model.

Questions

Overall, this paper proposes a valid and novel model for a new, important task, with good comparisons to existing methods. My concerns are mostly about clarification or presentation, and I hope the authors can clarify/improve them in the rebuttal. My current rating is a weak accept.

Limitations

Yes.

Author Response

From Figure 3, it is unclear how "video" is handled. ..., would the object caption in different frames be different for the same identity?

Yes, we feed object queries in the previous frame to new frames as proposed in CAROQ. We explain this in Appendix C. Object captions may differ in different frames. We think that this setting provides the flexibility to handle multiple action segments of the same object within a video. However, we noticed that if the video frames don’t change much, captions remain mostly consistent.

The task is evaluated separately on OW-VIS and Dense VOC, the overall task is never evaluated. ... NOT required in the rebuttal.

Thanks for the suggestion, we completely agree. However, overall task evaluation requires a lot of effort as you rightly pointed out. Our future efforts include the development of such a dataset.

The layout of the paper can be improved. ... hard to find the exact training data to train the model.

Thanks for the suggestion! We will revise the paper accordingly. We obtain our training data following prior work. We fine-tune our models on the BURST and the Dense VOC datasets. The procedure to obtain the BURST data is straightforward, following their official website. For the Dense VOC data, we follow the steps mentioned in DVOC-DS.

A valid and novel model for a new important task, with good comparisons to existing methods...

We are thrilled that you found our work novel and important. Thanks again for the valuable suggestions! We will revise our paper based on these comments.

Comment

I thank the authors for the rebuttal. My confusion about object identities is resolved. I thus keep my positive rating.

I respectfully disagree with the other reviewers' concerns about novelty. To me, proposing a working system for video object captioning (with segmentation) is a good enough contribution, as this is important and missing from the market.

Comment

Thank you very much! We are thrilled that you found our contributions novel and important. We hope this first step in addressing this task will spur more research in this direction by others as well.

Comment

Dear Reviewer 9nUt, as the NeurIPS AC, I would like to remind you to check and read the rebuttal provided by the authors. After that, please respond to or acknowledge the authors' rebuttal and update your rating if necessary. Many thanks.

Review (Rating: 4)

This paper introduces a task called “open-world video instance segmentation and captioning”, which combines open-world video instance segmentation (OW-VIS) and video object captioning tasks. To achieve better performance, the authors propose two key components: an object abstractor to identify new objects using a prompt encoding, and an object-to-text abstractor that bridges object queries with a frozen large language model (LLM) to generate captions.

Strengths

  1. The object-to-text abstractor represents a strategic integration of visual data processing with an LLM, which enhances the richness and accuracy of the generated captions.
  2. The paper is well-written and easy to follow.

Weaknesses

**Method**

  1. The introduced task is a combination of existing tasks (i.e., OW-VIS and video object captioning), which is not fundamental and novel enough.

  2. For the first challenge (Lines 33-37), some existing vision-language models (VLMs) also focus on free-form captions for instance segmentation (i.e., open-vocabulary segmentation) and object-oriented captioning (e.g., [1]). [1] Dense Video Object Captioning from Disjoint Supervision, 2023.

  3. The technical novelty of the proposed method seems limited. Specifically,

a) the proposed object abstractor is based on the framework of an existing video instance segmentation model with only a slight modification (i.e., an inter-query contrastive loss).

b) the proposed object-to-text abstractor is essentially a captioning model that focuses on describing the objects in the input videos.

**Experiment**

  1. The experiments can be divided into two parts: video instance segmentation and object-oriented captioning. For segmentation, it would be advantageous to utilize more widely used validation sets such as YouTube-VIS. For captioning, besides METEOR, incorporating additional standard captioning metrics like CIDEr and ROUGE would enhance the evaluation.

  2. The improvements are slight.

a) In Table 1, the proposed method does not achieve SoTA results on more than half of the evaluation metrics compared to baselines, which does not convincingly demonstrate its performance.

b) In Table 3, why do the DetA (56.1) and AssA (54.0) metric values remain unchanged, regardless of the addition or removal of components in the ablation study?

Questions

Please refer to the Weaknesses.

Limitations

This paper discusses the limitations in Section G of the supplementary.

Author Response

1. The introduced task ... not fundamental and novel enough.

We think that OW-VISCap is a novel task that identifies an important gap in existing literature. We agree with reviewer 9nUt: OW-VISCap is “one of the ultimate understanding tasks for video”. Fine-grained object captioning in videos is crucial, especially in an open-world setting. However, existing works like DVOC-DS only focus on closed-world objects and on bounding box-based captioning of only a single action segment per object in a video.

2. ... (Lines 33-37), some existing VLMs also focus on free-form captions (i.e., open-vocabulary segmentation) and object-oriented captioning (e.g., [1]).

Thanks for pointing out the ambiguity in lines 35-37. We will reword these lines in the revised version. However, please note that open-vocabulary segmentation methods need to encode all object categories (seen or unseen) to obtain text embeddings first, which are used to obtain probability scores for the final predicted masks. In our open-world setting, we don’t need to use predefined object categories. Instead, we directly generate free-form captions.

Also, note that [1] operates in a closed-world setting. We have carefully pointed out the differences between [1] and our method in Sec 2.3 of the main paper and elaborated on the differences further in Appendix A.1.

3. The technical novelty ... seems limited.

Please see the “Technical Novelty” section in our comment addressed to all reviewers.

Experiment

1. Utilize widely used validation sets like YouTube-VIS. For captioning, besides METEOR, ... CIDEr and ROUGE would enhance the evaluation.

For the closed-world, we have reported results on the OVIS dataset (Tab. 5 of the Appendix), since it is more diverse and challenging than the YouTube-VIS dataset, and since it is also widely used these days. Please also see the “Experimental Performance” section in our comment which addresses all reviewers.

Our CIDEr and ROUGE-1 scores for OW-VISCapTor+CAROQ (second-to-last row in Tab. 1) are 1.03 and 0.54, respectively. However, please note that the other method we compare to (DVOC-DS) only reports METEOR. Hence, we cannot compare with this method using these additional captioning metrics.

2. The improvements are slight. a) In Table 1, ... does not achieve SoTA on more than half of the evaluation metrics. b) In Table 3, why do the DetA and AssA remain unchanged?

a) Please see the “Experimental Performance” section in the comment addressed to all reviewers.

b) Tab. 3 shows how different configurations affect the captioning performance. Detection and association (DetA and AssA) are performed using exactly the same object abstractor.

Comment

Thanks to the authors for their response. After reading the comments from other reviewers and the responses, I am still concerned that 1) the task is more like a combination of segmentation and captioning in video; 2) the technical novelty of each proposed part is limited, as the method composes existing components into a model for the task; and 3) the results (Table 1) do not convincingly demonstrate the effectiveness of the proposed method. If the unified framework does not surpass the performance of individual models on their respective tasks, it raises the question: why not develop a specialized model for each task instead? Thus, I will maintain my initial score.

Comment
  1. We respectfully disagree with the reviewer. Our task requires holistic object understanding: the object queries must be expressive enough not only for segmentation (closed- and open-world) but also for generating meaningful object-centric captions. This is an important step towards generalized scene understanding. There are existing generalized tasks in the literature that combine two or more specific tasks (e.g., Multi-Object Tracking and Segmentation and Video Instance Segmentation combine image segmentation and multi-object tracking), yet they play a critical role in holistic video understanding. In this work, we aim for more generalizability.

  2. Our individual technical contributions (open-world object queries, masked attention for object-centric captioning, and inter-query contrastive loss) together are effective in forming a generalized method for our novel and important OW-VISCap task. We think that sharing these contributions with the community will accelerate research in object understanding in Vision Language Models.

  3. Our unified framework surpasses the performance of individual models on the tasks we care about (open-world performance and captioning ability) as shown in Table 1. While specialized models are great for closed-world segmentation, in this work, we are not interested in developing specialized models because that doesn’t encourage a holistic understanding of objects in videos. In this work, we care about generalizability.

Comment

We hope the above reply helped answer your questions and clarify our rationale. Feel free to reach out with additional questions, which we are happy to answer. Thanks a lot for your time and consideration!

Comment

Dear Reviewer zWxZ, as the NeurIPS AC, I would like to remind you to check and read the rebuttal provided by the authors. After that, please respond to or acknowledge the authors' rebuttal and update your rating if necessary. Many thanks.

Author Response

We thank all reviewers for their helpful feedback. We are thrilled that they find our work novel (Reviewer 9nUt), thoroughly analyzed (Reviewer W8BT), and well-written (Reviewer zWxZ). Reviewers zWxZ and W8BT have raised questions about the technical novelty and experimental performance of our method, which we address below.

Technical Novelty:

We think that we identified an important problem that is of interest to our community, and we developed a simple yet effective baseline solution. Our main technical contributions are twofold:

a) Our object abstractor is designed to handle open-world objects. As Reviewer zWxZ rightly pointed out, our object abstractor builds upon existing closed-world object abstractors, although we also introduce an inter-query contrastive loss. That said, our main contribution is the development of open-world object queries via a prompt encoder. The developed technique improves upon the prior SOTA, EntitySeg+DEVA, by 5.6 points in open-world tracking accuracy (OWTA) for unseen objects on the BURST dataset, as summarized in Tab. 1.
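For concreteness, a minimal sketch of this idea, i.e., forming open-world object queries from a regular grid of point prompts and concatenating them with learned closed-world queries. This is not our actual implementation; the grid size, embedding dimension, and the linear prompt encoder below are placeholder assumptions for illustration only.

```python
# Illustrative sketch only (not the paper's implementation): open-world object
# queries built from a regular grid of point prompts, concatenated with
# learned closed-world queries before the transformer decoder.
import torch
import torch.nn as nn

class ObjectQueries(nn.Module):
    def __init__(self, dim=256, grid_size=8, num_closed=100):
        super().__init__()
        # Learned closed-world queries, as in Mask2Former-style decoders.
        self.closed_queries = nn.Embedding(num_closed, dim)
        # A simple prompt encoder for (x, y) point coordinates; the actual
        # encoder may differ (assumption).
        self.prompt_encoder = nn.Linear(2, dim)
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, grid_size),
            torch.linspace(0, 1, grid_size),
            indexing="ij",
        )
        # Evenly distributed points covering the frame.
        self.register_buffer("points", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, batch_size):
        closed = self.closed_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
        open_world = self.prompt_encoder(self.points).unsqueeze(0).expand(batch_size, -1, -1)
        # (B, num_closed + grid_size**2, dim) queries fed to the decoder.
        return torch.cat([closed, open_world], dim=1)
```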

b) We introduce an important component, masked cross-attention, in the object-to-text abstractor. It effectively allows us to perform fine-grained object-level captioning. The design allows the model to focus on the objects without losing the overall image context, and it significantly outperforms naive baselines, as demonstrated in Tab. 3. As far as we know, masked attention has not been used for object-level captioning before. We think this is an important contribution.
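For concreteness, a minimal sketch of masked cross-attention for object-centric captioning. Again, this is not our actual implementation; the projection layers are omitted and the log-mask bias on the attention logits is a placeholder assumption used only to illustrate the idea of steering text queries toward an object's mask.

```python
# Illustrative sketch only: masked cross-attention for object-level captioning.
import torch

def masked_cross_attention(text_queries, frame_features, object_mask, num_heads=8):
    """
    text_queries:   (B, Nq, D) caption queries for one object
    frame_features: (B, HW, D) flattened frame features
    object_mask:    (B, HW)    soft mask of the object in [0, 1]
    """
    B, Nq, D = text_queries.shape
    d = D // num_heads
    q = text_queries.view(B, Nq, num_heads, d).transpose(1, 2)      # (B, H, Nq, d)
    k = frame_features.view(B, -1, num_heads, d).transpose(1, 2)    # (B, H, HW, d)
    v = k
    logits = (q @ k.transpose(-2, -1)) * d ** -0.5                   # (B, H, Nq, HW)
    # Pixels outside the object get a large negative bias, steering attention
    # toward the object; a soft mask keeps some surrounding context.
    logits = logits + torch.log(object_mask.clamp(min=1e-6))[:, None, None, :]
    attn = logits.softmax(dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, Nq, D)
```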

Experimental Performance:

We would like to highlight three points:

a) Our work specifically addresses two major problems: open-world object discovery and object-level captioning. Hence, we are mainly interested in the metrics that specifically cater to these aspects (highlighted in blue in Table 1): OWTA (unseen) and CapA. VIS is not the focus of this work. We obtain the best results on these metrics: 5.6 points higher OWTA (unseen) on the BURST dataset and 7.1 points higher CapA on the Dense VOC dataset. We are also the overall SOTA on the Dense VOC dataset, improving upon the prior SOTA, DVOC-DS, by 1.5 points. Our VIS performance is similar to the closed-world baselines we build upon, i.e., Mask2Former and CAROQ, as seen from the results on the OVIS data (Tab. 5 in the Appendix).

b) Generalizability: We want to highlight the generalizability of our work. Our approach simultaneously segments, tracks, and captions both never-before-seen and previously seen object categories in videos; previous specialized methods for dense video object captioning (Dense VOC) and open-world VIS (OW-VIS) cannot achieve this. DVOC-DS, the previous SOTA for Dense VOC, cannot handle never-before-seen objects (open-world) or multiple action segments for the same object. Although DVOC-DS achieves better closed-world detection and association accuracies (DetA and AssA), we significantly improve the captioning accuracy (CapA) on the Dense VOC task by 7.1 points. We are the SOTA on the overall metric, captioning higher order tracking accuracy (CHOTA), while also being able to operate in an open-world setting. While we perform slightly worse than the OW-VIS SOTA on seen objects on the BURST dataset, we obtain 5.6 points higher open-world tracking accuracy (OWTA) on unseen objects. Additionally, we can generate free-form captions for predicted objects, which the OW-VIS SOTA cannot.

c) Stronger baselines for closed-world: Our contributions (open-world object queries, an object-to-text abstractor augmented with masked attention, and an inter-query contrastive loss) can be integrated with any VIS pipeline. Integrating our method with a stronger VIS pipeline would lead to stronger VIS performance. We currently build on top of Mask2Former and CAROQ; hence our VIS performance is similar to Mask2Former and CAROQ, as seen from the results on the OVIS dataset (Tab. 5 in the Appendix) using the ResNet-50 backbone. As mentioned in point (a), VIS is not the focus of our work; we focus on the orthogonal components: open-world object discovery and object captioning.

Final Decision

This submission proposes a new task and a corresponding model: detecting, tracking, segmenting, and captioning open-vocabulary objects in a video. This is quite a challenging task and requires an ML model with multiple capabilities. The submission received mixed but generally positive reviews. Most of the concerns were addressed during the rebuttal stage, except that Reviewer zWxZ still had some concerns about the technical novelty and the experimental results. The authors provided an extensive explanation of these concerns, and the AC agreed with the authors that the claim that the framework surpasses the performance of individual models on open-world performance and captioning ability is valid. However, the AC also suggests that the authors include more discussion of the Experimental Performance in the final version.