One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Abstract
We introduce VideoLISA, a video-based multimodal large language model designed to address the challenges of language-instructed reasoning segmentation in videos.
Reviews and Discussion
This work introduces a Sparse Dense Sampling strategy and a One-Token-Seg-All approach to enhance the temporal ability of LISA. In the Sparse Dense Sampling strategy, dense tokens are preserved for some frames while sparse tokens are extracted from the interleaved frames. In the One-Token-Seg-All approach, the work exploits the fact that a single embedding can represent an object across multiple frames. Additionally, this work contributes a reasoning VOS benchmark.
Strengths
This work proposes a benchmark for reasoning video segmentation, including 458 video-instruction samples.
Weaknesses
- This work utilizes two visual encoders, which is redundant and will affect the speed of the model.
- The model fails to segment multiple objects at the same time.
Questions
- What are the results when replacing SAM with other kernel-based segmentation and tracking models?
- Compared with Q-Former [1] and Q-Former-like architectures [2], what are the advantages of the Sparse Dense Sampling strategy?
[1] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Learning. PMLR, 2023: 19730-19742.
[2] Li Y, Wang C, Jia J. LLaMA-VID: An image is worth 2 tokens in large language models[J]. arXiv preprint arXiv:2311.17043, 2023.
Limitations
- This work utilizes two visual encoders, which is redundant and will affect the speed of the model.
- The model fails to segment multiple objects at the same time.
We thank the reviewer for reviewing our paper.
Q1: This work utilizes two vision encoders, which will influence the speed of the model.
Response:
- We thank the reviewer for pointing out this valuable concern. In fact, the two vision encoders in the model are both necessary, as each serves a distinct role. The CLIP-based vision encoder (termed the Visual Tokenizer in the paper), being pre-trained on image-text pairs, is responsible for semantic feature extraction, which suits multimodal understanding. The SAM-based vision encoder, on the other hand, provides low-level vision features and is specifically designed for producing segmentation masks; it is also adopted by a wide range of existing works, e.g., LISA.
- At the time we implemented VideoLISA, to the best of our knowledge, the best models for these two types of features were still the expert models for each task, i.e., CLIP for semantic features and SAM for low-level features. In the future, if a model emerges that unifies these features, we will consider replacing the two encoders with it to improve speed. We will discuss this problem in the revised paper.
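For concreteness, a minimal PyTorch-style sketch of the two-path design is given below; the module names and interfaces are illustrative assumptions rather than our exact implementation.

```python
import torch.nn as nn

class DualEncoderSketch(nn.Module):
    """Illustrative dual-encoder wiring; all sub-modules are passed in."""

    def __init__(self, clip_encoder, projector, llm, sam_encoder, mask_decoder):
        super().__init__()
        self.clip_encoder = clip_encoder  # semantic features (image-text pre-trained)
        self.projector = projector        # maps CLIP features into the LLM token space
        self.llm = llm                    # multimodal large language model
        self.sam_encoder = sam_encoder    # low-level features for segmentation
        self.mask_decoder = mask_decoder  # SAM-style promptable decoder

    def forward(self, frames, text_ids):
        # Understanding path: semantic visual tokens plus text go through the LLM,
        # which produces an embedding for the segmentation token (e.g., [TRK]).
        vis_tokens = self.projector(self.clip_encoder(frames))
        prompt_embedding = self.llm(text_ids, visual_tokens=vis_tokens)

        # Segmentation path: low-level SAM features are decoded into masks,
        # prompted by the embedding produced above.
        sam_feats = self.sam_encoder(frames)
        return self.mask_decoder(sam_feats, prompt=prompt_embedding)
```

The sketch makes the division of labor explicit: removing either encoder would deprive either the LLM of semantic tokens or the mask decoder of low-level features.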
Q2: It seems like the model has difficulty segmenting multiple objects at the same time.
Response:
- Firstly, we would like to emphasize that single object segmentation is the standard and most popular setting in the general field of language-guided object segmentation. In referring image/video segmentation [1,2,3] tasks and the reasoning image segmentation task [4], typically a text query describes a specific object in a straightforward or implicit way. The model is tasked to segment the object based on the given text description.
- Our work follows this widely adopted setting and democratizes reasoning segmentation to videos. In our paper, we did not make any claim regarding multi-object capability. We thank the reviewer for the valuable suggestion; we regard this as meaningful future work and will discuss the possibility in the revised paper.
Q3: Curious about the results when replacing SAM with other kernel-based segmentation and tracking models.
Response: We thank the reviewer for raising this interesting question. In fact, our One-Token-Seg-All approach exactly follows the philosophy of kernel-based tracking, as discussed in lines 185-188 of our paper. The single [TRK] token is trained to serve as the semantic kernel, while the visual features are the context against which it is contrasted.
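As a schematic (the function and argument names below are assumptions for illustration, not the exact decoder interface), the same [TRK] embedding prompts the mask decoder on every frame:

```python
def one_token_seg_all(trk_embedding, frame_features, mask_decoder):
    """Sketch of One-Token-Seg-All: a single embedding prompts every frame.

    trk_embedding  : hidden state of the [TRK] token predicted by the LLM
    frame_features : per-frame low-level features from the SAM-style encoder
    mask_decoder   : promptable mask decoder (assumed interface)
    """
    masks = []
    for feats in frame_features:
        # The [TRK] embedding acts as the semantic kernel; each frame's
        # features are the context it is contrasted against.
        masks.append(mask_decoder(feats, prompt_embedding=trk_embedding))
    return masks
```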
Following the reviewer's suggestion, we explore other kernel-based segmentation and tracking models. Specifically, we adopt XMem [5], a popular tracking model equipped with a memory system. The standard XMem model accepts the segmentation mask of the first frame, builds its memory, and infers the masks of subsequent frames; the first-frame memory can be regarded as a semantic kernel for reference purposes.
In our experiment, we integrate this model with LISA [4] and VideoLISA. The results are reported in Table 4 of the rebuttal PDF.
- Our VideoLISA is compatible with XMem and shows remarkable performance on public benchmarks.
- When comparing LISA and VideoLISA under the same XMem setting, VideoLISA outperforms LISA by a noticeable margin.
- When comparing XMem and our One-Token-Seg-All, our method still shows superior performance, validating the effectiveness of the proposed method.
Q4: What are the advantages of the Sparse Dense Sampling strategy compared with Q-Former and Q-Former-like architectures (e.g., LLaMA-VID [6])?
Response: Q-Former and Q-Former-like architectures (e.g., LLaMA-VID [6]) extract highly abstract semantic visual features through cross-attention, which significantly reduces computational overhead, especially for video data. However, while such highly abstract features may suffice for general video understanding tasks such as VQA, they are not sufficient for segmentation, as validated by our experiments. In contrast, our Sparse Dense Sampling achieves a delicate balance between preserving visual details and temporal context, making it favorable for video object segmentation tasks.
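For intuition, here is a minimal sketch of a sparse-dense token construction in PyTorch; the tensor shapes, the use of average pooling, and the `dense_stride` split are illustrative assumptions rather than the exact configuration in the paper.

```python
import torch

def sparse_dense_tokens(frame_tokens, dense_stride=8):
    """Illustrative sparse-dense token construction.

    frame_tokens : (T, N, D) per-frame visual tokens from the vision encoder.
    Frames at multiples of `dense_stride` keep all N spatial tokens (detail);
    the remaining frames are average-pooled to one token each (temporal context).
    Returns a flat (L, D) token sequence to feed the LLM.
    """
    out = []
    for t in range(frame_tokens.shape[0]):
        if t % dense_stride == 0:
            out.append(frame_tokens[t])                            # (N, D) dense
        else:
            out.append(frame_tokens[t].mean(dim=0, keepdim=True))  # (1, D) pooled
    return torch.cat(out, dim=0)
```

The key design point is that a subset of frames retains full spatial resolution for segmentation, while the pooled tokens keep the LLM aware of the surrounding temporal context at low cost.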
We conduct a comparison experiment between LLaMA-VID [6] and VideoLISA, as reported in Table 3 of the rebuttal PDF. The results show that our proposed strategy achieves remarkable performance on video segmentation tasks, surpassing the Q-Former-style architecture.
We sincerely hope the rebuttal answers your questions and addresses your concerns.
[1] Kazemzadeh S, Ordonez V, Matten M, et al. ReferItGame: Referring to objects in photographs of natural scenes[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 787-798.
[2] Seo S, Lee J Y, Han B. URVOS: Unified referring video object segmentation network with a large-scale benchmark[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer International Publishing, 2020: 208-223.
[3] Ding H, Liu C, He S, et al. MeViS: A large-scale benchmark for video segmentation with motion expressions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 2694-2703.
[4] Lai X, Tian Z, Chen Y, et al. LISA: Reasoning segmentation via large language model[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 9579-9589.
[5] Cheng H K, Schwing A G. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 640-658.
[6] Li Y, Wang C, Jia J. LLaMA-VID: An image is worth 2 tokens in large language models[J]. arXiv preprint arXiv:2311.17043, 2023.
Thanks to the authors for the rebuttal, which has completely resolved my doubts. However, as other reviewers mentioned, after training for reasoning segmentation the model loses its text-generation ability, which should be noted.
Dear Reviewer oRTn,
Thank you very much for your positive feedback! We are glad to hear that our rebuttal has completely resolved your concerns!
Regarding the minor limitation you mentioned, as discussed in our rebuttal, degraded text-generation capability is a common problem shared by many models (e.g., LISA). Maintaining the two functions in one model is a meaningful yet independent research problem, involving data curation, training strategy design, etc. We are glad to include this discussion in our revised paper to inspire future work.
That said, we believe this common limitation does not affect our contributions, as VideoLISA focuses on processing videos, which is orthogonal to preserving text-generation capability. We thank the reviewer again for the valuable suggestions, which are very helpful in improving our work.
This work proposes VideoLISA, a video-based multimodal large language model that addresses the challenges of language-instructed reasoning segmentation in videos, leveraging various strategies to enhance temporal understanding and consistent object tracking, and showing promising generalization capabilities. The proposed Sparse Dense Sampling strategy is able to reduce computational costs, and a specially designed TRK token can segment and track the object across frames. They also establish a benchmark, demonstrating VideoLISA's performance.
Strengths
- VideoLISA is the first video-LLM to democratize reasoning segmentation to videos, generating temporally consistent segmentation masks based on language instructions.
- The experiments evaluate model performance on multiple segmentation benchmarks, demonstrating the improvements.
- Many qualitative results and comparisons are provided.
Weaknesses
- The "TRK" token is a single token designed for the task, so it seems the model cannot generate multiple segmentation mask tracklets.
- VideoLISA has limited technical contributions in its design, where only the sampling strategy is specially designed for the video domain, and the "TRK" token seems to be a simple extension from the original LISA with a simple adaptation.
- PixelLM is another reasoning segmentation work that is able to segment multiple targets and support multi-round conversations, but it is not compared in the paper.
Questions
- How can this single token be used to generate segmentation masks for multiple targets?
- I wonder how the model performs on text-generation tasks; does the model preserve its original ability to hold conversations?
- The sampling strategy reminds me of the strategy used in LLaMA-VID, which compresses the visual information of each frame to one token. I am curious about the performance of adding a segmentation head to such models, so that the model is capable of both generating texts and producing masks.
Limitations
Yes, the authors addressed the limitations of their work.
We thank the reviewer for reviewing our paper.
Q1: It is not clear whether the model can segment multiple objects
Response:
- Firstly, we would like to emphasize that single object segmentation is the standard and most popular setting in the general field of language-guided object segmentation. In referring image/video segmentation tasks and the reasoning image segmentation task, typically a text query describes a specific object in a straightforward or implicit way. The model is tasked to segment the object based on the given text description.
- Our work follows this widely adopted setting and democratizes reasoning segmentation to videos. In our paper, we did not make any claim regarding multi-object capability. We thank the reviewer for the valuable suggestion; we regard this as meaningful future work and will discuss the possibility in the revised paper.
Q2: The [TRK] token design seems to be simple.
Response:
- We would like to highlight that we do not pursue complexity in the model design. Instead, we aim to design a framework that is effective and well suited to solving the problem. In fact, the approach of using a single [TRK] token to segment all video frames is intentionally designed to be simple to avoid unnecessary complexity.
- The underlying rationale of the design comes from in-depth analysis and extensive experiments. Our motivation is that a single unified representation has the potential to segment all video frames, as revealed by related works. Thus, a single [TRK] token, serving as an information bottleneck, is tasked to learn a compact object representation. However, this is non-trivial to achieve, as shown by our investigation of existing models, including SAM and LISA. Through experiments and analysis, we identify two key factors: incomplete input information and inappropriate training objectives. We thus enhance the cross-frame association capability of the [TRK] token from these two perspectives. We argue that the One-Token-Seg-All approach is simple yet effective by design; the analysis, experimental exploration, and training objective design make it effective and non-trivial.
Q3: It is better to compare with another related work, PixelLM.
Response: We thank the reviewer for providing this valuable suggestion. We discuss the PixelLM model and compare it with our model below. We will include the discussion in the Related Work section and the comparison in the Experiments section of the revised paper.
- Model design discussion. PixelLM is a large multimodal model for pixel-level image reasoning and understanding. Compared to LISA, PixelLM exhibits unique advantages in handling multiple objects. PixelLM excels at image-based tasks but is inherently incapable of handling videos. Our VideoLISA aims to democratize the reasoning segmentation capability to videos. When comparing VideoLISA with PixelLM, VideoLISA highlights video segmentation while PixelLM emphasizes multi-object segmentation.
- Experimental comparison. We compare LISA, PixelLM, and VideoLISA on public benchmarks, as shown in Table 1 of the rebuttal PDF. Since LISA and PixelLM are not designed for videos, we adopt a One-Token-Seg-All approach similar to VideoLISA: the prompt embedding from the first frame is used to segment the subsequent frames. The results show that VideoLISA outperforms LISA and PixelLM by a large margin. This performance gap comes from VideoLISA's dedicated designs for handling the video temporal dimension, which enable it to understand video content and output temporally consistent masks.
Q4: It is better to show how the model performs on text-generation tasks.
Response: We thank the reviewer for pointing this out.
- In VideoLISA, we intentionally re-purpose the large multimodal model into an expert model for reasoning video object segmentation. Our main focus is to democratize the reasoning segmentation capability to videos, and the proposed designs are specially tailored for the video temporal dimension. Thus, text-generation capability is not specially considered.
- To assess this capability, we evaluate VideoLISA on popular multimodal understanding benchmarks, including MME, Science-QA, GQA, and TextVQA, in Table 2 of the rebuttal PDF. We find that, compared to the original large multimodal model, i.e., LLaVA, VideoLISA shows significant performance degradation on text-generation tasks. This is not surprising, as the training datasets of VideoLISA are mainly composed of segmentation data and the model has been optimized for reasoning and localization.
- We note that this is a common limitation among reasoning segmentation models. As shown in Table 2 of the rebuttal PDF, the LISA model shows even worse performance. Developing segmentation capability while preserving chat capability is non-trivial, as it involves various aspects such as data curation and training strategy. We will discuss this in the revised paper to inspire future work.
Q5: I am curious about the performance of adding a segmentation head to models like LLaMA-VID.
Response: LLaMA-VID compresses each video frame into two tokens, reducing computational cost. We adopt the LLaMA-VID architecture and add a segmentation head similar to VideoLISA's. We report the experiment results in Table 3 of the rebuttal PDF. To answer the reviewer's question: 1) LLaMA-VID, equipped with a segmentation head, is capable of segmentation and achieves decent performance across benchmarks; 2) when compared with VideoLISA, LLaMA-VID shows worse performance across all evaluated benchmarks. The performance gap stems from LLaMA-VID compressing the visual tokens to an extremely low resolution, i.e., only two tokens per frame. This compression inevitably loses visual spatial details, which are essential for segmentation.
We sincerely hope the rebuttal answers your questions and addresses your concerns.
Thank you for the thorough responses. The responses partially address my concerns. However, there are still a few points that are not convincing.
- Capacity of segmenting multiple objects. I understand segmenting multiple instances is much more difficult than segmenting a single object. However, the name of your approach, "one token seg all", emphasizes the capacity to segment multiple objects at once with a single token.
- As for the comparison of PixelLM with your approach, can you provide further analysis of the performance gap? Because PixelLM is capable of handling multiple objects while yours isn't. The major difference seems to be that you have the temporal association module while previous methods don't. However, temporal association (via one token) is not novel in previous video segmentation tasks.
Dear Reviewer vGHj,
Thank you very much for your positive feedback and valuable suggestions! We are glad to hear that our rebuttal has addressed some of your concerns.
Regarding the two points you mentioned:
- We would like to clarify that "One-Token-Seg-All" emphasizes the capacity of using one single token to segment multiple video frames, rather than multiple objects. We also do not claim the ability to handle multiple objects in our paper.
- As we discussed in the rebuttal, single-object referring/reasoning segmentation is a standard and widely adopted setting. Since our main focus is video data, rather than single- versus multi-object segmentation, we believe this standard setting is suitable and effective for assessing our method.
- We greatly thank the reviewer for pointing out this potentially misleading expression in the name of the approach. We will consider revising the name and adding more clarification in the revised paper.
- In the comparison between VideoLISA and PixelLM on video object segmentation benchmarks, VideoLISA shows better performance than PixelLM. The performance gap mainly comes from the model's design for processing videos. We provide the analysis below and will add it to our revised paper.
- PixelLM focuses on multiple objects in images but has little design for videos. When applying such image-based models to videos, they face two challenges: 1) they cannot understand video temporal dynamics, making them unable to handle temporal-related text queries such as actions, state changes, etc.; 2) they cannot output temporally consistent masks. These two issues make image-based models, including LISA and PixelLM, struggle to properly process video data.
- Even if we equip an image-based model with a "temporal association module", it is still hindered by the first challenge regarding video understanding. This argument has been validated by experiments in our paper: in Table 6 of the paper and Table 9 of the appendix, we equip LISA with a "temporal association module" (XMem), and it still shows significantly worse performance than VideoLISA.
- In contrast, our VideoLISA is specifically designed for videos. Rather than a single "temporal association module", VideoLISA has two novel designs for videos: the Sparse Dense Sampling strategy and the One-Token-Seg-All approach. The design motivation and rationale have been discussed in the rebuttal, and the effectiveness has been validated by extensive experiments in our paper. Corresponding to the two challenges stated above, both modules are essential: Sparse Dense Sampling makes the model aware of both spatial and temporal information in videos, while One-Token-Seg-All enables temporally consistent segmentation.
- From the functionality perspective, PixelLM addresses image understanding and language reasoning, whereas previous video segmentation models mainly address frame-level understanding and video temporal association. Our VideoLISA is the first model that systematically integrates all these capabilities, and this is non-trivial to achieve (not simply stacking existing models together). As mentioned above, the experiment in Table 9 of the appendix also demonstrates that simply stacking existing models yields only sub-optimal results.
We greatly thank the reviewer for the feedback and constructive questions, which are very helpful in improving our work. We sincerely hope this response answers your questions. Any questions/comments are warmly welcomed!
Thank you for your response. Since most of my concerns have been addressed, I will raise the score to BA (borderline accept). However, in the main text, the discussion and comparison of different image reasoning methods with different temporal association designs (other than XMem) are quite limited. Also, the experiments comparing the sparse-dense sampling strategy with the Q-Former structure should be included to address the novelty concern. What's more, the degraded text-generation capability should be discussed and mentioned (for example, PixelLM can hold conversations with users while LISA cannot). I hope the final version will be better organized in this regard.
Dear Reviewer vGHj,
We are glad to hear that our response has successfully addressed your concerns!
We sincerely thank you for your constructive suggestions. We will follow them and include more comprehensive comparisons in our revised manuscript, covering different image reasoning segmentation methods with different temporal association designs, the Q-Former architecture, and so on. Moreover, we will discuss the problem of text-generation capability to inspire future work.
Thank you again for your great efforts to help us improve!
This paper introduces VideoLISA, a multimodal LLM for reasoning segmentation in videos. A Sparse Dense Sampling strategy is proposed to balance the temporal context and spatial detail for video modeling. Extensive results on various benchmarks demonstrate the effectiveness of the proposed method.
Strengths
The paper is well-organized and easy to follow.
The introduction of the VideoLISA model and the ReasonVOS benchmark establishes a new paradigm in video segmentation, making it highly inspiring.
Weaknesses
- The proposed Sparse Dense Sampling strategy and One-Token-Seg-All framework are both very simple and lack sufficient innovation.
- The setups of some ablation studies are not very clear, e.g., whether the n in the n-frame setting of Table 5 refers to T_sparse or T_dense is not clarified, and how XMem is combined with LISA in Table 6 is also not introduced.
Questions
Please refer to the Weaknesses section.
Limitations
The limitations have been discussed in the paper.
We thank the reviewer for reviewing our paper.
Q1: The two proposed modules seem relatively simple.
Response:
- We would like to first emphasize that we do not pursue complexity in the model design. Instead, we aim to design a framework that is effective and well suited to solving the problem. The underlying rationale of the model design comes from in-depth analysis of the task and extensive experiments. In VideoLISA, to address the unique challenges in marrying LLMs with video object segmentation, we propose two key innovations.
- Firstly, the Sparse Dense Sampling strategy is designed to enable the LLM to capture and understand temporal dynamics in videos. By leveraging the inherent temporal redundancy of videos, this strategy achieves a delicate balance between preserving visual details and temporal context, making it favorable for video object segmentation tasks (an illustrative token count is given after this list). In evaluating this approach, we also unveil two important properties of the ReasonVOS task: 1) sparse temporal tokens can boost performance by introducing more temporal cues; 2) preserving dense spatial tokens is essential for segmentation, whereas pooling over the spatial dimension causes performance degradation. Both conclusions are important lessons for the community and support the rationale of our approach.
- Secondly, we propose the One-Token-Seg-All approach to achieve temporally consistent segmentation masks in the promptable mask decoding paradigm. Our motivation is that a single unified representation has the potential to segment all video frames, as revealed by related works. Thus, a single [TRK] token, serving as an information bottleneck, is tasked to learn a compact object representation. However, this is non-trivial to achieve, as shown by our investigation of existing models, including SAM and LISA. Through in-depth analysis and experiments, we identify two key factors: incomplete input information and inappropriate training objectives. We thus enhance the cross-frame association capability of the single token from these two perspectives. We argue that the resulting One-Token-Seg-All approach is simple yet effective by design; the analysis, experimental exploration, and training objective design make it effective and non-trivial.
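For intuition, a back-of-the-envelope token count with assumed values (not the exact configuration in the paper): feeding a 32-frame clip densely at 256 tokens per frame would give the LLM 32 × 256 = 8192 visual tokens, whereas keeping 4 frames dense and pooling the remaining 28 frames to one token each gives 4 × 256 + 28 × 1 = 1052 tokens, roughly an 8× reduction, while the dense frames still retain full spatial detail.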
Q2: The settings of a few ablation studies were not clearly stated.
Response: We thank the reviewer for pointing out this writing flaw. Due to the page limit, we present the ablation studies in a concise and compact way in the main paper; the Appendix reports more detailed settings and results. Here we address the two points mentioned by the reviewer.
- N-frame experiment setting. This setting is stated in lines 616-622 of the Appendix. In Table 5, the n-frame setting directly concatenates the visual features from n sampled frames as input to the large language model. In our implementation, the value of n is set equal to T_dense for comparison. This choice is mainly driven by computational cost, as T_sparse can be relatively large (e.g., 32 in our implementation).
- How to combine XMem with LISA in Table 6. This setting is stated in lines 646-653 of the Appendix: “During inference, LISA outputs the segmentation mask of the first frame based on language instruction. The tracker then tracks the segmented object through the video, yielding segmentation masks for the subsequent frames. Specifically, we adopt the popular XMem model as the tracker, as shown in the second row of the table.”
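A minimal sketch of this baseline pipeline is given below; `lisa.segment`, `xmem_tracker.initialize`, and `xmem_tracker.track` are hypothetical wrappers used for illustration and do not reflect the real XMem API.

```python
def lisa_plus_xmem(frames, instruction, lisa, xmem_tracker):
    """Baseline sketch: LISA segments frame 0, a tracker propagates the mask."""
    first_mask = lisa.segment(frames[0], instruction)  # language-guided mask on frame 0
    xmem_tracker.initialize(frames[0], first_mask)     # build memory from the first frame
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(xmem_tracker.track(frame))        # propagate through the video
    return masks
```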
We sincerely hope the rebuttal answers your questions and addresses your concerns.
Thanks for the thorough explanation. The responses address most of my concerns; I will raise my rating to borderline accept.
We thank the reviewers for their time and efforts in reviewing our paper. We respond to the reviewers' questions in their own thread separately and place the mentioned tables in the PDF.
Dear Reviewers,
Thank you for serving as a reviewer for NeurIPS. The rebuttal deadline has just passed.
The authors have provided their rebuttal. Could you please check the rebuttal and your fellow reviewers' comments, engage in any necessary discussion, and update your final recommendation as soon as possible?
Thank you very much.
Best,
Area Chair
This paper was reviewed by three experts in the field. The recommendations are Borderline Accept x 3. Based on the reviewers' feedback, the decision is to recommend acceptance of the paper. The reviewers did raise some valuable concerns (especially the insufficient technical innovation noted by Reviewers urmZ and vGHj, and the missing detailed and important comparisons with previous literature noted by Reviewers vGHj and oRTn) that should be addressed in the final camera-ready version of the paper. The authors are encouraged to make the necessary changes to the best of their ability.