PaperHub

Rating: 6.8 / 10 (Poster; 4 reviewers; min 5, max 8, std. dev. 1.3)
Individual ratings: 8, 8, 5, 6
Confidence: 4.5, Correctness: 3.3, Contribution: 3.5, Presentation: 3.3
NeurIPS 2024

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

OpenReview | PDF
Submitted: 2024-05-04 | Updated: 2024-12-19
TL;DR

We present a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos.

Abstract

Keywords
Multimodal Large Language Model, Unified Large Language Model

Reviews and Discussion

Review (Rating: 8)

This paper proposes a unified vision LLM for various downstream visual tasks, including understanding, generating, segmenting, and editing. The proposed framework addresses the problems with respect to images, videos, text, and human interactions. The authors also propose a hybrid method to integrate discrete textual instructions and continuous signal embeddings.

Strengths

From low-level to high-level semantics, the proposed method addresses vision problems in a unified framework. In addition, generation and editing tasks are also included in this framework. The proposed instruction-passing mechanism over discrete text and continuous embeddings helps address the different downstream tasks. The experiments are comprehensive.

Weaknesses

There are no obvious technical weaknesses from my side. However, I do have some questions about this paper.

  1. The authors put "pixel-level" in a key position in this paper. From my side, although the proposed model supports different visual tasks, many of the downstream tasks are not pixel-level, such as grounding. Why did the authors highlight this concept?
  2. The implementation details are missing, especially for the synergy module. How can readers learn about the retraining and design process in detail?

Questions

  1. To what extent can the "unification" be applied to multimodal tasks?
  2. The authors make statements about instructions and cooperation in Lines 57-59. How are these aspects reflected in the designed model and experiments?
  3. How does the model distinguish task-specific features from task-invariant ones? In other words, did the authors do anything to assign these objectives during model training?

Limitations

As most parts are clear to me, I am looking forward to the answers to the above questions. I suggest adding more details about the proposed module and the training scheme in the appendix.

Author Response

We take your comment that "there are no obvious technical weaknesses from my side" as the greatest acknowledgment of the value of our work. Thank you very much! Your affirmation gives us great motivation to further improve our research.


Q1: The authors put "pixel-level" in a key position in this paper. As the proposed model supports different visual tasks, many of the downstream tasks are not pixel-level, such as grounding

A: From a broader perspective, research on MLLMs is very vibrant and moving rapidly. Future research on MLLMs will inevitably develop towards a more unified approach, covering more modalities in breadth and achieving more powerful task-processing capabilities in depth. This work focuses more on the latter. The initial visual MLLMs could only support coarse-grained, instance-level understanding or generation, and a key problem there is object hallucination due to the lack of pixel-grounding capabilities. To address this, the community has gradually developed more advanced visual MLLMs that support fine-grained, pixel-level visual understanding and generation, including visual grounding, segmentation, editing, etc. This is reasonable because achieving stronger task performance requires stronger visual processing capabilities. In fact, a model with fine-grained visual processing capabilities will certainly also be stronger on coarse-grained tasks. Therefore, we emphasize achieving pixel-level visual capability in this paper.


Q2: The implementation details are missing, especially for the synergy module

A: We apologize for the lack of necessary technical description due to the 8-page limit. Please refer to Appendix E.3, 'Cross-task Synergy Learning', where we provide an extended introduction to this module. Also, in Appendix F, 'Extended Details of Experimental Settings', we extend the introduction to the overall framework's implementation. Moreover, the code we provide includes detailed implementation information. Our system, including the code and all data, will be open-sourced, and we will continue to improve the open-source code and provide a detailed manual.


Q3: To what extent can the "unification" be used for multimodal tasks?

A: Unification refers to using a single model or system to execute a variety of tasks, just as one ChatGPT can perform all NLP tasks. For visual multimodal tasks, we categorize all visual tasks into four orthogonal grand categories and use a separate backend module to support each category. For example, for the visual comprehension category, our system covers almost all image/video-to-text tasks.

Going beyond the unification of pixel-level visual MLLMs, in this paper we also emphasize achieving synergy across multiple modalities and tasks in order to realize a true generalist on the path towards human-level AI. We view synergy as a type of generalization ability: only when an MLLM achieves generalization across different modalities and tasks can we say we have reached the next level of MLLM towards human-level AI, similar to how ChatGPT achieves cross-task synergy. However, we have not yet seen such generalization capabilities in MLLMs (or unified multimodal models). The Cross-task Synergy Learning mechanism proposed in Vitron is a key starting point towards that goal.


Q4: The authors made statements about instructions and cooperation. How are these aspects reflected in the designed model and experiments?

A: We propose a hybrid message-passing mechanism that combines discrete textual instructions with continuous signal embeddings. The former helps accurately invoke the different backbone modules (thanks to the LLM's proficiency in task dispatching), while the latter supplements richer, modality-preserving visual features that cannot be described directly through discrete text. As depicted in Fig. 2, the LLM outputs 1) text responses for users, 2) text instructions for module invocation, and 3) feature embeddings of special tokens. Both the text instructions and the feature embeddings are passed to the backend modules. In Figure 4, we compare these two message-passing mechanisms to determine whether discrete textual instructions or continuous signal embeddings are more beneficial for building a multimodal generalist, and we validate the pros and cons of the proposed hybrid method there.
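
For intuition, here is a minimal sketch of this three-way output and the routing step; the field names, the "module: command" string format, and the dispatch helper are hypothetical placeholders, not the exact interface used in Vitron.

```python
from dataclasses import dataclass

import torch


@dataclass
class LLMOutput:
    """Hypothetical container for the three kinds of LLM outputs described above."""
    user_response: str               # 1) natural-language reply shown to the user
    module_instruction: str          # 2) discrete textual instruction, e.g. "module: command"
    signal_embeddings: torch.Tensor  # 3) continuous embeddings of the special signal tokens


def dispatch(output: LLMOutput, backends: dict):
    """Route the discrete instruction and continuous embeddings to a backend module.

    `backends` maps a module name to a callable specialist; the "name: command"
    string format assumed here is illustrative only.
    """
    module_name, _, command = output.module_instruction.partition(":")
    specialist = backends[module_name.strip()]
    # Both the textual command and the soft feature embeddings reach the backend.
    return specialist(command=command.strip(), features=output.signal_embeddings)
```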


Q5: How did the model distinguish the features from task-specific and task-invariant parts?

A: Technically, we employ adversarial training to decouple task-specific from task-invariant features. We first let different backbone visual specialists make task predictions based on these two features (via concatenation). Meanwhile, we encourage a third-party discriminator (acting as a classifier) to determine which is the current task based solely on the shared feature representation. Ideally, once the discriminator can no longer accurately identify the task, the shared feature can be considered the most purified and broadly applicable across tasks. We further detailed formal task modeling in Appendix E.3, 'Cross-task Synergy Learning'.
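
A minimal PyTorch-style sketch of this adversarial setup; we assume a gradient-reversal layer as one common way to realize the min-max game (the paper's exact formulation is in Appendix E.3, and all names below are illustrative):

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()


def decoupling_loss(task_specific, task_invariant, task_id, target,
                    specialist_head, discriminator, task_loss_fn):
    """Illustrative combined objective: the specialist predicts from the concatenation
    of both feature parts, while the discriminator tries to recover the task identity
    from the shared (task-invariant) feature alone; the reversed gradient pushes that
    shared feature toward being task-agnostic."""
    pred = specialist_head(torch.cat([task_specific, task_invariant], dim=-1))
    loss_task = task_loss_fn(pred, target)

    logits = discriminator(GradReverse.apply(task_invariant))
    loss_adv = F.cross_entropy(logits, task_id)

    return loss_task + loss_adv
```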

During the Embedding-oriented Decoder Alignment Tuning stage, we align the feature embedding with all the visual module’s input encoders via the decoding-side projection layers. We do this feature alignment learning by minimizing the distance between the projected feature embedding and the module’s input encoder. For example, for diffusion-based image or video generation, we may directly use the textual condition encoder, while keeping all other modules fixed.


Q6: I suggest adding more details of the proposed module and the training scheme in the appendix.

A: Thank you for your suggestion, we will adopt it and reflect it in the revision.

Comment

Thank you for the detailed responses to my questions and comments.

However, the response to Q3 is still confusing to me. And after checking other reviewers' comments, I have a similar concern regarding the novelty. Thus, I am inclined to decrease my rating a bit, to Weak Accept.

Comment

Dear reviewer #vBVg,

Thank you for your response and feedback.

If possible, could you please specify the points of confusion regarding Q3 in more detail? We would be more than happy to provide further explanations and clarifications.


Regarding the novelty of this work, we would like to re-emphasize it here.

From the idea perspective:

  1. We introduce, for the first time, a grand unified vision MLLM, Vitron, which is capable of instance- and pixel-level understanding, generating, segmenting, and editing of both images and videos in a "one-for-all" manner.

  2. We also propose the idea of achieving cross-modal and cross-task synergy (i.e., generalization ability) to realize true generalist capabilities towards human-level AI. This is also the core and most valuable aspect of our work.

From the technical perspective:

  1. We have devised a novel hybrid instruction-passing mechanism that combines explicit textual instructions with implicit signal representations, enabling Vitron to support comprehensive and powerful fine-grained visual processing capabilities.

  2. We introduce the cross-task synergy learning mechanism, which allows Vitron to achieve visual generalization not merely by invoking or recalling tools like an agent, but by emphasizing the synergy across multiple modalities and tasks.


We were deeply grateful to receive your very strong support (i.e., Strong Accept). Now that you have further questions, we are eager to address them.

We look forward to your further feedback and hope you will give us the chance to address any remaining concerns.

Best regards

Review (Rating: 8)

This paper introduces a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing both static images and dynamic videos. Technically, the authors propose using the LLM as the core brain, incorporating different encoders for images, videos, and pixel-level regions to extend the comprehensive capabilities of LLMs, and various decoders for images, videos, and segmentations to extend the generative capabilities of LLMs. Additionally, a well-rounded hybrid message-passer, incorporating discrete textual instructions and continuous signal embeddings, is designed to pass messages from the LLM to backend decoders. A novel cross-task synergy module is introduced to enhance the synergy between different tasks. To demonstrate the efficacy of the proposed method, the authors conduct experiments on 12 visual tasks and evaluate across 22 datasets. Extensive experiments show that the proposed method can handle multiple vision-language tasks, spanning from visual comprehension to visual generation, and from low-level pixel understanding to high-level semantic understanding.

Strengths

  1. The motivation of this work is clear and reasonable. Building a universal generalist is a trending topic, achieving a "one for all" capability. Images and videos are two core vision interfaces through which humans understand the world; thus, powerful vision-language generalists should be capable of comprehending and generating images and videos.
  2. This work presents a well-designed unified framework, integrating the SoTA specialists and enabling performance across various vision-language task groups from vision comprehension to generation, and from low-level pixel understanding to high-level semantic comprehension.
  3. This work proposes a hybrid instruction passing method by combining discrete textual instructions and continuous signal feature embeddings to ensure effective and precise information passing to the modules.
  4. The authors design a fine-grained spatial-temporal vision grounding instruction tuning method, enabling sufficient pixel-level visual perception. Furthermore, they apply adversarial training to enhance synergy between different tasks, achieving overall performance improvement.
  5. Extensive experiments are solid, and conducted on various datasets and tasks, including vision segmentation, fine-grained vision understanding, and vision generation. The experimental results show that after integrating cross-task synergy learning, the framework consistently improves performance across the majority of tasks.
  6. This work builds a text invocation instruction tuning dataset. If it is made publicly available, it will benefit the development of related research.

Overall, I like this paper. I think the proposed idea is novel, interesting, and intuitive. I consider the contribution, novelty, and quality of the paper to be high.

Weaknesses

  1. The authors may need to clarify the advantages of their work over existing generalists like HuggingGPT and Visual ChatGPT, which also aim to be generalist models to some extent.
  2. The authors propose a hybrid message-passing mechanism and provide a corresponding ablation study. But it would be better to have an intuitive demonstration of the information complementation between discrete textual instructions and continuous feature embeddings. For example, which types of information are passed by textual instructions and which are conveyed by the embeddings.
  3. GPT-4 is employed to construct the text invocation-oriented instruction tuning datasets. The detailed prompt templates and some examples should be provided.
  4. Some symbols need clarification, such as the J and F in Table 3.
  5. A value bar should be added to intuitively show the synergy across different tasks. Additionally, the calculation of the synergy between tasks should be clarified.

Questions

  1. I have a question that I would like to discuss with the authors: What are the current limitations of the unified model? Given that the performance of specialists is still not very satisfactory, what is the significance of a unified model? In the future, if the form of specialist tasks changes (e.g., the emergence of vision LMs rather than LLMs being at the core), what modifications would the current unified model need to undergo?
  2. If there are better specialists available, how can they be replaced, and what would be the cost of doing so?
  3. Are other modalities, such as audio or 3D, being considered for inclusion with fine-grained capabilities?

Limitations

N/A

Author Response

Thank you for recognizing our work and providing insightful comments that significantly enhance the quality of our paper. Here are our responses to your concerns and questions, and we hope to gain your further support.


Q1: The authors may need to clarify the advantages of their work over existing generalists like HuggingGPT and Visual ChatGPT, which also aim to be generalist models to some extent.

A: Vitron is not just a combination or tool invocation; it introduces a new concept: the necessity of synergy. This is what differentiates Vitron from other agent-based generalist works, including Visual ChatGPT, HuggingGPT, and LLaVA-Plus. These methods operate like agents, merely invoking external modules through LLM, which is ultimately ineffective because the resultant generalist cannot surpass meta-specialists. To achieve a truly stronger generalist, cross-modal/cross-task synergy is essential, similar to how ChatGPT achieves cross-task synergy in NLP tasks. Synergy is crucial for unlocking the native multimodal emergent capabilities of MLLM.


Q2: The authors propose a hybrid message-passing mechanism and provide a corresponding ablation study. It would be beneficial to have an intuitive demonstration of how discrete textual instructions complement continuous feature embeddings.

A: This concept is straightforward. For simple user queries, textual instructions can successfully invoke the backend module to produce the correct result. However, for more complex user intents or queries, additional supplementary information and features help the backend module understand how to execute the task accurately. For instance, in an image caption task, textual instructions suffice by simply instructing the backend module to "describe the given image". But for a text-to-video generation task, instructions like "generate a video based on the text 'a dog walking in the park'" may need further visual information in the form of feature embeddings, such as details describing the dog or the park.


Q3: GPT-4 is employed to construct text invocation-oriented instruction tuning datasets. Detailed prompt templates and examples should be provided.

A: Thank you for pointing this out. We have exemplified a prompt for a video tracking task in Appendix E.1. We will provide complete prompt templates for all tasks in the revision, and our data will be fully open-sourced.


Q4: Some symbols need clarification, such as the J and F in Table 3.

A: Apologies for not explaining these symbols. For video object segmentation on DAVIS 17, we use the mean Jaccard index (J) and the mean boundary F-score (F), along with their average (J&F), to evaluate segmentation accuracy.


Q5: A value bar should be added to intuitively show the synergy across different tasks. Additionally, the calculation of the synergy between tasks should be clarified.

A: We conducted this analysis in Appendix G.3 Cross-task Synergy Study, and Figure 6 provides a visualization of the result. We calculate cross-task synergy by recording Vitron's performance on individual tasks without Cross-task Synergy Learning and after incorporating it, then comparing the performance improvements for the same tasks. Figure 6 shows the normalized improvement visualization.
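
For concreteness, a one-function sketch of this comparison, under our assumption that "normalized improvement" means the relative gain over the no-synergy baseline:

```python
def normalized_synergy_gain(score_with: float, score_without: float) -> float:
    """Relative per-task improvement after adding Cross-task Synergy Learning
    (assumes higher is better; the sign would flip for error-style metrics)."""
    return (score_with - score_without) / score_without
```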


Q6: What are the current limitations of the unified model? What is the significance of a unified model if the performance of specialists is still not very satisfactory?

A: As emphasized before, achieving true generalist capabilities towards human-level AI necessitates cross-modal and cross-task synergy. We view synergy as a form of generalization ability. Only when MLLM achieves this across different modalities and tasks can we speak of advancing to the next level of MLLM towards human-level AI, similar to ChatGPT's synergy in NLP tasks. However, we have yet to see such generalization capabilities in MLLM (or unified multimodal models). The Cross-task Synergy Learning mechanism proposed in Vitron is a key starting point towards this goal.


Q7: If better specialists are available, how can they be replaced, and what would be the cost?

A: Vitron supports replacing backend specialist modules. However, the entire system might need corresponding retraining and alignment learning. Fortunately, Vitron currently uses the state-of-the-art specialists.


Q8: Are other modalities, such as audio or 3D, being considered for inclusion with fine-grained capabilities?

A: While unifying various modalities, including sound, images, and 3D information, is an important trend (as demonstrated by NExT-GPT), the main focus of this work remains on the vision modality. Nonetheless, achieving more modality unification and synergy is also a significant goal of this work. We hope the proposed Cross-task Synergy Learning mechanism will receive more attention and study in subsequent research.

Comment

Thanks for the response. I think most of my questions and concerns have been addressed. Thus, I raise my rating to "Strong Accept".

Comment

Dear Reviewer #d2kF,

Thank you again for your continued strong support. We appreciate it very much!
We will make the necessary adjustments in the revision.

Thank you once again for your insightful feedback.

Best

Review (Rating: 5)

This paper proposes VITRON, a multimodal generalist that supports a wide range of vision tasks, treating images and videos as a unified entity. It combines discrete textual instructions with continuous signal embeddings for effective function invocation. The model is trained with fine-grained spatiotemporal vision-language alignment to enhance its pixel-level visual capabilities. A synergy module is developed to maximize the sharing of fine-grained visual features across different visual tasks, improving overall performance. Overall, VITRON aims to be a comprehensive system for vision-related tasks, demonstrating strong capabilities across various benchmarks.

Strengths

VITRON introduces a novel hybrid method for message passing, combining discrete textual instructions with continuous signal embeddings. It proposes pixel-level spatiotemporal vision-language alignment learning, advancing fine-grained visual capabilities. The paper demonstrates extensive capabilities across 12 visual tasks evaluated on 22 datasets, showcasing high performance. It includes a cross-task synergy module that optimizes task-invariant fine-grained visual features, enhancing task cooperation. VITRON represents a significant step towards a unified AI capable of understanding, generating, segmenting, and editing both images and videos.

Weaknesses

The proposed system is complicated but the paper is not well-structured. A clearer exposition of the system’s components and their interactions would greatly benefit the reader’s comprehension.

Literature Review: The paper would benefit from a more comprehensive literature review, particularly discussing related works such as Visual ChatGPT [1], HuggingGPT [2], InternGPT [3], ControlLLM [4], GPT4Tools [5], LLaVA-Plus [6], etc. A comparative analysis with these methods in the experimental section would provide valuable context and benchmarking.

Technical Clarifications: There are several areas in the paper where further clarification is needed.

  • The necessity of both the Module name and the Invocation command (Line 874) could be better justified. Is using the Invocation command alone enough?
  • Why does the Video Segmentation module need Region results? In my opinion, the region should be predicted by the backend models. The role of the Region results in the Video Segmentation module warrants further explanation, as it seems the backend model should predict these.
  • The alignment of Task-specific and Task-invariant Fine-grained Features with the inputs of the Backend Visual Specialists is unclear, especially since these specialists are frozen during training. Specifically, how are the Task-specific Feature and the Task-invariant Fine-grained Feature connected to the backend models? Or can the Backend Visual Specialists work without any input constraints?

Data Format Consistency: The data format inconsistency between Lines 880-881 and page 23 should be addressed to avoid confusion about task requirements and tool usage. Does it mean that the tasks on page 23 do not require using tools?

Impact of Features: More insight into how the Task-specific and Task-invariant Fine-grained Features affect the overall system performance would be beneficial.

Baseline Comparisons: Including training-free baselines that utilize GPT-4v with in-context learning and ReACT [7] for tool invocation would provide essential benchmarks for comparison.

Model Results: As this paper claims that "VITRON surpasses existing SoTA specialists' performance", it is recommended to present the results of specialist models in the experimental tables as reference performances. Furthermore, as VITRON is positioned as a generalist, comparisons with other generalist methods would also be fair.

Tool-use Accuracy: An exploration of tool-use accuracy across the 12 tasks covered by VITRON is necessary to validate the method’s efficacy.

Execution of Backend Modules: Clarification is needed on what constitutes the successful execution of backend modules in Figure 4. In addition, why text instruction is more conducive to the successful execution of backend modules while soft feature embedding seems to be more useful in terms of specific task performances?

Generally, the paper's contributions are incremental. To my knowledge, single tool invocation is not complicated, especially with a toolset of limited scale. Several papers can already solve tasks by chaining multiple tools.

[1] Wu C, Yin S, Qi W, et al. Visual chatgpt: Talking, drawing and editing with visual foundation models[J]. arXiv preprint arXiv:2303.04671, 2023.

[2] Shen Y, Song K, Tan X, et al. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face[J]. Advances in Neural Information Processing Systems, 2024, 36.

[3] Liu Z, He Y, Wang W, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language[J]. arXiv preprint arXiv:2305.05662, 2023.

[4] Liu Z, Lai Z, Gao Z, et al. Controlllm: Augment language models with tools by searching on graphs[J]. arXiv preprint arXiv:2310.17796, 2023.

[5] Yang R, Song L, Li Y, et al. Gpt4tools: Teaching large language model to use tools via self-instruction[J]. Advances in Neural Information Processing Systems, 2024, 36.

[6] Liu S, Cheng H, Liu H, et al. Llava-plus: Learning to use tools for creating multimodal agents[J]. arXiv preprint arXiv:2311.05437, 2023.

[7] Yao S, Zhao J, Yu D, et al. React: Synergizing reasoning and acting in language models[J]. arXiv preprint arXiv:2210.03629, 2022.

Questions

Please refer to Weaknesses.

Limitations

Please refer to Weaknesses.

Author Response

We feel honored by your many constructive comments and appreciate that you went through our paper so carefully. Below we try to address your concerns and misunderstandings. If you find our response effective, please consider generously increasing your rating.


Q1: The system is complicated, but the paper is not well-structured

A: Sec. 3 mainly introduces the framework components, and the three subsections of Sec. 4 further explain three important tuning mechanisms. We will further polish the paper's structure.


Q2: Needs a more comprehensive literature review, covering works such as Visual ChatGPT, HuggingGPT, InternGPT, ControlLLM, GPT4Tools, LLaVA-Plus, etc.

A: Actually, HuggingGPT and LLaVA-Plus are already included. The other works you listed, as well as the latest works, will all be covered in the revision.


Q3: Several clarifications are needed
Q3.1: The necessity of both Module name and Invocation command (Line 874)? Is using the Invocation command alone enough?

A: It is necessary to consider both the Module Name and Invocation Command. The module name determines which module to invoke, while the command serves as input to that module. Another benefit is that it allows the LLM to explicitly recognize which task module should accurately process the user's input demand.


Q3.2: Why does the Video Segmentation module need Region results?

A: Users may specify a particular area of interest to segment/track. For example, if a user specifies an area on a sketch pad, this area can be directly passed to the backend module. If the user specifies a region of interest using natural language, the LLM itself needs to determine and output this region result for the backend module.


Q3.3: How are the Task-specific and Task-invariant Fine-grained Features connected to the backend models?

A: The Task-specific Feature and Task-invariant Fine-grained Feature embeddings are concatenated into one vector, which is then fed to the feature encoder of the backend module. For example, for the video generation module of diffusion-based ZeroScope, the concatenated feature representation is fed directly into the UNet. Before connecting this feature vector to the backend module, we of course perform alignment between them.
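
A rough sketch of this wiring; the dimensions and the projection layer below are illustrative placeholders, and the actual interface to the frozen ZeroScope UNet is not shown:

```python
import torch
import torch.nn as nn

# Illustrative sizes: LLM-side feature width and the backend's conditioning width.
LLM_DIM, COND_DIM = 4096, 1024
proj = nn.Linear(2 * LLM_DIM, COND_DIM)  # decoding-side projection layer (trainable)


def build_backend_condition(task_specific: torch.Tensor,
                            task_invariant: torch.Tensor) -> torch.Tensor:
    """Concatenate task-specific and task-invariant embeddings and project them into
    the conditioning space expected by the frozen backend specialist; during alignment
    tuning only the LLM side and this projection receive gradients."""
    fused = torch.cat([task_specific, task_invariant], dim=-1)
    return proj(fused)
```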


Q4: The data format inconsistency between Lines 880-881 and page 23

A: Correct. The prompt in Lines 880-881 is used to tune the invocation of tools, while the prompts on page 23 do not involve tool invocation. Therefore, these prompts are not necessarily in the same format.


Q5: How do the Task-specific and Task-invariant Fine-grained Features affect the overall performance?

A: In Sec. G.3, the Cross-task Synergy Study, we provide a general overview of how different tasks utilize the Task-invariant Fine-grained Features and their impacts. We will consider conducting a more detailed study of the separate impacts of these features in the revision.


Q6: Should include training-free baselines that utilize GPT-4v with in-context learning and ReACT [7] for tool invocation.

A: Thanks. We implemented this experiment. Since GPT-4v does not support video, and given the time constraints of the rebuttal period, we only implemented one image processing task: referring image segmentation on RefCOCOg.

Method                  Val    Test
NExT-Chat               67.0   67.0
SEEM                    65.7   65.8
GPT-4v + SEEM           64.9   65.1
GPT-4v + ReACT + SEEM   65.4   65.4
Vitron                  67.9   68.9

Because Vitron is trained fully jointly, it performs better than the two pipeline systems, even though their backbone, GPT-4v, is stronger. As seen, the pipeline systems did not surpass the backend specialist (SEEM).


Q7: The paper claims "VITRON surpasses existing SoTA specialists"; the results of the specialists should be presented

A: Actually, our results (Tables 1-11) cover some important SoTA specialists, such as G-DINO for video grounding and MakeVideo for video generation, where Vitron achieved better performance. But thanks; we will consider covering more specialists and generalists in the revision.


Q8: How is the tool-use accuracy?

A: This is already shown in Fig. 4, specifically in the right-hand bar chart (i.e., Execution Rate).


Q9.1: What constitutes successful backend execution in Fig. 4?

A: For successful execution, it is crucial to accurately determine the module name and provide accurate input information, which includes both the commands and some necessary features.


Q9.2: Why is text instruction more conducive to successful execution, while soft feature embedding seems more useful for specific task performance?

A: The reason is straightforward: text instructions can only help ensure that the backbone module selects the downstream task correctly, but some critical information cannot be conveyed through pure text alone. Soft feature embeddings, on the other hand, provide very detailed additional information, giving the backbone module a more complete set of input features, which helps achieve better results on the output tasks.


Q10: Contributions are incremental; single tool invocation is not complicated

A: We believe the most core and valuable aspect of our paper is that Vitron is not simply about combining or invoking tools; rather, it proposes a new idea: necessitating synergy. This is what distinguishes Vitron from other agent-based generalist works (including Visual ChatGPT, HuggingGPT, and LLaVA-Plus).

These existing systems operate essentially as agents, which can be quite meaningless, as the resultant generalist cannot surpass the meta-specialists. To achieve a truly more powerful generalist, it is key to realize cross-modal/cross-task synergy, akin to how ChatGPT achieves cross-task synergy in NLP tasks. Synergy is pivotal to unlocking the native multimodal emergent capabilities of MLLMs.

Comment

Dear reviewer #BLKJ,

Thank you for the comments on our paper.

We have posted our responses to your comments above. Please let us know if you have additional questions so that we can address them during the discussion period.

We really hope you will consider raising the score if we have done a good job of addressing your concerns.

Thanks again!

Comment

I still have a question that needs to be clarified. The backend modules are frozen, as you mentioned, but the embeddings generated by the MLLM will be fed into the backend modules. I noticed the embedding alignment tuning for decoders in Lines 202-205. It seems the alignment between LMMs and decoders just follows the idea from NExT-GPT. This alignment, in my view, is extremely important and should not be explained with just two sentences. I need the authors' clarification on this issue. Generally, I may raise my score, but I still have questions about the novelty of this work.

Comment

Dear reviewer #BLKJ,

Thank you for your reply and feedback.

We apologize for not providing a detailed description of this part in our manuscript. This was mainly due to the many system components in Vitron and the limited space available, which required us to carefully balance the content.

Here, we provide further clarification and more description for the section on "Embedding-oriented Decoder Alignment Tuning". (Apologies for taking a day to respond; we've been attending a conference and faced some scheduling conflicts.)


Our description in the main text is rather general (hence just a few sentences); however, the alignment strategies for this section are specifically designed according to the functionalities of different backbone modules, thus the design of each module varies slightly.

Overall Design and Principles: We believe that text instructions are simple and efficient, yet explicit textual instructions carry limited information and may lack detail. Therefore, we designed an implicit signal representation to provide additional, detailed information. This is also the key difference between Vitron and NExT-GPT w.r.t. the backend alignment part. Technically, partially following NExT-GPT (yes, we did), we introduce a set of signal tokens, [SIG_1], ... [SIG_N], to guide the downstream task processes. But unlike NExT-GPT, which introduces different signal tokens for different modalities, we utilize the same set of signal tokens for all functional tasks in our Vitron system.

Alignment Learning Design: We have designed different alignment strategies according to the backbone of specialists. Specifically, we have categorized our approaches into two:

  1. Diffusion-based backend modules, for tasks like image and video generation, and image and video editing. Under this diffusion architecture, the explicit text instruction is modeled by the text encoder, yielding an explicit instruction representation h^{exp}. The implicit signal representation is divided into task-specific features and task-invariant features to capture synergy across tasks, which are then fused/concatenated as the final implicit instruction representation h^{imp}. Through a projection module, these two types of features (h^{exp} and h^{imp}) are concatenated and mapped as the conditional input to the pre-trained UNet of the diffusion model. During alignment learning, we aim for the instruction representation to be instruction-specified and detail-enriched, effectively guiding the diffusion-based module to generate high-fidelity images and videos. Our learning objectives include a) the diffusion loss, which directly learns the correspondence between the instruction representation and the conditional generation behavior of the diffusion model, and b) minimizing the distance between the instruction representation and the instruction feature from the text encoder of the text-to-image generation model, enriching the instruction with more fine-grained details. However, the backend module remains frozen and does not require updates (as described in Lines 202-205 of the paper); updates are applied through backpropagation to the signal feature embedding/representation (from the LLM).

  2. Non-diffusion-based backend modules, for tasks like image segmentation and video segmentation via SEEM. This model architecture inherently includes a textual encoder for receiving textual instructions and a feature encoder for encoding visual embeddings, separately. Thus, we input text instructions into the textual encoder and the enriched implicit feature h^{imp} (with more fine-grained details) into the visual feature encoder. Just as in the process above for the diffusion-based backend, we obtain a final, efficient instruction representation. Then, during the alignment learning phase, our objectives include a) the learning target of segmentation itself and b) minimizing the distance between the implicit feature h^{imp} and the vision feature of the gold segmented image/video. An illustrative formalization of both objectives is sketched right after this list.
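
As an illustrative formalization of the two cases above (our simplified notation, not the paper's exact formulas: \epsilon_\theta is the frozen denoising UNet, E_text and E_vis are the frozen text/vision encoders, y and x_gold are the textual condition and the gold segmented image/video, and \lambda is a hypothetical weighting term):

```latex
% (1) Diffusion-based backends: denoising loss conditioned on the fused instruction
%     representation, plus a distance term toward the text encoder's feature.
\mathcal{L}_{\mathrm{diff\text{-}align}} =
    \mathbb{E}_{\epsilon, t}\,\bigl\lVert \epsilon - \epsilon_\theta\bigl(z_t, t, [\,h^{exp}; h^{imp}\,]\bigr) \bigr\rVert_2^2
    + \lambda \,\bigl\lVert h^{imp} - E_{\mathrm{text}}(y) \bigr\rVert_2^2

% (2) Non-diffusion backends (e.g., SEEM): the segmentation objective itself, plus a
%     distance term between the implicit feature and the gold segment's visual feature.
\mathcal{L}_{\mathrm{seg\text{-}align}} =
    \mathcal{L}_{\mathrm{seg}}
    + \lambda \,\bigl\lVert h^{imp} - E_{\mathrm{vis}}(x_{\mathrm{gold}}) \bigr\rVert_2^2
```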

Our code implementation includes all the technical details, which will be made completely open. We will update this point in the revision (preferably with some figure illustrations, most probably in the Appendix).


Finally, if you have any further questions, we are willing to provide additional responses. We sincerely hope you can increase your score.

Best regards

Comment

Dear reviewer #BLKJ,

Please let us know if our response adequately addresses your questions, such that we can have a fair opportunity to make timely replies before the author-reviewer discussion period ends in the next <10 hours from now (August 13th, 11:59pm AoE).

If not, we kindly ask you to consider raising the score.

Thank you!

Review (Rating: 6)

This paper proposes a universal vision LLM, named Vitron, for comprehensive understanding, generating, segmenting, and editing of both images and videos. The model incorporates SOTA visual specialists as the backend to support various visual tasks. A cross-task synergy module is devised to learn to maximize the task-invariant visual features. Experiments demonstrate the effectiveness of Vitron on 12 visual tasks.

Strengths

  • The paper is well-written and easy to follow.
  • The proposed model integrates multiple visual tasks, including understanding, generation, segmentation, and editing.
  • Experiments show that the performance of Vitron is comparable to SOTA methods of each visual task.

Weaknesses

Some SOTA methods are not included in the experiments. For instance, CM3leon [1] 7B achieves 4.88 FID on COCO-Captions. MAGVIT-v2 [2] reaches 53 FVD on UCF.

[1] Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, 2023.

[2] Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation, 2023.

Questions

Refer to weaknesses.

Limitations

The authors have adequately addressed the limitations.

Author Response

We would like to thank you for your time in writing comments, especially for your strong recognition of our paper. Below we provide our response.


Q1. Some SOTA methods are not included in the experiments. For instance, CM3leon [1] 7B achieves 4.88 FID on COCO-Captions. MAGVIT-v2 [2] reaches 53 FVD on UCF.

A: Thank you for pointing this out.

In fact, CM3leon utilizes retrieval-augmented pretraining, which retrieves additional information to help achieve a 4.88 FID on COCO-Captions. For a fair comparison, we should consider the performance of CM3leon without retrieval, which is 10.82 FID on COCO-Captions, and this does not outperform our Vitron's performance (7.57 FID).

As for MAGVIT-v2, we noted that the settings in their paper are different from ours, even though both use the UCF-101 dataset for video generation. MAGVIT-v2 is focused on class-conditional video generation, i.e., label-to-video generation, whereas Vitron is focused on image(frame)-to-video generation, mainly following the baseline DynamiCrafter's setting. Therefore, these scores are not directly comparable. Since MAGVIT-v2 is not yet open-sourced, we are unable to run their model on image(frame)-to-video generation for direct comparison.

However, we will for sure include these two works in the revision and will further cover the latest released relevant works that we failed to include when submitting to NeurIPS.

Additionally, compared to coarse-grained vision generation or understanding, our Vitron system has an advantage in fine-grained vision tasks (such as editing, grounding). Moreover, there is actually ample room for further performance improvement in various tasks, as Vitron has not yet been fully fine-tuned on the relevant datasets.

Lastly, we want to further emphasize the most core and valuable aspect of this paper. Vitron is not just about combining or invoking tools; it proposes a new idea: the model must be capable of exhibiting synergy. This is the core of Vitron that differentiates it from other similar agent-based generalist models, which simply use LLMs to invoke external modules; this is meaningless, as the resultant generalist cannot surpass meta-specialists. To achieve a truly more powerful generalist, it is essential to realize cross-modal/cross-task synergy. This is akin to how ChatGPT achieves cross-task synergy in NLP tasks. Synergy is equally key to realizing the native multimodal emergent capabilities of MLLMs.

So, if you are affirming the value of our work, we hope you can consider improving the score slightly. Thank you very much!

Author Response

General Response to All Reviewers


Dear Reviewers,

We sincerely appreciate the detailed and constructive comments you have provided on our work. We are fully committed to integrating your suggestions into our revision process. We feel very encouraged that the reviewers find our work novel, interesting, and intuitive. Your support is deeply appreciated.

Here we would like to re-emphasize the significant and distinctive contributions of this work. We believe three keywords can well represent this project: Unification, Fine-grained Vision, Synergy.

  1. Unification. Research on MLLMs is currently very vibrant and moving rapidly. Future research on MLLMs will therefore inevitably develop towards a more unified direction, i.e., covering more modalities in breadth and achieving more powerful task processing capabilities in depth. This work focuses more on the latter, and specifically on vision MLLMs. For the first time, this paper proposes a grand unified vision MLLM, Vitron, handling understanding, generating, segmenting, and editing of both images and videos in a "one-for-all" manner.

  2. Fine-grained Vision. In the visual MLLM community, the initial visual MLLMs could only support coarse-grained, instance-level understanding or generation. A key problem there is object hallucination due to the lack of pixel-grounding capabilities. Subsequently, more advanced visual MLLMs that support fine-grained, pixel-level visual understanding and generation (visual grounding, segmentation, editing, and more) have been emerging. We devise a hybrid instruction-passing mechanism and various pixel-level vision-language spatiotemporal alignment learning strategies, enabling the proposed Vitron model to support comprehensive and powerful fine-grained visual processing capabilities.

  3. Synergy. Vitron achieves a visual generalist not simply through tool invocation (like an agent); rather, it emphasizes achieving synergy across multiple modalities and tasks. We understand synergy as a type of generalization ability: only when an MLLM achieves generalization across different modalities and tasks can we say we have reached the next level of MLLM towards human-level AI. This is similar to how ChatGPT achieves cross-task synergy in NLP tasks. However, we have not yet seen such generalization capabilities in existing MLLMs. The Cross-task Synergy Learning mechanism proposed in Vitron is a key starting point toward that goal. This is also the most core and valuable aspect of this paper. We hope this work will inspire more related research to further explore this mechanism.

We will fix all the possible issues and improve the manuscript. To address all your concerns and questions, we have prepared a comprehensive response, including additional experiments where necessary. If you have any further questions or feedback, please don't hesitate to interact with us. We are more than willing to provide clarifications and address any additional concerns you may have.

Best regards

Final Decision

Overview

The submission introduces a unified framework to achieve a "one-for-all" multimodal model capable of understanding, generating, segmenting, and editing both images and videos at various levels of granularity, from image to pixel level. During the discussion, the authors emphasized three core aspects of their work: Unification, Fine-grained Vision, and Synergy. According to the reviews, the paper is well-supported by comprehensive experiments demonstrating the efficacy of the proposed method.

According to the discussions between the authors and reviewers, several strengths were commonly noted by the reviewers:

Novel concept: The idea of unifying a range of visual tasks under a single MLLM framework is innovative and timely, given the rapid advancements in multimodal models. The introduction of a cross-task synergy learning mechanism to enhance generalization across tasks is particularly noteworthy.

Comprehensive framework: The proposed framework integrates multiple backend visual specialists and combines discrete textual instructions with continuous signal embeddings for effective task execution. This hybrid instruction-passing mechanism is highlighted as a key technical contribution.

Extensive experiments: The authors conducted thorough experiments across a wide range of tasks, demonstrating that Vitron achieves comparable or superior performance to state-of-the-art (SOTA) methods. The inclusion of adversarial training to decouple task-specific and task-invariant features further strengthens the framework's robustness.

However, there are also a few weaknesses that kept the reviewers from reaching a strong consensus on acceptance:

Paper presentation: Several reviewers noted that the paper's structure could be improved for better clarity, particularly regarding the detailed explanation of the system's components and the alignment strategies for embedding-oriented decoder alignment tuning. The authors’ rebuttal did clarify these aspects, but the initial presentation was lacking in some areas.

Comparison with previous works: There were concerns about the novelty of the work, particularly in comparison to existing generalist models like HuggingGPT, Visual ChatGPT, LLaVA-Plus, etc. While the authors argue that their focus on synergy sets their work apart, some reviewers felt this distinction was not sufficiently emphasized or validated in the original submission.

Novelty: A recurring theme in the reviews was the question of novelty. Despite the authors' detailed rebuttal, some reviewers still felt that the contributions were incremental, particularly when compared to related works. This led to a slight decrease in ratings by some reviewers.

Final Recommendations

Overall, the main strengths of the paper lie in its novel approach to unifying visual tasks and the extensive experimental validation. However, the concerns regarding clarity, presentation, and the novelty of the work remain. Given the innovative nature of the work and its potential contribution to the field, it is recommended for acceptance, provided that the authors significantly improve the presentation, especially by clarifying the unique aspects of their contributions in comparison to existing models. The authors should include more detailed discussions and illustrations regarding the significance of the work compared with previous related works in the final version to address the raised concerns fully.