PaperHub

Overall rating: 5.4 / 10 (Poster; 5 reviewers; scores: 7, 5, 5, 5, 5; min 5, max 7, std 0.8)
Confidence: 4.2 | Correctness: 3.0 | Contribution: 2.6 | Presentation: 3.0

NeurIPS 2024

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

OpenReview | PDF
Submitted: 2024-05-10 | Updated: 2024-11-06
TL;DR

a large multimodal model with versatile vision-centric capabilities

Abstract

Keywords
Large Multimodal Models; Vision-Centric Capabilities

Reviews and Discussion

Official Review
Rating: 7

This paper introduces Lumen, a large multimodal model designed for versatile vision-centric tasks. Lumen features a decoupled architecture that initially learns a shared heatmap representation to address various vision-related tasks such as object detection and instance segmentation. Subsequently, this representation is decoded to perform these tasks seamlessly. Additionally, Lumen maintains robust capabilities in general-purpose visual question answering. Experimental results across various benchmarks demonstrate Lumen's superior performance compared to existing large multimodal models (LMMs), highlighting the effectiveness of the proposed method.

Strengths

  1. This paper has a good motivation: hiding the inductive bias of each individual visual task, introduced by its specific output format, with a unified heatmap representation. This design is meaningful when jointly finetuning an LMM with data from both vision-centric and VQA tasks. Unlike conventional instruction data converted from visual tasks (e.g., Pix2Seqv2, Griffon, etc.), which exposes the detailed vision-centric output formats (e.g., box coordinates for object detection, polygon vertex coordinates for instance segmentation), the reformulated instruction data in this paper are less contaminated by task-output knowledge that is massive in amount but lacking in semantics. Consequently, the proposed method enhances visual perception capabilities without significantly compromising the LMM's general-purpose conversational abilities.
  2. The two-phase training strategy is reasonable and effective. Because the authors do not leverage decoding techniques specifically designed for dense visual perception tasks (e.g., setting multiple object queries and optimizing with Hungarian matching in object detection), the convergence of heatmap learning is relatively slow, so mixing in VQA data at this phase would be uneconomical. In the second phase, the authors reinforce the conversational capability by jointly tuning with VQA data and vision-centric data.
  3. The experiments are sufficient and well organized. The proposed Lumen achieves superior performance to existing LMMs on vision-centric tasks, and exhibits VQA abilities comparable to conversation-proficient LMMs such as LLaVA and Qwen-VL.

Weaknesses

This paper has no major flaws, but I have some concerns about the writing and model design:

  1. The organization of this paper should be reconsidered. The training recipe is crucial for understanding how the conversational and dense visual perception capabilities are developed, and ablations on the training phases are discussed in the main paper; however, the detailed training recipe is placed in the Appendix. Although the authors provide a pointer in the main paper, I find this arrangement unsuitable. The detailed training recipe should be integrated into the main paper to ensure a clearer and more coherent presentation of the methods and results.

  2. The inference speed is a concern. Since Lumen generates only one heatmap per forward pass, I am wondering about the cost of evaluating dense visual perception tasks, as every class name needs to be fed into the model individually.

  3. I am wondering about the accuracy of predicting the task-specific tokens (e.g., [DET]/[SEG]) used as routers. If I rephrase the task prompts into other formats, can the model still determine the correct task type? More discussion and evidence should be provided.

Questions

Please refer to the weaknesses above.

Limitations

The authors have adequately addressed the limitations in the Appendix materials.

Author Response

Q1: The organization of the paper

A1: Sorry for the confusion caused by the paper organization; we will reorganize the paper following your suggestion in the revised version.

Q2: The inference speed of the model

A2: We have discussed the inference speed of our model in Appendix D (see L646~L654). We acknowledge that the repeated forward passes required to generate heatmaps for different categories increase the inference burden. Therefore, we parallelize the inference procedure with batch processing, and as discussed in Appendix D, the resulting inference time is lower than that of Griffon, another LMM-based method customized for object detection.
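For illustration, below is a minimal sketch of this batch-processing idea; the `generate_heatmap` interface, prompt template handling, and batch size are assumptions for illustration rather than our actual implementation.

```python
import torch

@torch.no_grad()
def detect_all_categories(model, image, class_names, template, batch_size=16):
    """Generate one heatmap per category by batching per-class prompts,
    instead of running one forward pass per class sequentially."""
    heatmaps = []
    for i in range(0, len(class_names), batch_size):
        chunk = class_names[i:i + batch_size]
        prompts = [template.format(description=name) for name in chunk]
        # The same image is paired with each prompt, and all prompts in the
        # chunk share a single batched forward pass.
        heatmaps.append(model.generate_heatmap([image] * len(chunk), prompts))
    return torch.cat(heatmaps, dim=0)  # (num_classes, H, W)
```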

Q3: The prediction accuracy of routing tokens

A3: As you suggested, we first modify the task prompts as below.

Original Instruction: "Please find the location of {description}. Respond with {format}."

Modified 1: "What are {description} in the image. Please generate their {format}."

Modified 2: "Generate {format} of {description} in the image."

Afterward, we randomly sample 100 examples from each of the five special-token-related tasks, resulting in a test set of 500 samples in total. Then, we prompt Lumen with the three types of instructions listed above and calculate the accuracy of the generated special tokens. The accuracy results are reported below:

| | Original | Modified 1 | Modified 2 |
|---|---|---|---|
| Acc. | 1.0 | 1.0 | 1.0 |

The results indicate that our model is not sensitive to changes in the instruction and generates the correct task-routing token.
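For reference, a minimal sketch of how this accuracy can be computed; the sample format and the `generate` interface are illustrative assumptions, not our actual evaluation code.

```python
def routing_token_accuracy(model, samples, instruction_template):
    """samples: (image, description, format_str, gt_token) tuples, where
    gt_token is the expected routing token such as [DET] or [SEG]."""
    correct = 0
    for image, description, format_str, gt_token in samples:
        prompt = instruction_template.format(description=description,
                                             format=format_str)
        output = model.generate(image, prompt)  # hypothetical interface
        correct += int(gt_token in output)
    return correct / len(samples)
```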

Comment

Thank the authors for the rebuttal. After carefully reading the authors' rebuttal, my comments have been clearly addressed. Specifically, the authors test the special-token prediction accuracy after modifying the template instructions into different formats, which was my major concern, as an MLLM may overfit to fixed instruction templates and lose generic instruction-following ability. The results in the rebuttal demonstrate robustness to diverse visual-task-oriented instructions, allowing the proposed Lumen to flexibly switch between conversation and visual perception functions using natural language commands.

I also read the other reviewers' comments as well as the authors' rebuttal information, for example, the comparison with other LMMs, the additional evaluation on more complex benchmarks, and the role of each module. In their rebuttals, the authors provide systematic analysis (e.g., the lack of order and semantics in visual task outputs), supporting evidence (e.g., the dense perception ability of GPT-4V), and comprehensive evaluations on more benchmarks (e.g., Objects365, ReferIt, ViPBench, etc.) to consolidate the value of their decoupling design in keeping LLM-like QA functionality while extending several dense visual perception abilities. In my opinion, the comments have been addressed. I also recommend the authors integrate the additional analysis and experimental results suggested by other reviewers into the revised version. Overall, I think the paper's novelty in decoupling vision task learning and its convincing performance across both vision and vision-language tasks merit acceptance. I will maintain my original score.

Comment

Dear reviewer#ecgp:

Thanks for your precious comments on our paper. We will incorporate the valuable suggestions from you and other reviewers into the revised version.

Official Review
Rating: 5
  1. This study proposes a new Large Multimodal Model architecture (Lumen) that prioritizes a vision-centric approach and addresses the limitations of current methods, which overlook the unique features of various visual tasks.

  2. It separates perception capability learning from task-specific decoders: a shared representation is first learned for multiple tasks and is then decoded by lightweight task-specific decoders, significantly simplifying task-specific decoding.

  3. Comprehensive experiments prove their method outperforms existing LMM-based approaches in various vision-centric tasks and VQA benchmarks and maintains crucial visual understanding and instruction following capabilities.

Strengths

This study demonstrates the authors' solid understanding of traditional perception tasks and their technical details, and seamlessly integrates them with the latest MLLMs. While preserving the instruction-following capability inherent to MLLMs, the model incorporates a robust perception module.

Weaknesses

  1. The most significant issue with Lumen is that it may impede the model's ability to scale up. Although it obtains stronger perception performance by solving various visual perception tasks with extra task decoders, this design may disrupt the emergence of a stronger capacity for visual reasoning and perception when the model is trained with a large amount of data. As supporting evidence, none of the top research teams developing Multimodal Large Language Models (MLLMs) meticulously design their visual decoders in this fashion.

  2. The novel techniques proposed in the article, such as the V-L Dense Aligner and Peak Point Selection, are commonplace in traditional perception tasks, mirroring the early fusion and language-guided query selection in [1]. This work appears to be an enhanced version of models like [2, 3], with the introduction of dense queries [4] to boost perceptual performance and the addition of more task decoders.

[1]. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

[2]. LISA: Reasoning Segmentation via Large Language Model

[3]. NExT-Chat: An LMM for Chat, Detection and Segmentation

[4]. Dense Distinct Query for End-to-End Object Detection

Questions

It appears that these perception tasks have merely been integrated into the MLLM via some fixed instruction formats. Are there any experiments or examples demonstrating new capabilities that the MLLM has gained, or capabilities that have been enhanced, by integrating these perception data, such as stronger fine-grained perception or reduced hallucinations?

Limitations

None

Author Response

Q1: The design of Lumen

A1: As discussed in Common Question Q1, it is nontrivial to extend an LMM with versatile vision-centric capabilities. We provide further evidence below to validate this claim:

  1. Compared with Griffon, which reformulates box coordinates into language answers, our Lumen surpasses it significantly even without meticulously curating high-quality conversations with the aid of the powerful GPT-4V as training resources, as demonstrated in Tab.1 of the main paper.
  2. In the attached PDF file for the rebuttal, we show that even the most powerful LMM, GPT-4V, cannot tackle versatile vision-centric tasks well.

Q2: Comparison with mentioned works

A2: There might be a misunderstanding. Our major contribution is using the heatmap as an effective intermediate representation to prevent the LMM from being trapped by the inductive biases of different vision-centric tasks (i.e., the formulations of boxes, masks, and keypoints), rather than the designs of specific modules. We discuss the benefits of our model in A1 to Reviewer kab5.

Besides, neither LISA nor NExT-Chat demonstrates dense visual perception abilities such as object detection and instance segmentation, while our Lumen can tackle these tasks thanks to the heatmap representation. Moreover, the behavior of our heatmap differs from the mentioned Dense Distinct Query (DDQ). DDQ selects dense object queries that are distinct from each other from the image feature map to enable a better one-to-one label assignment process for object detection. In comparison, our heatmap serves as a unified representation for versatile vision tasks and is optimized without involving any label assignment process (e.g., Hungarian matching), as discussed in Appendix C.4 (see L641~L645).

Q3: New capabilities enhanced by integrating perceptual data

A3: The integration of vision-centric data can promote fine-grained visual content comprehension in the VQA task, as demonstrated by comparing the "FP-C" metric in #2 and #3 of Tab.5; relevant explanations can be found in L281~L284 of the main paper. Besides, following your suggestion, we also test our Lumen and the baseline method (i.e., a reimplemented LLaVA-v1.5-7B) on HallusionBench [1], as shown below:

| | qAcc | fAcc | aAcc |
|---|---|---|---|
| Baseline | 12.7 | 19.1 | 47.0 |
| Lumen | 14.9 | 21.7 | 48.7 |

The results show that utilizing vision-centric data also helps our model reduce hallucinations.

  • [1] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Official Review
Rating: 5

The paper introduces Lumen, a novel architecture for Large Multimodal Models (LMMs) that enhances vision-centric capabilities by decoupling task-agnostic and task-specific learning stages, thereby improving performance across various vision tasks while maintaining general visual understanding. This architecture allows Lumen to achieve or surpass the performance of existing LMM-based approaches on a range of vision-centric tasks, including object detection, instance segmentation, and pose estimation, while preserving its ability to follow instructions and understand visual content.

Strengths

The paper suggests that the reformulation of visual tasks into a matching problem and the use of a heatmap for guiding decoding processes are effective strategies for enhancing LMMs.

The paper details comprehensive experimental results, ablation studies, and discussions on the model's generalization capabilities, demonstrating the effectiveness of Lumen in handling diverse visual tasks with minimal training efforts.

The proposed Lumen not only performs well on vision-centric tasks but also maintains general visual understanding and instruction-following capabilities.

Weaknesses

The proposed method does not show much stronger performance on VQA benchmarks than prior methods, as shown in Table 2.

How is the proposed method of decoupling task-agnostic and task-specific learning stages better than one-stage methods?

Questions

See Weaknesses above.

Limitations

These have been discussed in the Appendix.

Author Response

Q1: VQA performances compared to prior methods

A1: Since our major motivation lies in extending versatile vision-centric capabilities while keeping the general-purpose conversational capabilities of the LMM, we do not meticulously select the VQA data for training at present. Thanks for your suggestion; we will continue to improve the VQA capabilities of our model by upgrading the VQA training resources in future work.

Q2: Comparison between the decoupled learning strategy and the one-stage one

A2: Decoupling task-agnostic and task-specific learning allows the LMM to focus on learning fine-grained multimodal content comprehension, rather than being trapped in learning the diverse specialized decoding rules (e.g., the formulations of bounding boxes, instance masks, and keypoints) introduced by vision-centric tasks. In contrast, the one-stage learning paradigm requires the LMM to simultaneously learn these decoding rules, which lack semantics, and human conversations, which require semantic comprehension and logical reasoning; this is much more challenging than the decoupled paradigm proposed in our paper.

Comment

Thanks for the response from the authors, which has properly addressed my previous concerns. I tend to keep my original rating. The authors are encouraged to include the related discussions in the final version.

Comment

Dear reviewer#bt1Q:

Thanks for your precious comments on our paper. We will integrate the related discussions in the final version as you suggested.

Official Review
Rating: 5

The paper proposes Lumen, a new LMM architecture that splits perception learning into task-agnostic and task-specific stages. Lumen finely aligns vision-language concepts, creating a shared representation for all visual tasks. This is then decoded with lightweight task-specific decoders requiring minimal training. Lumen outperforms existing models on vision-centric and VQA benchmarks, maintaining strong visual understanding and instruction-following abilities.

Strengths

  1. The proposed Lumen model introduces an effective architecture that separates the learning of perception capabilities into task-agnostic and task-specific stages. This separation promotes fine-grained vision-language concept alignment, enabling Lumen to handle diverse vision-centric tasks more effectively.

  2. Lumen's ability to match or exceed the performance of existing approaches on a range of vision-centric tasks while maintaining general visual understanding and instruction-following capabilities demonstrates its potential for broad impact.

Weaknesses

  1. Limited Novelty of the Method. The main approach of the paper, which involves decoupling task-agnostic and task-specific learning stages and sending the [LOC] token output by the LLM into the Aligner as a guide, is relatively simple and lacks significant innovation. Similar concepts have already been explored in previous works such as Uni-Perceiver v2 and VisionLLM. The paper lacks more groundbreaking improvements.
  2. Insufficient Evaluation of Generalization to Unseen Tasks. The paper only demonstrates Lumen's generalization ability to the unseen object counting task, and only provides qualitative results. The limited evaluation is insufficient to showcase the model's ability to adapt to a broader range of unseen tasks, requiring more experiments for quantitative analysis. Moreover, a large portion of Table 1 is blank, which reduces the advantage of the proposed method compared to other baselines.
  3. Limited Benchmark Diversity. The experiments mainly focus on a few benchmarks such as COCO and RefCOCO, with the only unseen dataset being PASCAL VOC2007. Including a broader range of datasets, especially those involving different modalities and more complex tasks, would better demonstrate Lumen's capabilities in handling vision tasks.

Questions

  1. Although Lumen is proposed as an LMM Generalist, its object detection performance lags significantly behind that of Task-specific Specialists or Vision Generalists. Therefore, is there a gap between using [H] and [W] tokens in the Aligner for regressing bounding boxes and using an additional decoder independently? Can the effectiveness of the [LOC] token guidance be demonstrated in a more advanced DETR-like object detection decoder to improve detection performance?

  2. Can the authors provide a more intuitive and specific explanation of the decoding paths in the model, such as the method for selecting peak points and decoding bounding boxes?

Limitations

Yes.

Author Response

Q1: Comparison with UniPerceiver-v2 and VisionLLM

A1: There might be a misunderstanding. Our major contribution is using the heatmap as an effective intermediate representation to prevent the LMM from being trapped by the inductive biases of different vision-centric tasks (i.e., the formulations of boxes, masks, and keypoints), rather than the module designs themselves. The heatmap brings the following benefits that UniPerceiver-v2 and VisionLLM do not possess:

  1. Reflecting the dense activation of the open-ended user instruction, the heatmap can be seamlessly adapted to vision-centric tasks such as object detection and pose estimation with minimal training effort. In contrast, Uni-Perceiver-v2 involves complex box and mask proposal generation during its visual encoding stage, and VisionLLM uses a set of pre-defined query templates (100 in their implementation) like <cls><x1><y1><x2><y2> and conducts label assignment with the aid of Hungarian matching to handle the unordered nature of dense objects. Moreover, neither UniPerceiver-v2 nor VisionLLM supports point-level tasks like pose estimation.
  2. Since heatmaps hide the inductive biases of different vision tasks, the general-purpose conversational capability, which is the foundation of an LMM, is not affected when incorporating a large amount of vision-centric data during training, as shown in Tab.2 of the main paper. In comparison, neither UniPerceiver-v2 nor VisionLLM comprehensively evaluates its VQA abilities.

Q2: Further generalization ability evaluation and explanation to Tab.1

A2: As you suggested, we first provide the quantitative results of our model on the FSCD-LVIS[1] test set for the object counting task below:

| Model | MAE (↓) | RMSE (↓) |
|---|---|---|
| FamNet | 60.53 | 84.00 |
| Attention-RPN | 64.31 | 64.10 |
| Lumen (Ours) | 58.23 | 57.64 |

Besides, we also evaluate our model on another unseen task ViPBench[2] to validate our model's generalization ability. Please refer to A3 for more details.

As for Tab.1, the gray cells indicate that the corresponding method does not support those tasks, as addressed in the caption of Tab.1. Meanwhile, as discussed in Common Question Q1, it is NOT easy to extend an LMM to versatile vision-centric tasks. On the other hand, by comparing the performance of our Lumen on each task with specialists and generalists along the same column, the capability of our model in solving versatile tasks can be validated.

Q3: Further evaluation on more diversed benchmarks

A3: Thanks for your advice. To demonstrate our model's capabilities in handling vision tasks, we further evaluate it on more vision-centric benchmarks, including the Objects365-v1 val set for object detection, the AIC [4] val set for pose estimation, and the ReferIt [3] test set for visual grounding, as shown in the table below:

| Method | Objects365 | AIC | ReferIt |
|---|---|---|---|
| Faster-RCNN (OD) | 19.6 | - | - |
| HRNet (Pose) | - | 32.3 | - |
| TransVG (VG) | - | - | 70.7 |
| Lumen (Ours) | 20.4 | 30.1 | 79.0 |

where "OD" and "VG" are short for object detection and visual grounding, respectively, representing the type of the specilist model. Since most of existing generalists (e.g, UniPerceiver-v2, mPLUG-v2, Griffon, etc.) only provide detection and grounding results on COCO and RefCOCO(+/g) benchmarks, respectively, we can only list available performances of specialists for comparison in the above table due to time limit. The results indicates that our model can also handle vision tasks on more diversed benchmarks.

Besides, we also evaluate our model on another challenging task, ViPBench [2], which requires the model to densely comprehend instances highlighted by visual prompts before answering the questions.

| Method | Synthesized visual prompts | Visual prompts from human |
|---|---|---|
| InstructBLIP | 31.7 | 33.3 |
| Qwen-VL | 39.2 | 41.7 |
| LLaVA-v1.5 | 41.6 | 40.2 |
| Lumen (Ours) | 44.5 | 45.2 |

As the table above shows, our model significantly outperforms other methods due to its enhanced dense visual perception capabilities.

  • [1] Few-shot Object Counting and Detection
  • [2] ViP-LLaVA:Making Large Multimodal Models Understand Arbitrary Visual Prompts
  • [3] ReferItGame: Referring to Objects in Photographs of Natural Scenes
  • [4] AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

Q4: Discussions on the DETR-like decoder

A4: We have discussed the DETR-like decoder selection in Appendix C.4 (see L641~L645), where we acknowledged that a DETR-like decoder can improve object detection performance because of its Hungarian matching-based label assignment. However, since our main contribution lies in extending versatile vision-centric capabilities of the LMM with less training effort while keeping its general-purpose conversational capability, rather than specifically optimizing the object detection task, we do not adopt a DETR-like decoder in our model.

Q5: Explanation to decoding paths

A5: Regarding the peak point selection operation, the basic workflow is described in Sec 3.2 (see L186~L190) of the main paper. In our implementation, we use torch.nn.MaxPool2d with a kernel size of 3 and a stride of 1 to efficiently select peak points.
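For illustration, below is a minimal sketch of such a max-pooling-based peak selection; the padding, threshold value, and tensor shapes are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

def select_peak_points(heatmap: torch.Tensor, threshold: float = 0.3):
    """heatmap: (1, 1, H, W) activation map; returns peak (y, x) coordinates."""
    pooled = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)(heatmap)
    # A pixel is a peak if it equals its 3x3 local maximum and exceeds the threshold.
    peaks = (heatmap == pooled) & (heatmap > threshold)
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    return torch.stack([ys, xs], dim=1)  # (num_peaks, 2)
```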

Regarding the lightweight box decoder, we have illustrated its workflow in Fig.5 of Appendix A.1 and provided detailed descriptions in L555~L564.

Comment

Thanks for the detailed response. The authors addressed some of the raised points and added additional experiments supporting their approach. Based on their feedback I will increase my score and look forward to reading the revision at a future venue.

Comment

Dear reviewer#kab5:

Thanks for your precious comments on our paper. We will incorporate the valuable points from our discussions into the revised version.

Official Review
Rating: 5

The paper proposes to augment the output of a Large Multimodal Model (LMM) with three tasks, namely detection, segmentation, and pose estimation. To that end, Lumen first predicts heatmaps conditioned on the encoded input image and instruction. This heatmap is then used by task-specific decoders to predict grounded bounding boxes, segmentation, and poses.

Strengths

  • The analysis of pre-training tasks can be insightful and useful for the community. In general, the ablation section can be useful to the community, though it has many weaknesses (see below).
  • Augmenting the output of an LLM with dense vision outputs is a useful research direction to explore

Weaknesses

  • Some claims are not well-supported (L74-80)
    • (1): it is not clear what it means to “unleash the vision-centric potential of [an] LLM…”. The method still uses the “V-L Dense Aligner” and trained box and semseg mask decoders to make the prediction. The value of an LLM in the proposed approach is not ablated properly. While Tab. 3 c) gives a hint towards that question, one improvement (1.0->1.0*) is actually a better vision encoder, and the other (1.0*->1.5) might be explained with more instruction-tuning data.
    • (2): “Adapt to tasks …. w/o requiring specialized datasets”
      • Fig. 3 suggests that specialized datasets for each task are used.
    • (3): “exceeds existing llm-based approaches.”
      • Most “LMM generalists” just don’t do most of the tasks
      • Differences with [12, 59] are not discussed to make sense of the results. What exactly differs there? Why these are reasonable baselines, e.g., [59] is a diffusion-based one.
      • Similarly, comparisons in Tab. 6 lack contextualization, e.g. [13, 59]. What makes the proposed model generalize to an unseen dataset? What is missing from the baseline from [41] that benefits from a similar heatmap representation?
  • The ablation section lacks details and proper discussion, albeit being one of the most useful sections of the paper:
    • Training “phases” are first mentioned in L267 but never described properly. L267 suggests 1st phase is task-agnostic and 2nd is, presumably, task-specific. However, A.3 suggests that it’s only about training data sampling.
    • The implication in L290 for Tab. 4 a) is unclear. What does it mean to have “complete vision-language interaction”, and why does the convolutional architecture not have it (or cannot have it)?
    • Similarly, the conclusion in L296 is unclear. Is the proposed framework not already improving the “dense vision-language alignment”?
    • In Tab. 5, it is not clear why FP-C for 2 is dropped compared to 3. According to A.3, all datasets are present during this stage. Why do the authors think separate stages are needed to achieve better performance? Is it longer training?
  • Minor: There should be a justification why these 3 tasks are “fundamental.”

Questions

  • L267 “...10.000 iterations in the first phase…”: it is not clear how it compares to the previously obtained results.
  • Why are the RC SAM results in Tab. 4 b) better than in Tab. 1?
  • How are predictions for Fig. 4 obtained? Is it done by counting the peak points, or does an LLM make the prediction? This should be stated clearly to avoid confusions.

Limitations

Limitations of the work are not discussed. For example, what are the limitations of using the heatmap as a “powerful intermediate representation” (L313)?

Author Response

Q1-1: The value of the LLM

A1-1: In the workflow of Lumen, the LLM plays a crucial role in jointly comprehending the open-ended instruction and the visual content (which is the LLM's strength), and its understanding of the inputs is further condensed into the hidden state of the [LOC] token. The V-L dense aligner and the task-specific decoders are responsible for efficiently translating this high-level information into the heatmap space and the task output space. Experimentally, the LLM's rich knowledge and strong understanding of multimodal inputs are important for tackling vision-centric tasks, as verified below:

  1. The comparison between 1.0* and 1.5 in Tab.3c that you mentioned shows that the enhanced comprehension ability of the LLM promotes vision-centric task performance.
  2. As shown in Tab.6, the LLM also helps our model generalize well to unseen categories, where the general-purpose content comprehension capability of the LMM plays a crucial role.

Q1-2: The meaning of "specialized datasets"

A1-2: By "specialized datasets" we refer to the dataset meticulously curated in Griffon (L78), which uses GPT-4V as a data engine to generate dialogues customized for object detection. In contrast, we only sample questions and answers from pre-defined templates, as shown in Fig.3, which requires much less effort.

Q1-3: Comparison with existing LMMs

A1-3: For clarity, we respond to your concerns below:

  1. Existing LMM generalists can NOT be easily extended to versatile vision-centric tasks. Please refer to Common Questions Q1 for more details.
  2. We use Griffon and InstructCV as comparable baselines because they share similar motivations with us in extending vision-centric capabilities for foundation models in an instruction-following manner. To achieve this goal, they focus on curating specialized datasets with the aid of an external data engine (i.e., a powerful LLM). In contrast, we hide the inductive bias of vision tasks with the proposed decoupled learning paradigm while tuning the LLM.
  3. We use InstructCV and Pix2Seq-v2 as comparable baselines in Tab.6 because they are also generalist models similar to our Lumen. As discussed in A1-1, the multimodal comprehension capability of the LLM facilitates generalization, and therefore the specialist model (i.e., [41] that you mentioned) cannot generalize well even with a heatmap representation.

Q2-1: Explanation to the training phases

A2-1: There might be a misunderstanding. Both phases belong to the task-agnostic stage, as addressed in Appendix A.3 (see L582). In phase 1, we focus on promoting the dense instance perception ability by using vision-centric data only. In phase 2, we further add VQA data to reconcile the learned dense perception ability with the general-purpose conversational ability. Since most parts of task-specific decoding do not require training, we only train the lightweight box decoder on COCO after phases 1 and 2. Sorry for the confusion; we will clarify this in the revised version.

Q2-2: Explanation to the "complete vision-language interaction"

A2-2: We use Tab.3a to show that comprehensive V-L interaction benefits heatmap generation. Since the V-L dense aligner is not our major contribution, we implement it with a lightweight transformer, although stacked convolutions can also promote such interaction.

Q2-3: Explanation to the conclusion in L296

A2-3: We use Tab.3b to show that the pretrained mask decoder is not the major factor influencing the model performance. Combining this with the results of Tab.3a, where performance decreases evidently due to heatmap degradation, we conclude that the heatmap influences the model performance more strongly than the mask decoder.

Q2-4: Explanation to "FP-C" changes and separate stages

A2-4: Referring to the role of phase 1 stated in A2-1, "FP-C" drops without phase 1 because of the decreased dense instance perception ability. The motivation for separate-phase training has also been addressed in A2-1.

Q3: Explanation to the "fundamental"

A3: These three tasks provide scene parsing capabilities at different granularities (i.e., object-level, part-level, and pixel-level), which can serve as the foundation for downstream tasks (e.g., human tracking, robot planning, etc.). Thus, these three tasks are widely explored in previous generalists such as Pix2Seq-v2, InstructDiffusion, etc.

Q4: Effects of training less iterations

A4: We have discussed the effects of training iterations in Tab.9d of Appendix C.4 (see L636~L645), which indicates that the results of training for 50,000 steps are better than those for 10,000 steps.

Q5: Explanation to the results comparison between Tab.3b (maybe a wrong table ID here, as Tab.4 has no subtables) and Tab.1

A5: The "Refer Seg." results in Tab.1 are evaluated on RefCOCOg, as mentioned in L257~L258, and thus correspond to "RCg" in Tab.3b rather than "RC". Since we reduce the training iterations for efficient ablations, as mentioned in L267~L268, the SAM RCg result in Tab.3b is lower than the one in Tab.1.

Q6: Explanation to Fig.4

A6: We first render the peak points selected from the heatmaps onto the image, and then feed this image to Lumen again with the instruction "How many green points in the image?"; the resulting answers are used as the prediction results. We will clarify this in the revised version.
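For illustration, a hedged sketch of this render-and-ask procedure is given below; the drawing parameters and the `answer` interface are assumptions rather than our actual code.

```python
from PIL import Image, ImageDraw

def count_by_rendering(model, image: Image.Image, peak_points, radius: int = 4):
    """Draw the selected peak points in green, then ask the model to count them."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    for y, x in peak_points:  # (y, x) pixel coordinates from the heatmap
        draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                     fill=(0, 255, 0))
    return model.answer(canvas, "How many green points in the image?")
```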

Q7: Discussion of limitations

A7: We have discussed the limitations of heatmaps in Appendix D (see L646~L654): heatmaps for different categories must be generated by separate forward passes, which can affect inference speed when evaluating vision-centric tasks like object detection. Therefore, we accelerate this process with batched inference.

Comment

I thank the authors for their response, which addresses some of my questions. I also agree that augmenting the output of VLMs with other modalities/tasks such as segmentation or detection is a valuable research direction. However, I believe the rebuttal does not address my main concerns: it does not show specific (experimental) evidence for the value of using LLMs in the pipeline and does not establish a clear and comparable experimental setting between different "instruction tuning" methods. Overall, this limits the value of the work beyond providing a useful model artifact; however, based on the other reviewers' comments there seems to be significant interest from the community. For that reason, I will increase my score. Below, I provide more specific comments.

A1-1:

Experimentally, the LLM's rich knowledge and strong understanding of multimodal inputs are important for tackling vision-centric tasks as verified below.

The comparison lacks a non-LLM-based baseline to ablate the importance of having an LLM. The improvement in 1.0* -> 1.5 can still be explained by other factors (e.g., large instruction tuning dataset.) Generalization to unseen categories is a language property that can be achieved by other means, e.g., using CLIP embeddings.

A1-2: I agree that the curation process might be less cumbersome, but the dataset is still specific for each task; that is, given a new task, one would need to define a new specialized dataset.

A1-3: It is still not clear how the claim "exceeds existing llm-based approaches" is supported in the absence of comparisons. It is still not clear what exact statement is made by this comparison. There are many differences between the compared models, which does not allow one to conclude whether it is the proposed approach that makes the difference. For example, did the authors equate the data? "... the multimodal comprehension capability of LLM facilitates generalization ...": this appears to be a hypothesis that lacks specific experimental evidence.

A2-2: It is not clear then what claim is made with this experiment. The result doesn't support the claim made in L290.

A2-4: It is still not clear why phase 1 is necessary given that the vision-centric data is also present during phase 2. As per A2-1, the difference between phases 1 and 2 is the addition of VQA data.

Comment

Dear reviewer #AJBg,

We sincerely appreciate your time and valuable feedback on our paper. Regarding the raised concerns, we would like to discuss them further with you below:

Q1-1: Experiments to prove the value of the LLM

A1-1: Regarding your concerns about whether the experimental results in the paper (i.e., Tab.3c and Tab.6) can prove the value of the LLM, we offer further explanation below:

  1. We agree that the differences between v1.0* and v1.5 in Tab.3c primarily lie in the advanced VQA data curated by LLaVA-v1.5. By training with these data, the LMM obtains a more advanced multimodal content comprehension ability. The enhanced object detection performance of v1.5 compared with v1.0* in Tab.3c shows that the LMM's multimodal comprehension is a general capability that also benefits vision-centric tasks. Therefore, we use the comparison between v1.0* and v1.5 in Tab.3c to verify the value of the LMM on vision-centric tasks. Meanwhile, a similar phenomenon has also been observed in [1], where an LMM with enhanced multimodal comprehension ability also performs better on the visual grounding task.
  • [1] Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic
  2. When evaluating on unseen categories, the two compared generalist models can serve as the baseline methods you suggested. Specifically, Pix2Seq-v2 is a vision generalist that does not involve any LLM, which can serve as the non-LLM baseline you suggested. Although InstructCV leverages an LLM to generate instruction data for training a text-to-image diffusion model, its generalization ability is mainly inherited from the CLIP-based image and text encoders of the diffusion model, so it can serve as the baseline that uses CLIP embeddings for language comprehension, as you suggested. Therefore, by comparing with these two generalists, the significantly enhanced generalization performance of our Lumen in Tab.6 proves the value of the LLM in terms of generalization.

Q1-2: The narration of "specialized datasets"

A1-2: Thanks for your detailed suggestion. We will rephrase the "specialized datasets" as "meticulously curated specialized datasets" in the revised version.

Q1-3: Further explanations on comparison with existing LMMs and generalization facilitated by LMM

A1-3: Firstly, strictly aligning with other LMMs in all aspects is currently intractable, as different LMMs are trained with their respective data and training recipes (please refer to the data composition and strategies of training Qwen-VL and LLaVA for more detailed comparison). Secondly, all training data we used are publicly available and also used in the LMMs we compared, such as RefCOCO, RefCOCO+, and RefCOCOg, which are utilized in both Griffon and Shikra. Besides, for tasks that existing LMMs do not support and cannot be seamlessly extended to, as discussed in our initial rebuttal, we are unable to apply the corresponding datasets to the LMMs we compared. Overall, our Lumen not only achieves better or comparable performance on tasks that existing LMMs can handle, but also extends to novel vision-centric tasks.

As for the proof of the generalization facilitated by the LMM, the generalization evaluation in Tab.6 supports our claim, as discussed in point 2 of A1-1 in this reply.

Q2-2: Explanation to claim in L290

A2-2: In Tab.3a, we implement the "Conv." and "Trans." architectures with simple standard convolution and transformer layers, respectively. Since the transformer has a global receptive field while convolution only has a local one, the "Trans." architecture facilitates denser cross-modal interaction (every element can interact with every other element in the attention process) than the "Conv." one. Therefore, the better object detection performance achieved by "Trans." indicates that denser vision-language interaction facilitates better dense vision-language alignment. We will rephrase "complete vision-language interaction" in L290 to avoid confusion in the revised version.

Q2-4: Explanation to phase 1

A2-4: It is worth noting that (1) in order to balance the vision-centric and general-purpose conversational (GPC for short) abilities, the ratio of VQA data to vision-centric data should be set to 2:1; in our current schedule, we train phase 2 for 10k steps with this data ratio; and (2) learning the dense instance perception (DIP for short) ability requires even longer training, which is 50k steps in the current phase 1. If phases 1 and 2 were merged and the above conditions were met, we would need to train for at least 50k * 3 = 150k steps to ensure both abilities are well learned, a training cost significantly larger than our current schedule (10k + 50k = 60k steps). Having an independent phase 1 without VQA data lets the model focus on learning the DIP ability without affecting the GPC ability, and the learned ability can then be maintained in phase 2 with fewer steps.

Author Response

We sincerely appreciate all reviewers for their efforts and valuable comments. We first address some common questions here, and then respond to each reviewer separately. We hope our responses can clarify the concerns of reviewers.

Common Questions

Q1: Comparison with LMM generalists

A1: Compared with LMM generalists, Lumen accomplishes versatile dense visual perception tasks without compromising general-purpose conversational capabilities. This is NOT easy, because directly adapting the outputs of various vision-centric tasks into language-oriented conversations to tune the LLM faces great challenges due to the inherent lack of order and semantics in vision-centric task outputs:

  1. Lack of order: Vision-centric tasks like object detection and instance segmentation require the model to densely generate boxes or masks for all instances of the same category. These boxes or masks are inherently unordered, making it challenging to reformulate them into sentences without confusing the model, as discussed in L38~L41 of the main paper.
  2. Lack of semantics: The output formats of versatile vision-centric tasks (i.e., boxes, masks, and points) lack high-level semantics comparable to text words. Therefore, integrating conversation data reformulated from vision tasks can hurt the general-purpose conversational capability of the LLM.

To support the above claims, we also provide experimental comparisons as evidence in A1 to Reviewer 1g9i and in the attached PDF file. We will further clarify the above statements in the revised version.

Final Decision

The submission aims to train multimodal models using a more "vision-centric" approach and proposes a method for it. After considering the submission and the rebuttal, all of the reviewers were positive about the submission and recommended acceptance. The AC agrees. However, several of the reviewers raised important questions and points of clarification on the writing, claims, and methodology (e.g., see the detailed comments by AJBg). These are important to address in the final version.

An additional piece of feedback that could improve the quality of the final version: like some of the reviewers, the AC also found the writing challenging, in particular the use of phrases that are too ambiguous (a copied example is below) or more grandiose than they need to be. Also, the title appears quite general and does not prepare the reader for the actual technical content of the paper. The authors are strongly encouraged to consider these points when revising their writing for the final version.

"it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities."