Vision-Language Instruction-enhanced Tuning via Parameter-efficient Learning
Abstract
Reviews and Discussion
This paper proposes a new method called VITAL for efficient fine-tuning of large vision-language models. VITAL first enables lightweight model training using only 2% of the parameters through automatic mode approximation. More importantly, VITAL enhances instruction semantics from two perspectives: 1) aggregating more context via an enhanced instruction mixture to aid multimodal fusion, and 2) strengthening the connection between the proposed parameter-efficient tuning method and mutual information through a score-based information bottleneck. Experiments on six multimodal benchmarks show VITAL outperforms state-of-the-art approaches in most cases, even surpassing full fine-tuning performance.
Strengths
- The proposed method achieves impressive performance improvements while using only 2% of the model parameters, making large models much more accessible.
- The techniques for leveraging instruction information during efficient fine-tuning are novel.
- Enhancing instruction semantics is an important contribution for improving multimodal alignment.
- Strong results on multiple datasets highlight the method's generalization capability.
Weaknesses
- The proposed mode approximation method lacks implementation details.
- It is unclear how diverse the enhanced instructions are and whether they cover the key aspects sufficiently.
- Additional ablation studies on the different components could further validate their contributions.
Questions
How do you generate the enhanced instructions? Are they diverse, and do they cover multiple perspectives?
Have you conducted ablation studies to validate the individual contribution of each proposed component?
How does VITAL specifically compare against other efficient tuning techniques? Where does it have advantages?
This paper proposes a visual-language instruction tuning framework. The key novelty of this paper is the use of enhanced instructions to extract visual features from the image. Specifically, multiple instructions are fed to the Q-Former of InstructBLIP to extract visual features, which are then fused using a weighted sum. Some experiments are conducted to showcase the effectiveness of the proposed method.
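For concreteness, below is a minimal PyTorch sketch of the weighted-sum fusion described above. It is only an illustration of the general mechanism: `ToyQFormer`, the shapes, and the fusion weights are hypothetical stand-ins, not the authors' code or InstructBLIP's actual API.

```python
# Minimal sketch only: a toy stand-in for the weighted-sum fusion described above.
# The real model would use InstructBLIP's pretrained Q-Former; `ToyQFormer`, the
# shapes, and the fusion weights here are hypothetical placeholders.
import torch
import torch.nn as nn


class ToyQFormer(nn.Module):
    """Stand-in for a weight-shared Q-Former that conditions its queries on an instruction."""

    def __init__(self, dim: int = 768, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor, instr_embeds: torch.Tensor) -> torch.Tensor:
        # Condition the shared queries on the instruction, then cross-attend to the image.
        q = self.queries.unsqueeze(0) + instr_embeds.mean(dim=1, keepdim=True)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out  # (batch, num_queries, dim)


def instruction_mixture(q_former: ToyQFormer,
                        image_feats: torch.Tensor,
                        instr_embeds_list: list,
                        fusion_logits: torch.Tensor) -> torch.Tensor:
    """Run the same Q-Former once per instruction and fuse the outputs by weighted sum."""
    per_instr = torch.stack([q_former(image_feats, e) for e in instr_embeds_list], dim=0)
    weights = torch.softmax(fusion_logits, dim=0)            # (k,) mixture weights
    return torch.einsum("k,kbqd->bqd", weights, per_instr)   # weighted sum over k instructions


# Toy usage: k = 3 instructions for a batch of 2 images.
image_feats = torch.randn(2, 196, 768)                        # e.g. ViT patch features
instr_embeds_list = [torch.randn(2, 12, 768) for _ in range(3)]
fusion_logits = nn.Parameter(torch.zeros(3))                  # learnable fusion weights
fused = instruction_mixture(ToyQFormer(), image_feats, instr_embeds_list, fusion_logits)
print(fused.shape)  # torch.Size([2, 32, 768])
```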
Strengths
The use of multiple instructions for visual feature extraction may be useful.
Weaknesses
- The novelty of this paper is limited, since extracting visual features conditioned on different instruction inputs was originally proposed by InstructBLIP, and the Q-Former was also proposed and pretrained in previous work (BLIP-2).
- The use of multiple instructions to extract visual features does not make much sense to me, since the original purpose of InstructBLIP is to extract visual features adaptively according to DIFFERENT instructions. If the authors wish to retain all the visual features, why not directly feed all visual tokens from CLIP to the LLM, as in LLaVA? (A rough sketch of the two designs follows this list.)
- The experiments seem unconvincing, since the performance gain is marginal, as shown in Table 3.
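To make the comparison in the second point concrete, here is a rough sketch of the two designs under assumed, hypothetical shapes: a LLaVA-style linear projector forwards every CLIP patch token to the LLM, while a Q-Former compresses the image into a small, fixed set of query tokens.

```python
# Illustrative sketch only, with assumed shapes: a LLaVA-style projector forwards every
# CLIP patch token to the LLM, while a Q-Former-style module compresses the image into a
# small fixed set of query tokens. Module names and dimensions are hypothetical.
import torch
import torch.nn as nn

patch_feats = torch.randn(1, 576, 1024)    # e.g. CLIP ViT-L/14 patch tokens at 336px

# LLaVA-style: keep all 576 visual tokens, only map them into the LLM hidden size.
projector = nn.Linear(1024, 4096)
llava_tokens = projector(patch_feats)       # (1, 576, 4096), concatenated with text tokens

# Q-Former-style: cross-attend a few learned queries to the patches (instruction-conditioned
# in InstructBLIP), so the LLM only ever sees 32 visual tokens.
queries = nn.Parameter(torch.randn(1, 32, 1024) * 0.02)
cross_attn = nn.MultiheadAttention(1024, num_heads=8, batch_first=True)
qformer_tokens, _ = cross_attn(queries, patch_feats, patch_feats)   # (1, 32, 1024)

print(llava_tokens.shape, qformer_tokens.shape)
```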
Questions
Please refer to the weaknesses.
This paper suggests a parameter-efficient fine-tuning (PEFT) method named VITAL for multimodal instruction-tuning tasks. VITAL combines a CP decomposition-based LoRA and instruction-based visual feature fusion in a weight-sharing manner. VITAL conducts experiments on image captioning and QA tasks with a Q-Former-based sequential model architecture. Experimental results show improved performance over full-parameter fine-tuning and some other PEFT methods.
Strengths
This paper is generally well-written and easy to follow. Experimental results show improved performance over full-parameter fine-tuning and some other PEFT methods.
Weaknesses
- Novelty
- VITAL is a combination of existing LoRA and weight-sharing methods. The only new mechanism is fusing the features output by weight-shared Q-Formers that take repeated visual features and different instructions as inputs, which is limited and does not constitute a strong contribution.
- Method
- Why CP decomposition-based LoRA is used instead of standard LoRA is unclear. Could the authors provide some insight into this choice? (An illustrative sketch of the distinction is given after the reference list below.)
- The instruction mixture requires multiple forward passes through the Q-Former, which introduces extra computational cost and latency compared with other standard PEFT methods. It is unclear to what extent this mechanism affects the overall computational cost and latency/throughput of the model.
- VITAL is only applicable to Q-Former-based models, e.g., BLIP-2 [1]-based ones. However, more recent and popular multimodal LLMs such as LLaVA and MiniGPT-4 v2 do not use a Q-Former, which limits the scope of application.
- Experiments
- Some baselines are missing. The comparison with other PEFT methods is insufficient: additional baselines, such as prompt-tuning-based methods, are mentioned in the related work but not compared against.
- Some ablation studies are missing.
- Comparisons with simpler strategies for the instruction mixture are needed, e.g., averaging the different Q-Former outputs without weighting, or selecting the most important outputs and discarding the others according to some significance metric and ranking.
- Experiments with different numbers of augmented instructions (k) are needed. It would be better to see an illustration of the trade-off between model performance and k.
- Experiments on modules other than the Q-Former (e.g., the ViT and the LLM) or on other model architectures (e.g., LLaVA [2] and MiniGPT-4 [3]) are needed to verify the effectiveness of VITAL.
- The extra computational cost and latency/throughput introduced by the instruction mixture should be reported to make a fair comparison with other PEFT methods.
References
[1] BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
[2] Visual instruction tuning.
[3] MiniGPT-4: Enhancing vision-language understanding with advanced large language models.
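As referenced in the Method point above, here is an illustrative sketch of the distinction in question. It assumes that "CP decomposition-based LoRA" shares a pair of global factor matrices across layers with a small per-layer coefficient vector; the paper's actual parameterization may differ, and all names, shapes, and ranks below are hypothetical.

```python
# Rough illustration of one possible motivation for a CP-style factorization: the factor
# matrices U and V are shared across all adapted layers, with only a small per-layer
# coefficient vector, whereas standard LoRA keeps an independent (B_l, A_l) pair per layer.
# All names, shapes, and ranks below are hypothetical.
import torch

d_out, d_in, rank, num_layers = 768, 768, 8, 24

# Standard LoRA: delta W_l = B_l @ A_l, with separate factors for every layer.
lora_params = num_layers * (d_out * rank + rank * d_in)

# CP-style sharing: delta W_l[o, i] = sum_r U[o, r] * P[l, r] * V[i, r].
U = torch.randn(d_out, rank)          # shared output-side factor
V = torch.randn(d_in, rank)           # shared input-side factor
P = torch.randn(num_layers, rank)     # per-layer mixing coefficients

def cp_delta(layer: int) -> torch.Tensor:
    return torch.einsum("or,r,ir->oi", U, P[layer], V)

print(f"per-layer LoRA params: {lora_params:,}")                        # 294,912
print(f"CP-shared params:      {U.numel() + V.numel() + P.numel():,}")  # 12,480
print(cp_delta(0).shape)                                                # torch.Size([768, 768])
```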
Questions
See Weaknesses.
This paper proposes an instruction-enhanced VLM fine-tuning method. Aiming at parameter-efficient learning, it introduces an instruction-enhanced module that adopts an enhanced instruction mixture produced by a Q-Former (Figure 2). Results show that the proposed method achieves better performance than LoRA and Adapter.
Strengths
- Important topic: fine-tuning large VLMs is expensive, so the efficient fine-tuning studied in this paper is an important topic.
- Intuitive idea: the key idea behind the enhanced instructions is that a mixture of instructions (instead of the widely used single instruction) is needed to better capture the diverse information in the vision modality. This idea is intuitive and sound.
- Promising results: by fine-tuning only 2% of the parameters, the method achieves good results and outperforms LoRA and Adapter.
Weaknesses
- Incomplete metrics: this paper mostly focuses on #parameters, but #parameters should not be the only metric; another important metric is the actual fine-tuning cost (e.g., GPU/CPU run time). (A sketch of how both metrics could be reported follows this list.)
- Relatively weak baselines: it mostly compares to LoRA and Adapter, rather than the most recent methods such as UniAdapter/VL-Adapter/MAPLE mentioned in the introduction.
- Writing and experiments can be improved. The enhanced instruction mixture is the key of this paper, but the paper contains only two short descriptions explaining it. It would be nice to provide more insights and studies, such as: (1) if one instruction is not enough, could you study the impact of different numbers of instructions? (2) if the instruction mixture is critical, could you study different potential mixture methods?
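As referenced in the first weakness, here is a minimal sketch of how the two metrics could be reported side by side: the trainable-parameter ratio and the wall-clock training cost. The model, data, and step count are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch only: reporting both metrics side by side, the trainable-parameter ratio
# and the actual wall-clock training cost. The model, data, and step count are hypothetical
# placeholders; on GPU, torch.cuda.synchronize() should be called before reading the timer.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.Linear(768, 768))   # placeholder model
for p in model[0].parameters():
    p.requires_grad = False   # pretend layer 0 is the frozen backbone, layer 1 the PEFT module

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable ratio: {100 * trainable / total:.1f}%")

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
inputs, targets = torch.randn(32, 768), torch.randn(32, 768)

start = time.perf_counter()
for _ in range(100):          # time a fixed number of optimization steps
    loss = nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
elapsed = time.perf_counter() - start
print(f"avg step time: {1000 * elapsed / 100:.2f} ms")
```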
Questions
A few questions are listed in the Weaknesses section. In addition:
- Could you provide actual fine-tuning cost comparison (e.g., GPU/CPU runtime)?
- Could you provide more ablation studies on #instructions, and different ways to generate instruction mixtures?
- Could you compare to one or two more recent efficient fine-tuning methods?
2x R and 2x BR. This paper proposes instruction-enhanced VLM fine-tuning for efficient representation learning. All the reviewers lean toward not accepting this submission due to (1) limited novelty and (2) unconvincing experiments (out-of-date baselines, marginal performance gains, incomplete ablation studies, and unclear implementation details). No rebuttal was submitted for clarification.
Why not a higher score
N/A
Why not a lower score
N/A
Reject