PaperHub · ICLR 2024

Decision: Rejected (4 reviewers)
Ratings: 6, 5, 5, 6 (average 5.5/10; min 5, max 6, std 0.5). Average confidence: 3.8

Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Links: OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11

Keywords: Instruction Finetuning, Multimodal, Large Language Model

Reviews and Discussion

Official Review (Rating: 6)

The paper collects noisy and high-quality instruction-following data to enhance visual instruction tuning for text understanding in images. Their model LLaVAR incorporates this new data and improves performance on text VQA and instruction following for text-rich images. The enhanced capability allows more natural interaction based on real-world online content combining text and images.

Strengths

  1. The paper focuses on an important problem of improving OCR ability for multimodal LLMs like LLaVA.
  2. It identifies training data and image resolution as the key factors for improving model performance. To address this, the paper collects over 400k specialized training examples to enhance OCR capabilities.
  3. Extensive experiments verify the effectiveness of the proposed training data.

Weaknesses

  1. The conclusions are a bit obvious: higher-resolution inputs and more specialized training data improve LLaVA's OCR performance.
  2. The most important contribution of the paper is the collected dataset. It succeeds in showing the data improves LLaVA's OCR capabilities, but does not demonstrate it is superior to other visual instruction datasets. For example, mPLUG-Owl has comparable OCR performance to LLaVAR under the same resolution in Table 2. This raises the question of whether OCR-specific data is needed, or if the scale of data in the paper is insufficient.
  3. The evaluation is limited, mostly relying on 4 OCR QA datasets. As the authors admit in Fig 4(5), this evaluation may be unreliable. More scenarios like the LLaVA benchmark would be expected, especially in ablation studies.

Questions

  1. Why did the authors collect data based on LAION, rather than some well-annotated OCR dataset?
Comment

Thank you for recognizing the strengths of our approach and the efforts we put into enhancing OCR capabilities and data collection for multimodal language models.

For comparison with baselines/methodology novelty/evaluation metric/evaluation scenarios, please refer to the general response.

Whether OCR-specific data is needed:

Compared to the hundreds of millions of text-image pairs used in mPLUG-Owl, we acknowledge that the scale of our data is relatively limited, which also makes it relatively affordable for most academic labs. We presume that training on large-scale non-OCR data improves OCR performance, as many captions in the LAION dataset are equivalent to incomplete OCR results (text in an online image sometimes also appears in its caption). At the scale of our experiments, we observe a similar effect: training only on captions of text-rich images already helps text recognition (in Table 3, variant (4) is better than variant (1)). However, training on captions only (variant (4)) is not as good as training on OCR-based data (variants (2)(6)), at least at the scale of our experiments. We assume such a training signal can be strong enough at the scale of mPLUG-Owl. In general, we believe OCR-specific data is necessary for data efficiency.

Ablation study on the LLaVA benchmark:

| Model | conversation | detail | complex |
| --- | --- | --- | --- |
| LLaVA | 83.6 | 78.1 | 95.2 |
| LLaVA + $R_{pretrain}$ | 86.7 | 79.3 | 95.1 |
| LLaVA + $R_{finetune}$ | 79.0 | 79.4 | 98.0 |
| LLaVAR | 84.4 | 78.9 | 96.6 |

We find that including the pretraining data improves the conversation capability, probably because longer training data leads to generating longer responses (Table 1). On the other hand, including only the finetuning data hurts the conversation capability but increases complex reasoning. Combining pretraining and finetuning data improves the trade-off between conversation and complex reasoning. Generally speaking, GPT-4-based evaluation is not very robust, as there is clear evidence that it favors long responses [1]. By providing results on the LLaVA benchmark, we show that incorporating our data will at least not harm the performance of interacting with natural images.

Why based on LAION rather than well-annotated OCR datasets:

The well-annotated OCR datasets are usually restricted to certain domains such as black-and-white documents, book covers, etc. As shown in Appendix Figure 10, the text-rich images in the LAION dataset, which contains all kinds of text-image pairs from the internet, are diverse and usually interleaved with natural images. We believe our instruction-following dataset based on LAION suffers from a relatively small domain shift compared to previously collected instruction-following data based on COCO, thus beneficial for potential knowledge transfer (Section 5.4). Also, we believe that collecting data based on real-world documents and well-annotated OCR datasets is an important next step to extend the scope of the data.

[1] Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023.

Comment

Thanks for the authors' responses, which have addressed most of my concerns.

I would like to raise my score if the authors provide a revised paper with the new experiment results.

Best.

Comment

Dear Reviewer miv5,

We’ve updated the paper to include the new experiment results.

Thank you for your feedback and effort in reviewing our paper.

Comment

Thanks for the authors' responses.

Most of my concerns about the experiments have been addressed. I have raised my score.

Official Review (Rating: 5)

This work improves the collection pipeline of instruction-following data, allowing for the collection of a large-scale dataset of text-rich images. Leveraging GPT-4, this work further constructs an instruction-tuning dataset consisting of 422K noisy pretraining examples and 16K conversations and validates it on the recent LLaVA model. The results on multiple text-based VQA datasets show that this dataset improves the text-understanding performance of LLaVA. Case analysis also demonstrates that LLaVAR has stronger image-text understanding abilities than LLaVA.

Strengths

  1. The paper is well organized and easy to follow.
  2. The improved data collection pipeline overcomes the limitations of the existing dataset, which lacks text-rich images and relevant instruction-tuning annotations.
  3. The explanation of the data collection process is detailed, and the relevant experimental analysis provides support for investigating how to enhance visual Large Language Models' understanding of text-rich visual content.

Weaknesses

  1. The contribution of this paper appears to be limited. The proposed data collection pipeline is based on the one used in LLaVA, with incremental improvements. Similarly, the model architecture also follows LLaVA without a specific design for text-rich scenarios.
  2. The dataset introduced in this paper only brings limited improvement. In Table 2, LLaVAR achieves comparable performance with mPLUG-Owl, a model that was not trained on a text-rich dataset. The paper also does not provide a detailed comparison with other state-of-the-art models in the field of text-rich image understanding, which would help to better understand the relative performance of the proposed method.
  3. Fragmented OCR results consisting of only a few words may also occur in real-world text-rich data such as posters and tables; directly removing this kind of data may deviate from the real-world distribution.

Questions

  1. What is the performance if your dataset is used to train mPLUG-Owl, since it is a strong baseline for your method? Can you further fine-tune other open-source models using the dataset from this paper and provide performance comparisons?
  2. Can you provide experimental results and analysis on more text-rich image understanding benchmarks (e.g., Information Extraction, Document Classification, OCR)?
Comment

Thank you for acknowledging the organization and clarity of our paper, as well as appreciating the advancements we've made in data collection and our detailed analysis.

For comparison with baselines/methodology novelty/evaluation scenarios, please refer to the general responses.

On "LLaVAR achieves comparable performance with mPLUG-Owl, a model that was not trained on a text-rich dataset":

It is hard to conclude that mPLUG-Owl is a model not trained on a text-rich dataset, as many captions in the LAION dataset are equivalent to incomplete OCR results (text in an online image sometimes also appears in its caption). At the scale of our experiments, we observe a similar effect: training only on captions of text-rich images already helps text recognition (in Table 3, variant (4) is better than variant (1)). However, training on captions only (variant (4)) is not as good as training on OCR-based data (variants (2)(6)), at least at the scale of our experiments. We assume that training on captions can be powerful enough at the scale of mPLUG-Owl (1000M+ text-image pairs). However, we believe our data is lightweight and effective, and our pipeline is more customizable.

On fragmented OCR results:

The reason we removed fragmented OCR results before prompting GPT-4 is to avoid repetitive and meaningless questions generated by GPT-4. If only a few unrelated words are presented to GPT-4, it is hard to generate meaningful question-answer pairs for the texts. Note that (1) our noisy instruction-following data based on raw OCR results still contains such fragmented OCR results, and (2) our experiments show that the learned text recognition capability transfers well to scenarios like ST-VQA and TextVQA, where the texts in images are usually fragmented words.

Official Review (Rating: 5)

This paper proposes a methodology to improve the text reading capability of large language and visual assistants. There are mainly two contributions: the data collection procedure and the improvement of LLaVA with the data. Text-rich images are collected by applying an image classifier with some filtering criteria. Off-the-shelf OCR tools are then used to obtain the texts in the images. A pretraining task is defined to output the transcribed texts as target. For finetuning, GPT-4 is used to generate instruction-following data. GPT-4 is asked to generate a sequence of question-answer pairs. The model is finetuned with the generated data. The experimental results confirm that LLaVAR improves LLaVA for tasks requiring reading texts. The code, data and model will be released to the public.

Strengths

This paper shows a practical way to collect a large amount of text-rich data. The quality of the data is confirmed by the experiments where the training with the collected data improves the model. The data will be released to the public and the community will be able to benefit from the work.

The methodologies to generate the pretraining and finetuning data are reasonable. The use of GPT-4 to generate instruction-following data is very similar to the idea of LLaVA, but it seems also effective to generate such data for tasks requiring reading texts.

Weaknesses

What seems important for improving the text reading capability of this type of model is to train it on a task that requires reading text. This work also does so by generating data with OCR-ed texts and defining tasks that require reading text. As expected, this improves the model's text reading capability. However, this seems to be a shared problem in the field, and there are other studies that have tried to improve the text reading capabilities of such models (e.g., PreSTU, Pix2Struct). There is no discussion of this aspect, and it looks like yet another attempt with the same objective. The novelty and advantages over other studies should be made clear.

This is essentially an extension of LLaVA with OCR tasks. It is certainly important to improve the text reading capability of this type of model, but it looks a little incremental in terms of methodological novelty.

Questions

I wanted to understand the details of the "GPT-4-based instruction-following evaluation". My assumption was that GPT-4 was treated as an oracle (or ground truth) and some scores were computed against it. However, it was not very clear how text-based GPT-4 can be used to generate ground truth for tasks with image inputs. It was also not clear how the scores were computed.

Details of Ethics Concerns

No concern.

Comment

Thank you for recognizing the practicality and effectiveness of our data collection and training methodologies, as well as the potential community benefits from the release of our data and model.

For methodology novelty, please take a look at the general responses.

Comparison with previous studies like PreSTU and Pix2Struct:

Our work focuses on improving the text recognition capability of multimodal instruction-following models, which can be built on related prior work like PreSTU and Pix2Struct. Assuming we have a good frozen image encoder (which can be CLIP, Pix2Struct, or PreSTU), we study how to align its text recognition capability with large language models (pretraining stage) and how to maintain and acquire such capability during instruction following (finetuning stage). In the current Table 4 and Appendix E, we provide results for using Pix2Struct to augment CLIP and demonstrate its improvement over CLIP $224^2$.

We assume the amount of data needed for feature alignment should be much less than that needed for feature learning. As our data and pipeline focus on feature alignment, we believe it can naturally benefit from any advanced image encoders like Pix2Struct and PreSTU.

GPT-4-based instruction-following evaluation:

GPT-4's responses are treated as oracles. We provide text-only GPT-4 with detailed descriptions of the image (human-written captions, OCR results) and collect its feedback as oracle answers to related questions. To calculate the score, we provide text-only GPT-4 with the detailed description again, together with one question and two answers (one from text-only GPT-4, one from the model we want to test), and ask GPT-4 to give scores to the two answers (1 to 10). The final score is the ratio between the average score of the tested model and the average score of GPT-4. For example, "83.1" in Table 5 means its score is 83.1% of GPT-4's score.
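
For concreteness, a minimal sketch of this relative-score computation (the `relative_score` helper is hypothetical; this is not the actual evaluation script):

```python
# Hypothetical sketch of the relative-score computation described above.
def relative_score(model_scores, gpt4_scores):
    """Each list holds the per-question scores (1 to 10) assigned by the GPT-4 judge."""
    model_avg = sum(model_scores) / len(model_scores)
    gpt4_avg = sum(gpt4_scores) / len(gpt4_scores)
    return 100.0 * model_avg / gpt4_avg  # e.g., 83.1 means 83.1% of GPT-4's score

# Toy example with three judged questions:
print(round(relative_score([7, 8, 6], [9, 8, 9]), 1))  # 80.8
```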

Comment

Thank you for the response. I'm going to definitely reflect the rebuttal to make my final decision. Thanks again for your hard work.

Official Review (Rating: 6)

This paper enhances the visual text understanding ability of large multimodal models by instruction tuning methods. First, two sets of noisy and high-quality instruction-following data are constructed. Specifically, the high-quality instruction-following data are generated by prompting text-only GPT-4 with OCR results and captions. Then, a two-stage training strategy is developed, with the first stage learning OCR capability and the second stage learning high-level understanding capability. Extensive experiments verify that the proposed LLaVAR model can improve performance on both natural and text-rich images.

Strengths

  1. This work is a pioneering exploration of visual instruction tuning for text images, which can provide some useful insights to the community.
  2. The proposed model has the ability to deal with high-resolution images by integrating extra visual encoders and cross-attention modules.
  3. The experiments are basically sufficient to demonstrate the superiority of the LLaVAR method, especially relative to the LLaVA baseline.

Weaknesses

  1. This work is less innovative in approach, as it mainly focuses on the construction of instruction-following data, while the proposed model and implementation pipeline basically follow LLaVA.
  2. From the results in Table 2, LLaVAR has no significant performance advantage compared with existing methods at the same resolution ($224^2$), such as mPLUG-Owl (2023). Besides, it would be better to also report the parameter sizes of these comparison models.
  3. In Section 5.3, only one case study is carried out, so the derived conclusion is hardly convincing.
  4. In Figure 7, the notations are not clearly explained, and the implementation details cannot be visually reflected in this figure.

Questions

  1. As can be seen in Figure 4 and Figure 5, OCR errors are inevitable, e.g., “Boynton” vs. “Byington”. Can you provide some results to analyze the impact of OCR errors?
  2. As mentioned in the paper, the adopted metric only considers the recall, so it is not very reliable. Have you tried other quantitative metrics to prove the effectiveness of the method, such as the metrics designed for the image captioning task?
  3. What does “temperature” refer to in the first paragraph of Section 5?
Comment

Thank you for recognizing our work's innovative aspects and potential impact! We address your comments below:

For comparison with baselines/methodology novelty/evaluation metric, please refer to the general responses.

Table of language model parameter sizes and training data sizes:

| Model | Language Model Parameters | Training Data Size |
| --- | --- | --- |
| BLIP-2-FlanT5-XXL | 11B | 129M |
| OpenFlamingo | 7B | 15M |
| MiniGPT4 | 13B | 5.0M |
| LLaVA | 13B | 0.6M |
| mPLUG-Owl | 7B | 1112M |
| LLaVAR | 13B | 1.0M |

As all baseline models use ViT CLIP 224 as their visual encoder, we list the parameter size of their language model above, together with their training data size.

In our experiments, we find that the language model size has only a minor effect on text-based VQA performance (see the table below). We believe data efficiency is a more important factor to consider when comparing performance.

| Model | ST-VQA | OCR-VQA | TextVQA | DocVQA |
| --- | --- | --- | --- | --- |
| LLaVAR 7B | 28.9 | 25.6 | 37.8 | 6.2 |
| LLaVAR 13B | 30.2 | 23.4 | 39.5 | 6.2 |

More comprehensive case study:

The case study is now conducted on 825 examples from OCR-VQA. We have updated Figure 6 in the draft. In this large-scale case study, we still observe a threshold for recognizable text for both $224^2$-based and $336^2$-based LLaVAR, as the accuracy sharply decreases when the text height is smaller than 7 pixels.

On Figure 7:

We have provided an updated Figure 7 with detailed captions. A sketch of the implementation is as follows: given an image, it is simultaneously processed by visual encoders $V_1$ and $V_2$. The $V_1$ features are transformed by a matrix $W$ and used directly as input embeddings to the language model. The $V_2$ features are transformed by matrices $K$ and $V$ and used as keys and values for cross-attention in every transformer layer (assuming there are $N$ layers), which uses the hidden states from the self-attention module (transformed through $Q$) as queries.
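
For illustration, below is a minimal PyTorch sketch of one such cross-attention block. The class and parameter names are hypothetical, and this is only our reading of the description above, not the actual implementation:

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One transformer layer's cross-attention to high-resolution V_2 features (sketch)."""
    def __init__(self, d_model: int, d_enc: int, n_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # Q: transforms self-attention hidden states
        self.k_proj = nn.Linear(d_enc, d_model)    # K: transforms V_2 features into keys
        self.v_proj = nn.Linear(d_enc, d_model)    # V: transforms V_2 features into values
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, v2_features: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, d_model) from the self-attention module
        # v2_features:   (B, P, d_enc) patch features from the high-resolution encoder V_2
        q = self.q_proj(hidden_states)
        k = self.k_proj(v2_features)
        v = self.v_proj(v2_features)
        out, _ = self.attn(q, k, v)
        return hidden_states + out  # residual connection (assumed)
```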

The impact of OCR errors:

We take 1673 examples from OCR-VQA whose ground-truth answers have more than 10 characters to study such OCR errors. We (i) define a prediction as "correct" if the ground-truth answer appears exactly in the prediction, and (ii) define it as "partially correct" if some substring of the prediction is highly similar to the ground truth but not identical. Specifically, we consider all substrings of the prediction with the same length as the ground truth, compute the ANLS (Average Normalized Levenshtein Similarity), and regard the prediction as "partially correct" if the highest ANLS is greater than or equal to 0.5 but smaller than 1.
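
A minimal, self-contained sketch of this check (our reading of the procedure described above; the function names are hypothetical and this is not the exact evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nls(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def judge(prediction: str, ground_truth: str) -> str:
    pred, gt = prediction.lower(), ground_truth.lower()  # case-insensitive matching (assumption)
    if gt in pred:
        return "correct"
    # Best similarity over all prediction substrings with the same length as the ground truth.
    best = max((nls(pred[i:i + len(gt)], gt)
                for i in range(max(1, len(pred) - len(gt) + 1))), default=0.0)
    return "partially correct" if 0.5 <= best < 1.0 else "incorrect"

print(judge("The title is 'The Greot Gatsby'", "The Great Gatsby"))  # partially correct
```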

| Model | Correct % | Partially Correct % |
| --- | --- | --- |
| LLaVA ($224^2$) | 1.6% | 8.7% |
| LLaVAR ($224^2$) | 6.8% | 22.8% |
| LLaVA ($336^2$) | 2.2% | 11.2% |
| LLaVAR ($336^2$) | 9.0% | 26.8% |

We find that a considerable amount of predictions can be considered partially correct, which indicates the actual performance of tested models is better than the reported accuracy numbers. However, the percentage of partially correct predictions is highly correlated with the percentage of correct predictions. Therefore, we believe that the current metrics can effectively compare the performance of different models.

Clarification on the term “temperature”:

It refers to the sampling temperature used when generating examples from language models.

Comment

We thank the reviewers for their detailed feedback and insightful comments.

Comparison with mPLUG-Owl at the same resolution:

mPLUG-Owl is trained on 1000M+ text-image pairs, while the original LLaVA is trained on about 0.6M text-image pairs. Our model, LLaVAR, is trained on about 1M text-image pairs. At the same resolution, LLaVAR demonstrates good performance with decent data efficiency.

Furthermore, we provide the results of finetuning mPLUG-Owl using our 16K GPT-4-based Instruction-following Data.

| Model | ST-VQA | OCR-VQA | TextVQA | DocVQA | CT80 | POIE | ChartQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mPLUG-Owl | 29.3 | 28.6 | 40.3 | 6.9 | 81.9 | 3.3 | 9.5 |
| mPLUG-Owl + our data | 29.6 | 31.2 | 40.8 | 7.0 | 84.7 | 3.7 | 10.2 |

(We added three extra datasets: CT80 [1] (OCR), POIE [2] (Information Extraction), and ChartQA [3].)

We find that our data can boost the performance in most cases, though the mPLUG-Owl baseline is extensively trained on 1000M+ text-image pairs.

Methodology Novelty:

Though the main part of our study follows the original architecture of LLaVA, we also extend the architecture by connecting the language model with another high-resolution encoder through cross-attention.

In the updated draft, we include the comparison between LLaVA and LLaVAR on those extended architectures.

| Model | ST-VQA | OCR-VQA | TextVQA | DocVQA |
| --- | --- | --- | --- | --- |
| Pix2Struct + LLaVA | 21.9 | 11.8 | 28.7 | 4.4 |
| Pix2Struct + LLaVAR | 35.8 (+13.9) | 30.7 (+18.9) | 45.6 (+16.9) | 15.3 (+10.9) |
| ConcatCLIP + LLaVA | 23.1 | 14.2 | 30.5 | 5.1 |
| ConcatCLIP + LLaVAR | 42.1 (+19.0) | 30.8 (+16.8) | 52.1 (+21.6) | 18.5 (+13.4) |

Note:

  1. Pix2Struct is a visual encoder trained on screenshot-to-HTML transformation. It supports up to 2048 patches of size $16^2$, equivalent to a resolution of $1024 \times 512$.
  2. ConcatCLIP refers to using 16 CLIP-ViT-L/14 models to encode the $4 \times 4$ grid of image crops separately and then concatenating the extracted features. In other words, it supports $896^2$ resolution.
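
As an illustration of the ConcatCLIP idea in note 2, here is a minimal sketch. The `encode_image` callable is a stand-in for a CLIP-ViT-L/14 visual backbone returning per-patch features, and a single shared callable replaces the 16 separate encoders to keep the sketch short; shapes are assumptions, not the actual implementation:

```python
import torch

def concat_clip_features(image: torch.Tensor, encode_image) -> torch.Tensor:
    """Encode a 4 x 4 grid of 224 x 224 crops separately and concatenate the patch features.

    image: (B, 3, 896, 896); encode_image: callable mapping (B, 3, 224, 224) -> (B, 256, d),
    assuming a ViT-L/14 backbone yields 16 x 16 = 256 patch tokens per 224 x 224 crop.
    Returns concatenated features of shape (B, 16 * 256, d).
    """
    b, c, h, w = image.shape
    assert h == w == 896, "expects an 896 x 896 input (a 4 x 4 grid of 224 x 224 crops)"
    crops = [image[:, :, i * 224:(i + 1) * 224, j * 224:(j + 1) * 224]
             for i in range(4) for j in range(4)]
    feats = [encode_image(crop) for crop in crops]  # 16 tensors of shape (B, 256, d)
    return torch.cat(feats, dim=1)                  # (B, 4096, d)
```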

Instead of pursuing the best performance within a low resolution (e.g., $224^2$), our study reveals that higher-resolution models benefit more from our collected data, suggesting our data is underutilized in the original LLaVA architecture. We hope our result can inspire future designs that better leverage existing data with suitable visual encoders.

Evaluation Scenarios:

We added three extra datasets: CT80 [1] (OCR), POIE [2] (Information Extraction), and ChartQA [3]. We use the same VQA metric as other text-based VQA datasets.

| Model | CT80 | POIE | ChartQA |
| --- | --- | --- | --- |
| BLIP-2 | 80.9 | 2.5 | 7.2 |
| OpenFlamingo | 67.7 | 2.1 | 9.1 |
| MiniGPT4 | 57.3 | 1.3 | 4.3 |
| mPLUG-Owl | 81.9 | 3.3 | 9.5 |
| LLaVA ($224^2$) | 61.5 | 1.9 | 9.2 |
| LLaVAR ($224^2$) | 81.6 | 5.7 | 10.2 |
| LLaVA ($336^2$) | 64.9 | 2.5 | 10.2 |
| LLaVAR ($336^2$) | 83.0 | 8.7 | 13.5 |

[1] Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41:8027–8048, 2014.

[2] Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, and Xiang Bai. Visual information extraction in the wild: Practical dataset and end-to-end solution. arXiv preprint arXiv:2305.07498, 2023.

[3] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.

Comment

We thank the reviewers for their detailed feedback and insightful comments.

Evaluation Metric:

The metric used for text-based VQA is the standard practice in VQA benchmarks. For ST-VQA and DocVQA, previous works use ANLS (Average Normalized Levenshtein Similarity) as the metric, which calculates the average normalized edit distance and only works well for supervised models trained to output short, precise answers. It works poorly for instruction-following models, which usually output long sequences instead of brief answers. For reference, we provide additional text-matching metrics to demonstrate the improvement of our model; these work well except for OCR-VQA.
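
For reference, here is a minimal sketch of one such text-matching metric, ROUGE-L, computed as an LCS-based F-measure over whitespace tokens with an assumed beta = 1.2; the numbers below may come from a standard evaluation package rather than this exact code:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(prediction: str, reference: str, beta: float = 1.2) -> float:
    """LCS-based ROUGE-L F-measure over whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

# A long instruction-following answer still gets partial credit for covering the short reference:
print(round(rouge_l("the book was written by john smith", "john smith"), 2))  # 0.49
```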

ST-VQA:

| Model | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- |
| LLaVA ($224^2$) | 7.0 | 8.2 | 15.3 |
| LLaVAR ($224^2$) | 10.0 | 11.4 | 24.5 |
| LLaVA ($336^2$) | 8.4 | 9.9 | 19.1 |
| LLaVAR ($336^2$) | 12.8 | 14.3 | 30.9 |

textVQA:

| Model | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- |
| LLaVA ($224^2$) | 8.7 | 10.5 | 12.2 |
| LLaVAR ($224^2$) | 12.5 | 14.9 | 21.4 |
| LLaVA ($336^2$) | 9.9 | 12.1 | 15.3 |
| LLaVAR ($336^2$) | 14.8 | 17.4 | 27.0 |

OCR-VQA:

| Model | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- |
| LLaVA ($224^2$) | 0.2 | 0.1 | 0.0 |
| LLaVAR ($224^2$) | 0.3 | 0.1 | 0.0 |
| LLaVA ($336^2$) | 0.3 | 0.1 | 0.0 |
| LLaVAR ($336^2$) | 0.2 | 0.1 | 0.0 |

(We assume these metrics are not informative for OCR-VQA since the ground-truth answers are too short in most cases.)

DocVQA:

| Model | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- |
| LLaVA ($224^2$) | 3.8 | 4.8 | 6.3 |
| LLaVAR ($224^2$) | 5.6 | 6.9 | 12.7 |
| LLaVA ($336^2$) | 4.6 | 5.6 | 8.7 |
| LLaVAR ($336^2$) | 8.6 | 10.0 | 21.5 |
Comment

Dear reviewers, we have revised the paper to include the new experiment results. The main updates are:

  1. Results of extra text-matching evaluation metrics: Appendix D and Tables 8, 9, 10, 11.
  2. More evaluation scenarios: Appendix E and Table 12.
  3. Discussion of the comparison with mPLUG-Owl and the result of finetuning mPLUG-Owl using our data: Appendix F and Table 13.
  4. Extended LLaVA architecture with high-resolution encoders: Appendix H, Figure 7, and Table 14.
  5. The impact of OCR spelling errors: Appendix I and Table 15.
  6. Case study of the recognizable font size: Section 5.3 and Table 6.
  7. Ablation study on the LLaVA benchmark: Appendix J and Table 16.

Due to the page limit, most updated content is presented in the Appendix. We’ve added pointers to the Appendix in the main paper.

Please let us know if you have further questions!

AC Meta-Review

This paper presents a method to enhance the text-rich image understanding capability of a multimodal LLM, in particular LLaVA. The main contributions of the work are: a new dataset consisting of 400k examples with OCR-enhanced text, and a new LLaVAR model tuned on the constructed dataset.

The reviewers acknowledge that this work is among the earliest to explore visual instruction tuning for text-rich images, and that the constructed dataset will surely benefit the community.

However, a primary concern raised by all reviewers is related to novelty. The common understanding is that the distribution of the training dataset significantly impacts performance, and it is expected that incorporating more OCR-ed texts and related tasks would enhance the model's text reading capabilities. Consequently, the overall contribution of the paper is deemed limited.

Why Not a Higher Score

Limited novelty. The contribution of the paper is mostly a new dataset, and the resulting model behaves as expected, without any surprises.

Why Not a Lower Score

N/A

Final Decision

Reject