PaperHub
ICLR 2025 · Decision: Rejected · 3 reviewers
Overall rating: 5.3/10 (individual ratings: 6, 5, 5; min 5, max 6, std 0.5)
Confidence: 4.0 · Correctness: 2.0 · Contribution: 2.0 · Presentation: 2.7

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

Dragonfly surpasses existing vision transformers by zooming in beyond native image resolutions, excelling in fine-grained detail extraction and setting new benchmarks in general and biomedical tasks.

Abstract

Keywords
Multimodal Language Model · Visual Instruction Tuning · Biomedical Multimodal Model · Foundation Model

Reviews and Discussion

Review (Rating: 6)

The manuscript presents Dragonfly, a novel Vision-Language Model (VLM) that employs a multi-resolution zoom-in encoding strategy to enhance fine-grained visual understanding. Unlike conventional Vision Transformers (ViTs) that downsample images to fixed, low resolutions—thereby losing critical details—Dragonfly processes images at higher resolutions and employs a multi-crop technique that exceeds the native resolution. This approach allows the model to capture intricate details from non-dominant objects, charts, and embedded text, which are often challenging for existing ViTs. To address the computational complexity arising from the increased token count, Dragonfly utilizes a mean-pooling aggregation strategy. The model demonstrates competitive performance across ten general-domain benchmarks and sets new benchmarks in several biomedical tasks, outperforming larger models trained on significantly more extensive datasets.
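To make the encoding strategy described above concrete, here is a minimal sketch (not the authors' released code) of multi-resolution zoom-in encoding with mean-pooling aggregation. It assumes a CLIP-style ViT that maps each 336×336 crop to a 24×24 grid of patch tokens; the zoom levels, crop size, and tokens-per-crop value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def encode_multi_resolution(image, vit, crop_size=336, zoom_levels=(1, 2, 4),
                            tokens_per_crop=36):
    """Encode an image at several zoom levels and mean-pool each crop's ViT tokens."""
    all_tokens = []
    for zoom in zoom_levels:
        # Resize the full image so it tiles exactly into zoom x zoom crops.
        resized = F.interpolate(image.unsqueeze(0),
                                size=(crop_size * zoom, crop_size * zoom),
                                mode="bilinear", align_corners=False)[0]
        for i in range(zoom):
            for j in range(zoom):
                crop = resized[:, i * crop_size:(i + 1) * crop_size,
                                  j * crop_size:(j + 1) * crop_size]
                tokens = vit(crop.unsqueeze(0))            # (1, 576, d): a 24x24 patch grid
                b, n, d = tokens.shape
                side = int(n ** 0.5)                       # 24
                grid = tokens.transpose(1, 2).reshape(b, d, side, side)
                out_side = int(tokens_per_crop ** 0.5)     # 6 -> 36 tokens per crop
                pooled = F.adaptive_avg_pool2d(grid, out_side)  # mean-pooling aggregation
                all_tokens.append(pooled.flatten(2).transpose(1, 2))
    # Concatenate the pooled tokens from every crop at every zoom level for the LM.
    return torch.cat(all_tokens, dim=1)
```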

Strengths

  1. Dragonfly introduces an innovative multi-resolution zoom-in encoding strategy that surpasses native image resolutions, enabling the capture of intricate details from non-dominant objects, charts, and embedded text.
  2. It implements a simple yet effective mean-pooling aggregation method and achieves good performance across a diverse set of benchmarks.

Weaknesses

  1. Novelty: The proposed method builds upon existing multi-resolution and multi-crop techniques without offering substantial novel contributions. The idea of processing images at higher resolutions and using multi-crop strategies has been explored in prior works, and Dragonfly does not sufficiently differentiate itself beyond these established methods.
  2. Model Comparison: Dragonfly is developed using the more advanced Llama3 model, whereas comparable methods utilize less capable language models, such as Llama2 and Qwen2. This discrepancy raises concerns about the fairness of comparisons. How does Dragonfly's performance measure up when evaluated against these models?
  3. Data Influence: It is unclear whether the observed performance improvements with Dragonfly stem from the curated data or from the model's design. How does Dragonfly perform when tested with a commonly used dataset?

Questions

See weaknesses.

Comment

Novelty. The proposed method builds upon existing multi-resolution and multi-crop techniques without offering substantial novel contributions...

Thank you for raising this point. We agree that the mechanism here is indeed similar to any-resolution methods: like us, LLaVA-1.5-HD and LLaVA-UHD also featurize multiple crops and pass them all to the LM, potentially with compression to manage the computational cost.

The difference is that these prior methods use a small number of crops, because their only goal is to preserve the image at native-resolution and avoid destroying information via downsizing. If this was all we needed for a rich visual representation, there would be no more room for improvement by zooming in. Our results clearly show that this is not the case, and that zooming beyond native resolution leads to significant gains, especially for tasks like text and chart understanding that rely on fine-grained, localized image details (see the results of our controlled comparison in Table 1).

The impact of our work is not in introducing the idea of multi-crop features, but in finding a new direction for improving visual encodings for VLMs. Our interpretation of the consistently strong results is that they expose a fundamental weakness of current ViTs: they fail to capture fine-grained details within their context window, even when they are not affected by excessive downsizing. In the short term, this issue can be mitigated by techniques like ours that featurize multiple zoomed-in crops and ensure the LM receives high-quality embeddings. Long-term, we believe a more efficient but challenging solution is to revisit ViT pre-training and even post-training techniques and improve ViTs’ ability to capture more information from single crops (a point we’ve now added to the Conclusion).

To better clarify our motivation and contribution in relation to existing work, we have refined the Abstract, Introduction, and Conclusion sections of our paper. We have also included an additional ablation to further solidify the source of our improvements and differences from prior work in Section 4.4 of our updated submission.

Model Comparison: Dragonfly is developed using the more advanced Llama3 model, whereas comparable methods utilize less capable language models, such as Llama2 and Qwen2. This discrepancy raises concerns about the fairness of comparisons. How does Dragonfly's performance measure up when evaluated against these models?

Thanks for raising this important and valid concern. We would like to clarify that for a controlled and fair comparison, Table 1 shows results of several models implemented within our codebase (including the same dataset, compute resources, LLM, and ViT backbone), where the only difference is the visual encoding strategy. In response to the reviewer’s concern, we have also included new results implementing LLaVA-UHD within our experimental setup, a related work that aims to handle images at native resolution. Empirically, Dragonfly outperforms all these baseline methods across all evaluated tasks, demonstrating the strength of our approach. Notably, LLaVA-UHD’s image encoding aims to preserve native image resolution, theoretically maximizing raw visual information. The fact that Dragonfly surpasses LLaVA-UHD indicates that zooming in beyond the native resolution provides a meaningful performance boost, further validating our design choices.

Regarding the comparison to other models in Table 4, we wanted to test our model when scaled up to a larger dataset and show how it compares against other open-source models, but we acknowledge that these results are confounded by many differences. Most of these models are trained on significantly larger datasets than ours.

Data Influence: It is unclear whether the observed performance improvements with Dragonfly stem from the curated data or from the model's design. How does Dragonfly perform when tested with a commonly used dataset?

We thank the reviewer for raising this important point. Most open-source models are trained on slightly different data mixtures, making it difficult to control for variations in training data. Additionally, even with identical data mixtures, differences in model architecture and compute resources can influence performance. To ensure a fair comparison, we trained the most relevant baseline models (LLaVA-1.5-HD, LLaVA-UHD, and several implementations of our approach) using the same experimental codebase as Dragonfly, including the compute, dataset, and LM and ViT backbones. The results of this controlled comparison are provided in Table 1 of the paper. Regarding the dataset, we use commonly used datasets in the public domain such as LLaVA-Pretrain and ShareGPT4V. Detailed descriptions of these datasets and our filtering process can be found in Section 4.1 and Appendix A.

Comment

Thank you for your efforts. Most of my concerns have been resolved, and I have improved my scores. Good luck!

Review (Rating: 5)

This paper introduces multi-crop techniques that go beyond the native resolution of high-resolution images. To handle the resulting large number of visual tokens, the authors employ an average-pooling strategy on each crop. Beyond general-domain benchmarks on fine-grained image understanding, the paper also contributes results on biomedical tasks. The authors curate an SFT dataset comprising 2.4M general images and 1.4M biomedical images for training.

Strengths

  1. The ablation studies prove the effectiveness of the proposed strategy.
  2. The biomedical domain is considered by this paper.
  3. The motivation is natural and easy to follow.
  4. An SFT dataset spanning different domains with a large number of images is built.

Weaknesses

  1. The novelty is limited. The proposed strategy is only an extension of the any-resolution technique. Compared to any-resolution methods, which use two levels of resolution, they only resize the image, crop more patches, and use three levels of image resolution.
  2. The visualizations only show the responses of the proposed Dragonfly; other MLLMs' responses are encouraged to be listed for better comparison.
  3. To balance computational cost and performance, they use mean pooling within each crop. The paper lacks a discussion of how to choose a proper compression ratio for this trade-off.

Questions

Please see weaknesses.

Comment

The novelty is limited. The proposed strategy is only an extension of any-resolution technique. Compared to any-resolution which uses two levels of resolutions, they only resize the image, crop more patches and use three levels of resolutions of image.

Thank you for raising this point. We agree that the mechanism here is indeed similar to any-resolution methods: like us, LLaVA-1.5-HD and LLaVA-UHD also featurize multiple crops and pass them all to the LM, potentially with compression to manage the computational cost.

The difference is that these prior methods use a small number of crops, because their only goal is to preserve the image at native-resolution and avoid destroying information via downsizing. If this was all we needed for a rich visual representation, there would be no more room for improvement by zooming in. Our results clearly show that this is not the case, and that zooming beyond native resolution leads to significant gains, especially for tasks like text and chart understanding that rely on fine-grained, localized image details (see the results of our controlled comparison in Table 1).

The impact of our work is not in introducing the idea of multi-crop features, but in finding a new direction for improving visual encodings for VLMs. Our interpretation of the consistently strong results is that they expose a fundamental weakness of current ViTs: they fail to capture fine-grained details within their context window, even when they are not affected by excessive downsizing. In the short term, this issue can be mitigated by techniques like ours that featurize multiple zoomed-in crops and ensure the LM receives high-quality embeddings. Long-term, we believe a more efficient but challenging solution is to revisit ViT pre-training and even post-training techniques and improve ViTs’ ability to capture more information from single crops (a point we’ve now added to the Conclusion).

To better clarify our motivation and contribution in relation to existing work, we have refined the Abstract, Introduction, and Conclusion sections of our paper. We have also included an additional ablation to further solidify the source of our improvements and differences from prior work in Section 4.4 of our updated submission.

The visualizations only show the response of the proposed Dragonfly, other MLLM's responses are encouraged to be listed for better comparison.

Thank you for raising this point. As requested, we have added responses from other baseline models (LLaVA-1.5-HD, LLaVA-UHD) and a stronger model (GPT-4o) alongside Dragonfly for the same image and question. These comparisons can be found in Figure 3 of the appendix in the revised submission. While all models perform reasonably well on most tasks, Dragonfly and GPT-4o excel in tasks requiring fine-grained detail, such as reading a license plate, where both provide correct answers while others fail. On a more challenging task of interpreting the numbers in a chart, Dragonfly is the only model that gets it right.

We believe our quantitative results provide a better comparison between models and therefore moved this qualitative figure to the appendix, but we retained a similar set of QA pairs in Figure 2 to demonstrate the types of biomedical questions in our benchmarks.

Comment

To balance the computational costs and performance, they use mean pooling within each crop. The paper lacks the discussion about how to choose a proper compression ratio for trade-off.

Thank you for raising this point because our original experiments only showed results for a single compression setting. We have included additional ablations to address this point below.

In general there exists a tradeoff between the amount of compression and the quality of the embeddings passed to the LM. For simplicity, we originally used a uniform compression level of 36 tokens per patch in our original results, which we empirically found to strike a balance between compression, compute efficiency, and performance. Based on your comment, we have added new and more rigorous ablation results in Tables 7-8 showing performance for a range of compression ratios.

The key takeaways are as follows: aggressive pooling, such as reducing to 9 tokens per crop in high resolution or 16 tokens in medium resolution, significantly hurts performance by discarding fine details. Performance improves notably when increasing to 36 tokens in medium-resolution and 16 tokens in high-resolution, after which gains taper off. Overall, these results confirm that zooming can lead to significant performance gains if we strike an appropriate balance between efficiency and embedding quality, even when this involves significant compression via mean pooling.

Performance Comparison for Compression Levels for Mean-Pooling in the Low + High Resolution Model. Tokens are per crop level.

| Benchmark | 64 tokens | 36 tokens | 16 tokens | 9 tokens |
| --- | --- | --- | --- | --- |
| AI2D | 62.9 | 63.6 | 62.0 | 61.4 |
| ScienceQA | 80.1 | 79.0 | 79.5 | 77.5 |
| ChartQA | 56.9 | 56.4 | 55.2 | 45.4 |
| POPE-f1 | 86.7 | 87.7 | 88.4 | 85.9 |
| GQA | 54.9 | 55.2 | 55.9 | 52.9 |
| TextVQA | 66.8 | 65.2 | 64.6 | 59.1 |
| VizWiz | 57.7 | 59.7 | 59.1 | 59.1 |
| MME | 1421.1 | 1397.8 | 1434.9 | 1309.9 |
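For concreteness, here is a minimal sketch of what these compression levels correspond to, assuming each crop is featurized into a 24×24 grid of 576 ViT patch tokens and that mean pooling is implemented as adaptive average pooling over the token grid; the hidden dimension is hypothetical.

```python
import torch
import torch.nn.functional as F

crop_tokens = torch.randn(1, 576, 1024)            # one crop: 24*24 patch tokens, hidden dim 1024
grid = crop_tokens.transpose(1, 2).reshape(1, 1024, 24, 24)

for target in (64, 36, 16, 9):                     # the columns of the table above
    side = int(target ** 0.5)                      # 8, 6, 4, 3
    pooled = F.adaptive_avg_pool2d(grid, side)     # mean over non-overlapping windows
    print(target, pooled.flatten(2).transpose(1, 2).shape)  # e.g. 36 -> (1, 36, 1024)
```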
Comment

Dear Reviewer,

Thank you for your insightful and constructive feedback. We have carefully addressed the concerns raised and provided additional clarifications and results to improve the manuscript. We sincerely appreciate your consideration of our rebuttal and look forward to your further feedback.

Best regards,

The Authors

Comment

Dear Reviewer,

Thank you for taking the time to review our paper. We are reaching out one final time to remind you of our rebuttal to your initial review. We would greatly appreciate it if you could take a moment to review it and let us know if we have adequately addressed your concerns.

Best regards, The Authors

Review (Rating: 5)

The paper introduces DragonFly to enhance vision-language models. The main idea is to combine the multi-cropping with mean pooling, so that the VLM can use high-resolution image encoders and work on images' native resolution, while ensuring the efficiency of the model. The proposed enhanced VLM performs well in tasks needing finer details, such as biomedical tasks.

Strengths

  1. Dragonfly uses multi-crop techniques to process images at high resolutions, thus enhancing the model’s capability to capture fine details. Meanwhile, it uses a simple mean-pooling strategy to reduce visual tokens effectively, which retains the efficiency of the model.
  2. The proposed method achieves competitive performance across multiple benchmarks and shows strong generalizability across both general and specialized biomedical tasks.

Weaknesses

The paper uses a very simple strategy that combines multi-crop and mean-pooling to enhance VLM. However, the motivation for choosing mean-pooling instead of other compression techniques is not clearly stated. It's like an experimental report that simply states this method is effective. So why does mean-pooling outperform other strategies? Why do you choose such a pooling window? Will this direct pooling harm the extraction of the fine details? A more comprehensive analysis is expected.

Questions

Please refer to the weaknesses part. Additionally, is there a comparison of the inference time or flops of different methods?

Comment

Thank you for taking the time to review our paper and provide feedback. We respond to the concerns mentioned in your review below. We have also revised our submission and the revisions are all in blue.

The paper uses a very simple strategy that combines multi-crop and mean-pooling to enhance VLM

We are not sure if this is meant as a weakness. If so, we believe the simplicity of our approach is not an issue and perhaps even a positive point. In case the reviewer is concerned about the novelty of our approach, we kindly ask the reviewer to take a look at our responses to other reviewers who directly raised those concerns. Briefly, there is a distinction between previous methods that only aim to avoid excessive downsizing that would destroy image information, and our method that shows that zooming beyond native resolution helps capture additional, fine-grained details that ViTs otherwise overlook.

However, the motivation for choosing mean-pooling instead of other compression techniques is not clearly stated. So why does mean-pooling outperform other strategies?

We would like to thank the reviewer for making this important point and we have addressed the reviewer's concern below.

The main idea we introduce in this work is zooming in beyond native resolution; compression is necessary to make this practical, and we take an empirical approach to finding the technique that works best. We tested multiple techniques in Section 4.2 (Table 1), and we choose mean pooling because it matches or outperforms other methods across our suite of benchmarks.

Alternative approaches, such as the perceiver resampler or using a lower-capacity ViT, present certain trade-offs. A perceiver resampler is capable of compressing image tokens into a smaller number, but it’s a parametric approach that requires learning the mapping; it could work better with more training data, but our experiments operate in a relatively low-data regime compared to works like Qwen-VL, and that may be why a learning-free approach like mean pooling proves more robust and efficient. Based on the reviewer's comment, we ensured that this point is discussed in the paper.

Using a less powerful ViT with fewer tokens addresses the token count issue, but at the expense of high-quality embeddings. The specific model we considered, CLIP ViT-B/32, is trained at lower resolution and with larger patch size, and cannot represent as much image information. In contrast, mean-pooled tokens from a stronger ViT (CLIP ViT-L/14 @ 336px) retain richer and more expressive embeddings, making this strategy particularly effective in tasks requiring fine-grained details.
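As a rough illustration of this token-count argument, assuming the standard CLIP input resolutions and patch sizes for these two backbones:

```python
def patch_grid(resolution, patch_size):
    """Number of patch tokens produced by a square ViT input at the given patch size."""
    side = resolution // patch_size
    return side * side

weak = patch_grid(224, 32)    # CLIP ViT-B/32 @ 224px -> 7 x 7 = 49 patch tokens
strong = patch_grid(336, 14)  # CLIP ViT-L/14 @ 336px -> 24 x 24 = 576 patch tokens
print(weak, strong)           # 49 576: mean-pooling 576 -> 36 still starts from a far
                              # richer per-crop representation than 49 raw tokens
```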

Why do you choose such a pooling window? Will this direct pooling harm the extraction of the fine details?

Thank you for raising this point because our original experiments only showed results for a single compression setting. We have included additional ablations to address this point below.

In general there exists a tradeoff between the amount of compression and the quality of the embeddings passed to the LM. For simplicity, we originally used a uniform compression level of 36 tokens per patch in our original results, which we empirically found to strike a balance between compression, compute efficiency, and performance. Based on your comment, we have added new and more rigorous ablation results in Tables 7-8 showing performance for a range of compression ratios.

The key takeaways are as follows: aggressive pooling, such as reducing to 9 tokens per crop in high resolution or 16 tokens in medium resolution, significantly hurts performance by discarding fine details. Performance improves notably when increasing to 36 tokens in medium-resolution and 16 tokens in high-resolution, after which gains taper off. Overall, these results confirm that zooming can lead to significant performance gains if we strike an appropriate balance between efficiency and embedding quality, even when this involves significant compression via mean pooling.

Performance Comparison for Compression Levels for Mean-Pooling in the Low + High Resolution Model. Tokens are per crop level.

| Benchmark | 64 tokens | 36 tokens | 16 tokens | 9 tokens |
| --- | --- | --- | --- | --- |
| AI2D | 62.9 | 63.6 | 62.0 | 61.4 |
| ScienceQA | 80.1 | 79.0 | 79.5 | 77.5 |
| ChartQA | 56.9 | 56.4 | 55.2 | 45.4 |
| POPE-f1 | 86.7 | 87.7 | 88.4 | 85.9 |
| GQA | 54.9 | 55.2 | 55.9 | 52.9 |
| TextVQA | 66.8 | 65.2 | 64.6 | 59.1 |
| VizWiz | 57.7 | 59.7 | 59.1 | 59.1 |
| MME | 1421.1 | 1397.8 | 1434.9 | 1309.9 |
Comment

Is there a comparison of the inference time or flops of different methods?

Thank you for the comment. We have conducted a FLOPS comparison for all models trained within our codebase, which we summarize in the table below (Table 11 in the paper):

| Model | Max Resolution | TFLOPs |
| --- | --- | --- |
| LLaVA-HD | 672 x 672 | 40.33 |
| LLaVA-UHD | 672 x 1008 | 6.91 |
| Dragonfly | 2016 x 2016 | 41.65 |
| Dragonfly* | 2016 x 2016 | 25.10 |

Dragonfly’s approach of zooming in and processing multi-crops requires more FLOPs than alternative methods, although it is comparable to LLaVA-1.5-HD. The computational cost is a trade-off for achieving enhanced performance and supporting higher-resolution images. However, as confirmed by our new ablations discussed above, even 16 tokens per sub-image for high resolution can perform nearly as well as 36 tokens per sub-image; based on this insight, we trained a Dragonfly* model with 64 tokens for low resolution, 144 tokens for medium resolution, and 576 tokens for high resolution, resulting in a total of 784 tokens. This configuration requires 25.10 TFLOPs for the visual encoding and has performance comparable to the main Dragonfly model, as shown in Table 12.
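A quick arithmetic check of the Dragonfly* configuration quoted above; how the per-level totals split across individual crops is not specified here, so this only verifies the overall counts reported.

```python
# Per-level visual token budget of Dragonfly* as stated in the rebuttal.
budget = {"low": 64, "medium": 144, "high": 576}
print(sum(budget.values()))          # 784 visual tokens in total
# Visual-encoding cost from the table: 25.10 TFLOPs vs. 41.65 TFLOPs for Dragonfly.
print(round(1 - 25.10 / 41.65, 2))   # ~0.40, i.e. roughly a 40% FLOPs reduction
```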

Our main contribution is demonstrating the benefits of zooming in, and this currently requires more FLOPs to process multiple image sub-crops. An interesting future direction, which we now mention in the Conclusion, is exploring ways to achieve similar performance gains without additional VLM FLOPs, potentially through alternative vision encoder pretraining strategies or post-processing techniques like CLIPSelf or locality alignment to better capture fine-grained details.

Comment

Dear Reviewer,

Thank you for your insightful and constructive feedback. We have carefully addressed the concerns raised and provided additional clarifications and results to improve the manuscript. We sincerely appreciate your consideration of our rebuttal and look forward to your further feedback.

Best regards,

The Authors

Comment

Dear Reviewer,

As we approach the end of the discussion period, we want to thank you for your thoughtful feedback. We have addressed all your valid concerns, and we believe these changes have strengthened our paper. We would greatly appreciate it if you could take a moment to review our rebuttal.

Best regards, The Authors

Comment

Thanks to the authors for addressing some of my concerns. I appreciate the feedback, but I still feel like the method doesn’t have enough of an inspiring or fresh approach. I keep my original rating.

Comment

Dear Reviewer,

Thank you for reading our rebuttal and for your reply. We would greatly appreciate it if you could tell us which concerns from your original review were not addressed. We tried to address all of them, in some cases with new experiments:

  • Motivation for choosing mean-pooling: we determined this choice empirically, as shown in Table 1
  • Why mean-pooling outperforms other strategies: it has no learnable parameters and is more compatible with our scale of training data, but other strategies could work better given more data
  • How we chose our pooling window: there’s a trade-off between accuracy and computational efficiency, and we ran new experiments showing that we can compress even more and achieve similar performance (shown in our new Tables 7-8)
  • Inference time or FLOPs comparison: we quantified FLOPs for all the visual encoding methods implemented in our codebase (shown in our new Table 11)

From your latest response, it seems your main concern at this point is novelty. We will leave it to your judgement whether our work is sufficiently fresh, but as we discussed with the other reviewers, it’s distinct from existing native-resolution methods like LLaVA-1.5-HD and LLaVA-UHD. Those methods use multi-crop representations as well, but only a small number of crops, because their goal is to avoid destroying information through downsizing. Our contribution is highlighting a new direction for improving visual encodings in VLMs: our results demonstrate that zooming beyond native resolution leads to significant gains, especially for tasks like text and chart understanding that rely on fine-grained, localized image details. In the short term, this issue can be mitigated by techniques like ours that featurize multiple zoomed-in crops. Longer-term, future work may revisit ViT pre-training or post-training techniques to improve ViTs’ ability to capture more information from single crops (a point we’ve now added to the conclusion). Either way, our work shows that there are clear gains to be achieved from this direction, and that it’s important to solve the weaknesses that limit current ViTs.

Comment

Dear Reviewer,

Thank you for taking the time to review our paper and for providing an initial response to our rebuttal. We are reaching out one final time to remind you of our most recent response. We would greatly appreciate it if you could take a moment to review it and let us know if we have adequately addressed your concerns.

Best regards, The Authors

Comment

We thank the reviewer for taking the time to review our paper and provide insightful feedback. Below, we show results for an additional ablation that we ran to clarify our motivation.

Additional Ablations to Clarify Relative Importance of Zooming In

To further clarify the source of our improvements and the differences from prior work, we conducted a controlled experiment to disentangle the benefits of zooming in and preserving native-resolution details. Our original results in Tables 1-2 show that zooming beyond native resolution enhances performance by capturing fine-grained image details. To assess how much of the gain comes from zooming in alone, we ran the following new experiment, the results of which are shown in Table 3.

| Metric | Low-Resolution | Medium-Resolution from Low-Resolution |
| --- | --- | --- |
| AI2D | 60.6 | 62.9 |
| ScienceQA | 76.0 | 77.6 |
| ChartQA | 21.6 | 52.4 |
| POPE | 83.4 | 85.1 |
| GQA | 49.5 | 54.7 |
| TextVQA | 40.0 | 57.4 |
| VizWiz | 57.4 | 58.0 |
| MME Perception | 1205.3 | 1398.9 |

For the first run, we rescaled all images to a low resolution of 336 × 336 and generated 576 tokens from that single image, represented as "Low-Resolution" in Table 3 (and identical to our "low resolution" entry in Table 2). From this setup, we then conducted an experiment where we zoomed in 2×, generating images of size 672 × 672 and producing four crops from the rescaled image. Each crop was passed through the ViT, generating 576 tokens, which we then pooled down to 144 tokens per crop, for a total of 576 tokens across all crops. This approach has exactly the same raw image information and the same number of final image tokens as the "Low-Resolution" baseline, and its only benefit is zooming in. In Table 3, this is the "Medium-Resolution from Low-Resolution" column, and it outperforms the "Low-Resolution" model on all benchmarks, particularly on tasks like ChartQA and TextVQA where fine-grained information is critical. This shows that zooming in and featurizing specific sub-regions, without any additional native-resolution image information, significantly improves performance.
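A minimal sketch of this controlled setup (not the authors' code), assuming a ViT that returns a 24×24 grid of 576 patch tokens per 336×336 crop:

```python
import torch
import torch.nn.functional as F


def medium_from_low(image, vit):
    # Downscale to the low-resolution baseline input (336 x 336).
    low = F.interpolate(image.unsqueeze(0), size=(336, 336),
                        mode="bilinear", align_corners=False)
    # Zoom in 2x on the already-downscaled image: no new pixel information is added.
    zoomed = F.interpolate(low, size=(672, 672),
                           mode="bilinear", align_corners=False)[0]
    crops = [zoomed[:, i * 336:(i + 1) * 336, j * 336:(j + 1) * 336]
             for i in range(2) for j in range(2)]
    pooled = []
    for crop in crops:
        tokens = vit(crop.unsqueeze(0))                  # (1, 576, d): 24 x 24 patch grid
        d = tokens.shape[-1]
        grid = tokens.transpose(1, 2).reshape(1, d, 24, 24)
        pooled.append(F.adaptive_avg_pool2d(grid, 12)    # 12 x 12 = 144 tokens per crop
                      .flatten(2).transpose(1, 2))
    return torch.cat(pooled, dim=1)                      # (1, 4 * 144 = 576, d) in total
```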

For further details, please refer to Section 4.4 in our updated submission.

Comment

Dear Reviewers,

I hope this message finds you well. I am writing to kindly remind you of the rebuttal submitted for our paper in response to the initial reviews. We have addressed the concerns raised and provided additional clarifications and results to improve the manuscript.

We would greatly appreciate it if you could take a moment to review our responses and share any further feedback or questions you may have. Your insights are invaluable to us in refining our work.

Thank you for your time and consideration. We look forward to your comments.

AC Meta-Review

This submission received two negative scores and one positive score after the rebuttal. After carefully reading the paper and the review comments, the AC cannot recommend acceptance of this submission, as the average score is below the threshold and the concerns about the proposed approach remain. The AC also recognizes the contributions confirmed by the reviewers and encourages the authors to update the paper according to the discussion and submit it to an upcoming conference.

Additional Comments from the Reviewer Discussion

This submission was fully discussed during the rebuttal period. While most of the concerns from Reviewer #etVR were resolved, those from Reviewer #NJjw (novelty and effectiveness) were not fully addressed.

Final Decision

Reject