PaperHub · ICML 2025 · Poster
Overall rating: 5.5/10
4 reviewers · ratings: 4, 3, 4, 1 (min 1, max 4, std 1.2)

Visual Abstraction: A Plug-and-Play Approach for Text-Visual Retrieval

Submitted: 2025-01-10 · Updated: 2025-08-29

Abstract

Keywords
text-to-image retrieval, text-to-video retrieval, plug-and-play

Reviews and Discussion

Review (Rating: 4)

The paper introduces VISA, a test-time, plug-and-play approach for text-to-visual retrieval. VISA converts visual content into dense natural-language descriptions using off-the-shelf large multimodal models (LMMs). It then refines these descriptions via a question-answering module that leverages chain-of-thought prompting to capture query-specific details. The approach addresses two main challenges: (1) filtering out low-level, redundant visual details and (2) addressing granularity mismatches between textual queries and visual content. Extensive experiments on both image (e.g., MS-COCO, Urban1k) and video (e.g., MSR-VTT, LSMDC) datasets demonstrate that VISA improves retrieval performance over several state-of-the-art methods.
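To make the pipeline in this summary concrete, here is a minimal, hypothetical sketch of the test-time flow (caption gallery items offline, refine with query-specific QA, then rank in text space). The helper callables `caption_image`, `generate_questions`, `answer_question`, and `embed_text` are placeholders for the off-the-shelf LMM, LLM, and text-retriever calls; they are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of the VISA-style test-time pipeline summarized above.
# caption_image / generate_questions / answer_question / embed_text are
# placeholders for off-the-shelf LMM, LLM, and text-retriever calls.
import numpy as np

def build_gallery_descriptions(images, caption_image):
    """Offline: abstract each gallery image into a dense textual description."""
    return [caption_image(img) for img in images]

def refine_with_qa(query, description, image, generate_questions, answer_question):
    """Online: append query-specific QA pairs to the general description."""
    qa_pairs = []
    for question in generate_questions(query, description):
        answer = answer_question(image, question)    # e.g., "Yes", "No", "Uncertain"
        if answer != "Uncertain":                    # uninformative answers are dropped
            qa_pairs.append(f"Q: {question} A: {answer}")
    return description + " " + " ".join(qa_pairs)

def rank_gallery(query, refined_descriptions, embed_text, top_k=5):
    """Rank gallery items by text-to-text similarity in the language space."""
    q = embed_text(query)                            # assumed to return a 1-D numpy vector
    scores = [float(np.dot(q, embed_text(d))) for d in refined_descriptions]
    return np.argsort(scores)[::-1][:top_k]
```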

Questions for Authors

  1. How does your method compare with multimodal retrieval methods that use LMMs?

Claims and Evidence

The paper claims that converting images and videos into rich textual representations can boost retrieval performance. This claim is supported by comprehensive experimental results, e.g., consistent improvements in recall metrics across diverse datasets and detailed ablation studies that highlight the contribution of each component (general description, QA refinement, chain-of-thought). Overall, the evidence is clear and convincing.

Methods and Evaluation Criteria

The proposed method is well-motivated and leverages the strengths of existing LMMs and LLMs without requiring additional training. The experimental design is thorough, employing both short-context and long-context retrieval tasks with appropriate benchmarks and evaluation metrics. This makes the method not only conceptually sound but also practical for real-world applications.

Theoretical Claims

There are no theoretical claims in this work, which is consistent with the paper’s focus on a practical, empirically driven solution.

Experimental Design and Analysis

The authors compare VISA against multiple baselines on a range of datasets and perform extensive ablation studies that demonstrate the robustness and effectiveness of the proposed components. The analysis, including comparisons of inference time, provides a well-rounded view of the method.

Supplementary Material

The supplementary material contains additional visualizations, prompt details, and extended ablation results.

Relation to Existing Literature

The paper discusses the current literature on vision-language models, such as CLIP, Frozen, and recent works on fine-grained retrieval enhancement. It builds on prior findings by addressing the inherent limitations of training-based approaches and provides a novel test-time alternative.

Missing Important References

While the reference list appears comprehensive, a discussion of some very recent works on multimodal retrieval methods that use LMMs might further strengthen the literature context. However, no critical prior work seems to be omitted.

Other Strengths and Weaknesses

Strengths:

– The paper presents an innovative plug-and-play framework that sidesteps the need for expensive retraining. It integrates with existing vision-language models without additional training and leverages pre-trained models at test time, significantly reducing computational costs.

– The paper features extensive experimental validation and rigorous ablation studies. The method is evaluated on multiple benchmark datasets, and the ablation studies clearly demonstrate the contribution of each module.

– The method achieves clear improvements over established baselines across diverse datasets. Empirical results show consistent gains in key metrics on both image and video retrieval tasks, highlighting its effectiveness in addressing semantic redundancy and granularity mismatches.

Weaknesses:

– The reliance on large off-the-shelf models at test time could raise practical deployment concerns regarding computational latency.

– The paper lacks experiments evaluating the method from an image-to-text perspective. Including standard I2T benchmarks would provide additional insights into the robustness and generality of the approach.

Other Comments or Suggestions

NA

Author Response

Q1: Concerns about computational latency

We would like to clarify that our method is compatible with smaller, more efficient retrievers, and we have observed that models with significantly fewer parameters (e.g., 400M or 1.5B) still achieve comparable performance (see our response to Reviewer 1ri6, Q1).

In addition, the general description module can be executed entirely offline, prior to deployment. This design ensures that it incurs no cost during online inference. As noted in our response to Reviewer 1FJx (Q1), even when the QA component is omitted, VISA still delivers competitive performance relative to existing VLMs. In that case, only the text retriever operates in real time, and its computational overhead can be adjusted by choosing a model of a suitable size.

Together, the use of lightweight retrievers and offline gallery processing makes VISA practically deployable and efficient, even in resource-constrained settings.

Q2: Image-to-text (I2T) retrieval experiments

We initially did not include image-to-text (I2T) experiments because our primary focus is on supporting diverse user retrieval demands with varying levels of query granularity, a scenario more naturally aligned with T2I settings.

To address this concern, we conducted image-to-text (I2T) retrieval experiments on Flickr30K, MSR-VTT, and Urban1K, using the same retrieval setup and backbones as in the T2I setting. Specifically, for each query image, we first retrieve the top-20 candidate text descriptions. For each candidate, we generate three questions and append the corresponding answers derived from the image. Finally, we re-rank the candidates using text-level retrieval.
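For clarity, the re-ranking recipe described in the previous paragraph could look roughly like the following sketch. The helper callables (`first_stage_retrieve`, `caption_image`, `generate_questions`, `answer_from_image`, `text_similarity`) are hypothetical placeholders, and using a caption of the query image as the text-side query in step 3 is our assumption, since the response only states that candidates are re-ranked with text-level retrieval.

```python
# Hypothetical sketch of the I2T re-ranking recipe described above; all helper
# callables are placeholders, not the authors' code.

def rerank_i2t(query_image, gallery_texts, first_stage_retrieve, caption_image,
               generate_questions, answer_from_image, text_similarity,
               top_n=20, n_questions=3):
    # 1) First stage: top-20 candidate captions from the base VLM retriever.
    candidates = first_stage_retrieve(query_image, gallery_texts, top_n)

    # 2) Enrich each candidate with three QA pairs answered from the query image.
    enriched = []
    for cand in candidates:
        questions = generate_questions(cand, n_questions)
        qa = " ".join(f"Q: {q} A: {answer_from_image(query_image, q)}" for q in questions)
        enriched.append((cand, cand + " " + qa))

    # 3) Text-level re-ranking. Assumption: the query image is abstracted into a
    #    description and compared with the enriched candidates in text space.
    query_text = caption_image(query_image)
    ranked = sorted(enriched, key=lambda ce: text_similarity(query_text, ce[1]), reverse=True)
    return [cand for cand, _ in ranked]
```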

Flickr30K (R@1|R@5|R@10)
SigLIP                  94.4 | 99.7 | 99.8
SigLIP + VISA           95.0 | 99.8 | 100.0
BLIP-2                  97.6 | 100.0 | 100.0
BLIP-2 + VISA           97.9 | 100.0 | 100.0

MSR-VTT (R@1|R@5|R@10)
InternVideo2-G          49.6 | 73.4 | 81.0
InternVideo2-G + VISA   54.4 | 78.0 | 84.8

Urban1K (R@1|R@5|R@10)
LoTLIP                  89.6 | 97.8 | 98.9
LoTLIP + VISA           93.7 | 99.0 | 99.5

These consistent improvements across datasets and backbones confirm that VISA not only enhances T2I retrieval, but also generalizes effectively to I2T retrieval.

Q3: Comparison with multimodal retrieval methods that use LMMs

As there are only a few works directly built on LMMs, we have compared our method with RAGVL in Table 1. On Flickr30K, our approach achieves higher performance, with R@1 = 85.1 compared to RAGVL's 84.4, and further improves to 86.1 when using EVA-CLIP. The plug-and-play design does not require end-to-end fine-tuning of large models.

Review (Rating: 3)

This paper introduces Visual Abstraction (VISA), a novel test-time approach for enhancing text-to-visual retrieval by converting visual content into textual descriptions using large pre-trained models. VISA utilizes a question-answering mechanism to refine these descriptions to match the granularity of user queries accurately. The approach demonstrates superior performance over existing methods across several benchmark datasets for both image and video retrieval tasks.

Questions for Authors

  1. This work integrates a text model into existing cross-modal frameworks. Could you explore the performance outcomes of integrating two multimodal models (VLMs) of similar parameter scales?
  2. In the question-answering process, how does the system distinguish between definitive 'No' and 'Uncertain' responses? What implications do these responses have on the retrieval process?
  3. How does VISA handle queries that could be interpreted in multiple valid ways visually? Are there specific mechanisms or algorithms within VISA that help resolve or manage such ambiguities? The paper does not fully address how VISA manages ambiguous queries where multiple visual interpretations are possible.

Claims and Evidence

The paper's claims about its effectiveness in improving text-to-visual retrieval are well supported by extensive experimental results. The authors present a comprehensive set of experiments and comparisons that demonstrate the approach's superiority over existing state-of-the-art methods across various datasets and query types.

Methods and Evaluation Criteria

The proposed methods are logically sound and well-aligned with the problems at hand. The evaluation is conducted using standard benchmark datasets like COCO, Flickr30K, and others, which are appropriate for the task. The use of large pre-trained models and a novel question-answering refinement process adds to the methodological robustness.

Theoretical Claims

NA

Experimental Design and Analysis

The experimental evaluations are thorough, with detailed ablation studies and comparisons to baseline and state-of-the-art models. The authors effectively demonstrate the benefits of their approach using both quantitative metrics and qualitative examples.

Supplementary Material

The supplementary material provides additional resources that support the paper's claims; the visualization results are interesting.

Relation to Existing Literature

The paper is related to vision-language pre-training and retrieval tasks.

Missing Important References

I do not identify any critical references or related works that are missing from the discussion.

Other Strengths and Weaknesses

This paper introduces a unique test-time enhancement for text-to-visual retrieval by converting visual content into textual descriptions. This method diverges from traditional training-time modifications, offering a fresh perspective on improving retrieval effectiveness. The plug-and-play nature of VISA allows it to be seamlessly integrated with existing vision-language models. This compatibility is a significant advantage for enhancing a retrieval system without extensive redevelopment.

However, the paper has several limitations:

  1. The paper primarily focuses on integrating text-based retrieval enhancements into existing cross-modal frameworks but does not explore the potential of integrating two multimodal models.
  2. The differences between 'No' and 'Uncertain' responses in the QA process are not well-explained. It is unclear whether these responses are treated differently in the retrieval ranking process.

Other Comments or Suggestions

NA

Author Response

Q1: Integrating VLMs

Thank you for this insightful suggestion. To explore this idea, we integrate SigLIP (as the base model) with EVA-CLIP (18B parameters), which has a significantly higher parameter count than the text retrieval model gemma-9B. This hybrid setup yields notable performance gains over SigLIP alone:

Method                  COCO (R@1|R@5|R@10)    Flickr (R@1|R@5|R@10)
SigLIP                  54.2 | 76.8 | 84.2     83.0 | 96.1 | 98.0
SigLIP + EVA-CLIP       56.0 | 79.1 | 85.9     84.1 | 96.7 | 98.3
SigLIP + VISA (Ours)    57.2 | 80.3 | 86.9     85.1 | 97.1 | 98.6

Notably, our proposed method achieves even greater improvements on both datasets. This suggests that generative visual abstraction offers more effective enhancement than simply fusing large VLMs. Importantly, VISA maintains its plug-and-play design, requiring no model retraining or architecture modification. Additionally, as detailed in our response to Reviewer 1ri6 (Q1), our method is compatible with smaller, more efficient text retrievers (e.g., 400M and 1.5B parameters), which also achieve strong performance.
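For reference, "integrating two VLMs" as in the SigLIP + EVA-CLIP row above is typically implemented as score-level fusion. The sketch below is a hypothetical illustration that assumes a simple weighted average of cosine similarities, since the response does not specify the fusion rule; the `encode_image` / `encode_text` callables are placeholders.

```python
# Hypothetical score-level fusion of two VLM retrievers (e.g., SigLIP + EVA-CLIP).
# The fusion rule (weighted average of cosine similarities) is an assumption.
import numpy as np

def fused_scores(query, images, model_a, model_b, weight_a=0.5):
    """Each model is a dict with 'encode_text' and 'encode_image' callables."""
    def cosine_scores(model):
        t = model["encode_text"](query)                              # (d,)
        v = np.stack([model["encode_image"](im) for im in images])   # (N, d)
        t = t / np.linalg.norm(t)
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        return v @ t                                                 # (N,) cosine similarities
    return weight_a * cosine_scores(model_a) + (1 - weight_a) * cosine_scores(model_b)
```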

Q2: Explanation of the 'No' and 'Uncertain' responses

Our core motivation is to abstract visual content into textual descriptions, enabling more effective reasoning and relationship modeling in the text space. In the question-answering module, we distinguish between two types of negative responses: "No" and "Uncertain", each serving a different purpose:

  • "No" is used when the question refers to a subject that is not present in the image, or when a specific condition, action, or attribute does not apply. For example, if the image shows a person in red shirt playing basketball and the question asks, "Is the person in red shirt playing football?", the correct response would be "No". This type of response provides explicit negative evidence which helps refine semantic alignment.
  • "Uncertain" is used when the question cannot be reliably answered based on the visual content, e.g., asking about the shirt color of someone playing football when no such activity is depicted. In these cases, the response is considered uninformative, and the corresponding QA pair is discarded from downstream processing.

Q3: How VISA manages ambiguous queries where multiple visual interpretations are possible

VISA incorporates LLMs to manage query ambiguities by leveraging context to disambiguate meanings. For example, in a query like "How to do something in windows?" (as shown in Figure 5 of the main paper), the surrounding context helps the model infer that "windows" refers to the Windows operating system rather than glass windows. In the future, we plan to further enhance this capability with a reflection mechanism that checks the visual interpretations.

Review (Rating: 4)

This paper proposes Visual Abstraction (VISA), a plug-and-play approach designed to enhance text-to-visual retrieval. Unlike traditional retrieval methods that operate in a cross-modal embedding space, VISA transforms visual content into textual descriptions using off-the-shelf large models. This transformation filters out redundant low-level visual details. Additionally, VISA incorporates a question-answering mechanism to refine descriptions based on user-specified granularity. Extensive experiments demonstrate that VISA significantly improves retrieval performance across text-to-image and text-to-video tasks, outperforming state-of-the-art models on both short- and long-context queries. The approach requires no additional training and can be seamlessly integrated into existing retrieval systems.

Questions for Authors

Please refer to weaknesses.

Claims and Evidence

Claim 1: VISA enhances retrieval by converting visual content into text, filtering out redundant details, and improving alignment with textual queries.

Evidence: Experiments show that VISA improves recall@1 across multiple datasets, demonstrating better retrieval performance.

Claim 2: The question-answering process helps refine descriptions, ensuring a more precise match to user queries.

Evidence: Ablation studies indicate that removing the QA process reduces retrieval accuracy, confirming its contribution to performance.

Methods and Evaluation Criteria

After carefully checking the manuscript, the proposed methods and evaluation criteria make sense for text-to-visual retrieval.

Theoretical Claims

There is no theoretical claim.

Experimental Design and Analysis

Experimental evaluations are comprehensive.

Supplementary Material

The supplementary material is well-organized.

Relation to Existing Literature

VISA relates to prior vision-language pretraining models such as CLIP, EVA-CLIP, and FLAME, but differs by transforming retrieval into a text-enhancement problem.

Missing Important References

The paper sufficiently covers relevant references.

Other Strengths and Weaknesses

Strengths:

  1. VISA introduces a plug-and-play test-time retrieval enhancement, avoiding the need for expensive retraining.
  2. VISA is model-agnostic and can enhance retrieval performance without modifying existing models.
  3. Experimental results show consistent recall@1,5,10 gains across multiple datasets.

Weaknesses:

  1. Although VISA is plug-and-play, the text-based retrieval process introduces additional computational overhead. Does text retrieval necessarily require using LLMs with over 7B parameters? This parameter scale is significantly larger than most vision-language models (VLMs), which may raise concerns about efficiency and deployment feasibility.
  2. What is the impact of abstraction length on retrieval performance? Does generating longer textual descriptions always improve accuracy, or is there an optimal level of abstraction that balances efficiency and effectiveness?
  3. Can VISA be extended beyond text-to-visual retrieval? Could this approach be generalized to other cross-modal tasks while maintaining its plug-and-play nature?
  4. The paper lacks image-to-text (I2T) retrieval experiments. While the work focuses on text-to-image retrieval from a user-driven perspective, cross-modal retrieval generally includes both T2I and I2T tasks. It would be valuable to see how VISA performs in the reverse retrieval direction.

Other Comments or Suggestions

The paper lacks citations for InternVideo in Table 4. Please ensure that all referenced methods are properly cited to maintain clarity.

Ethics Review Issues

No

Author Response

Q1: Efficiency concerns

Thanks for the suggestion regarding the efficiency of using large text retrievers. Importantly, VISA does not require LLMs with over 7B parameters to work effectively. To demonstrate this, we conducted experiments using smaller text retrievers with 400M and 1.5B parameters, and the results are summarized below:

Text Retriever   Flickr30K (R@1|R@5)   MSR-VTT (R@1|R@5)   Urban1K (R@1|R@5)
None             83.1 | 95.8           52.0 | 74.6         85.9 | 97.1
stella-400M      85.2 | 97.2           53.4 | 75.0         93.3 | 98.9
stella-1.5B      85.4 | 97.1           53.6 | 75.4         93.4 | 98.9
gemma2-9B        86.1 | 97.3           54.4 | 75.3         94.6 | 99.4

As shown, even relatively small models such as Stella-400M yield significant performance gains over the baseline without a text retriever. This demonstrates that the performance gains from VISA are not solely due to model size, but rather arise from the integration of visual abstraction into the retrieval pipeline. These findings reinforce the generalizability and flexibility of VISA. In scenarios where inference efficiency is a priority, smaller and more efficient models can be adopted without substantial performance loss. For additional discussion on latency and FLOPs, please refer to our response to Q1 of Reviewer 1FJx.

Q2: Impact of abstraction length

We explored how varying the length of the generated general descriptions affects retrieval performance. Specifically, we modified the prompt for the captioning model to: “Please generate descriptions of the given image in approximately {num} words.” The corresponding results are available at the anonymous URL: https://imgur.com/a/amTI9n5.

Our results show that longer visual captions generally improve retrieval performance, as they encode richer semantic information such as object attributes, spatial relationships, and contextual details. However, we also observe a clear performance saturation beyond a certain caption length. That is, once the caption adequately captures the essential visual content, adding more tokens results in diminishing returns. Importantly, the optimal caption length depends on the complexity of the query. For short-context queries (e.g., Flickr30K), moderately long captions are sufficient to achieve strong performance. In contrast, long-context queries (e.g., Urban1K) benefit from longer, more detailed captions that better capture fine-grained visual elements.
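For illustration, the length sweep described above can be reproduced with a simple prompt template; only the template itself is quoted from the response, while `caption_model` and the specific target lengths are hypothetical placeholders.

```python
# Minimal sketch of the caption-length sweep described above. `caption_model`
# is a placeholder for the captioning LMM; the target lengths are illustrative.

PROMPT_TEMPLATE = ("Please generate descriptions of the given image "
                   "in approximately {num} words.")

def sweep_caption_lengths(image, caption_model, lengths=(30, 60, 120, 240)):
    """Generate captions at several target lengths to study the saturation effect."""
    return {n: caption_model(image, PROMPT_TEMPLATE.format(num=n)) for n in lengths}
```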

Q3: Using VISA on other cross-modal tasks

We believe that the visual abstraction mechanism is broadly applicable to other cross-modal tasks. For example:

  • Video Moment Retrieval: By segmenting a video into temporal clips and generating visual abstractions for each segment, one can match a textual query to the semantic descriptions of these segments, thereby localizing relevant time intervals (a rough sketch follows this list).

  • Text-based Re-identification and Composed Image Retrieval (CIR): These tasks can benefit from generating abstractions for both gallery items and query items (such as image queries in CIR) and comparing them in the language space.
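As a rough illustration of the first extension, here is a hypothetical sketch of moment localization via per-clip abstractions; `caption_clip` and `text_similarity` are placeholder callables, and nothing here is taken from the paper itself.

```python
# Hypothetical sketch of VISA-style video moment retrieval via per-clip abstractions.
# caption_clip and text_similarity are placeholders for an off-the-shelf video
# captioner and a text retriever.

def localize_moment(query, clips, caption_clip, text_similarity):
    """Return the (start, end) of the clip whose abstraction best matches the query.

    `clips` is a list of (start_time, end_time, clip_frames) tuples.
    """
    best_span, best_score = None, float("-inf")
    for start, end, frames in clips:
        description = caption_clip(frames)            # abstract the clip into text
        score = text_similarity(query, description)   # compare in the language space
        if score > best_score:
            best_span, best_score = (start, end), score
    return best_span
```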

We plan to include this discussion in the next version.

Q4: Image-to-text (I2T) retrieval

We initially did not include image-to-text (I2T) experiments because our primary focus is on supporting diverse user retrieval demands with varying levels of query granularity, a scenario more naturally aligned with T2I settings.

To address this concern, we conducted image-to-text (I2T) retrieval experiments on Flickr30K, MSR-VTT, and Urban1K, using the same retrieval setup and backbones as in the T2I setting. Specifically, for each query image, we first retrieve the top-20 candidate text descriptions. For each candidate, we generate three questions and append the corresponding answers derived from the image. Finally, we re-rank the candidates using text-level retrieval.

Flickr30K (R@1|R@5|R@10)
SigLIP                  94.4 | 99.7 | 99.8
SigLIP + VISA           95.0 | 99.8 | 100.0
BLIP-2                  97.6 | 100.0 | 100.0
BLIP-2 + VISA           97.9 | 100.0 | 100.0

MSR-VTT (R@1|R@5|R@10)
InternVideo2-G          49.6 | 73.4 | 81.0
InternVideo2-G + VISA   54.4 | 78.0 | 84.8

Urban1K (R@1|R@5|R@10)
LoTLIP                  89.6 | 97.8 | 98.9
LoTLIP + VISA           93.7 | 99.0 | 99.5

These consistent improvements across datasets and backbones confirm that VISA not only enhances T2I retrieval, but also generalizes effectively to I2T retrieval.

Review (Rating: 1)

The paper studies the problem of text-to-visual retrieval, which involves both text-to-image retrieval and text-to-video retrieval. The authors propose a framework to enhance the retrieval via converting the visual content to the text domain, and then do the retrieval. Experiments show improvement of the proposed method over several previous models.

Questions for Authors

  1. Is it possible to include the FLOPs comparison of the proposed method and the original model?
  2. Is it possible to include BLIP-2 results for text-to-image retrieval and compare with it?

Claims and Evidence

The authors claim they achieve state-of-the-art performance, while their numbers for text-to-image retrieval are lower than those of the actual state of the art, BLIP-2.

Methods and Evaluation Criteria

The proposed method uses an LLM for better text-to-image retrieval. One major advantage of using CLIP/SigLIP for text-to-image retrieval is efficiency. But with an LLM introduced and run multiple times, as in the proposed method, the FLOPs may be significantly increased, which goes against the motivation of developing an efficient retrieval system.

Also, on the evaluation side, the authors do not compare with state-of-the-art architectures, such as BLIP-2.

Theoretical Claims

N/A

Experimental Design and Analysis

The authors do not compare with state-of-the-art architectures, such as BLIP-2, while claiming to achieve state-of-the-art results. This is a significant problem. In fact, their numbers for text-to-image retrieval are lower than BLIP-2's.

Supplementary Material

No

Relation to Existing Literature

Text-to-visual retrieval is an important task, and effective innovations for the task will be of wide interest. But my concern is that the authors have not honestly compared with the well-known state-of-the-art models while making such a claim. Besides, they have not considered the efficiency of retrieval, which is a very important factor in the community.

Missing Important References

The authors have not properly compared their methods with state-of-the-art BLIP-2.

Other Strengths and Weaknesses

I think the major weakness of the paper is the evaluation and method design.

Other Comments or Suggestions

N/A

Author Response

Q1: Efficiency Concerns (FLOPs and latency)

Thank you for the valuable suggestions. We include the FLOPs and latency comparisons on the Flickr dataset below. For clarity, we divide the retrieval process into two stages:

  • Offline stage precomputes visual features (via VLM) and generates general descriptions (via VISA) for the gallery candidates.
  • Online stage performs the text encoding (via VLM), QA-based description refinement and text-level re-ranking (both via VISA) during inference.
Offline stage
  Module                                   GFLOPs       Latency (s / dataset)
  VLM (SigLIP / EVA-CLIP)                  335 / 4560   4 / 18
  General Description (LLaVA-v1.6-34B)     138000       437.5

Online stage
  Module                                   GFLOPs       Latency (s / query)
  VLM (SigLIP / EVA-CLIP)                  26 / 9       0.0002 / 0.0029
  Question Generator (Qwen2.5-32B)         26870        0.02
  Answer Generator (Qwen2VL-7B)            15450        1.00
  Text Retriever (gemma-9B)                4160         0.13

In the offline stage, generating general descriptions using large multimodal models (LMMs) indeed incurs high FLOPs (138,000 vs. 335 GFLOPs for SigLIP, approximately 400×). However, this process can be executed ahead of time and does not affect real-time inference latency. In the online stage, inference overhead is significantly reduced through techniques such as KV caching and parallel execution, supported by efficient frameworks such as SGLang. In practice, this results in only ~1 second of additional latency on an NVIDIA 4090 server compared to the original SigLIP pipeline.

We acknowledge that VISA introduces a trade-off between efficiency and performance, and it may not be ideal for strict low-latency applications. However, we would like to emphasize two key points:

  • Test-Time Augmentation Paradigm. To our knowledge, VISA might be the first work to systematically explore test-time augmentation for VLM-based retrieval. This paradigm aligns with trends in LLM research (e.g., test-time scaling), where additional compute at inference yields significant performance gains. Specifically, VISA achieves average R@1 improvements of +2.3% (video) and +6.7% (long-context image retrieval). We also include BLIP-2 comparisons below, showing VISA improves over BLIP-2 by +0.6% (COCO) and +0.5% (Flickr30K) in R@1.
  • Flexible Trade-Offs. Even using only the General Description module (without QA) yields strong performance boosts of +2.5%, +1.3%, and +8.3% R@1 across three tasks (see Table 6(a)), offering a lighter-weight deployment option (0.13 second) when needed.

Q2: Comparison to BLIP-2

Thanks for highlighting this important baseline. We initially did not compare against BLIP-2 because our evaluation follows the widely adopted zero-shot retrieval protocol in recent VLM research. In contrast, BLIP-2 is finetuned on COCO and then evaluated on COCO and Flickr, making it less directly comparable to the evaluation setup in Table 1.

Here we include BLIP-2 in our evaluation (referencing Table 5 from the BLIP-2 paper). As shown below, VISA continues to yield improvements when applied on top of BLIP-2. This demonstrates the compatibility and effectiveness of our method even when applied to strong VLMs like BLIP-2.

Method                 COCO (R@1|R@5|R@10)    Flickr (R@1|R@5|R@10)
BLIP                   65.1 | 86.3 | 91.8     86.7 | 97.3 | 98.7
BLIP-2                 68.3 | 87.7 | 92.6     89.7 | 98.1 | 98.9
BLIP-2 + VISA (Ours)   68.9 | 88.0 | 92.9     90.2 | 98.4 | 99.2

More broadly, while BLIP-2 is highly competitive for short-query retrieval tasks such as COCO and Flickr30K, our method demonstrates notable improvements on long-text queries and video retrieval tasks as well, across 10 widely-used datasets. This highlights the generality of VISA as a plug-and-play enhancement applicable to a range of backbone models.

Reviewer Comment

I thank the authors for the rebuttal. I have read the rebuttal and the other reviewers' comments. The biggest concerns I raised have still not been addressed, and they are shared by other reviewers (1ri6, nqY2).

In the rebuttal, the authors provided some comparisons regarding efficiency, but there are several points I want to bring to the AC's and other reviewers' attention:

  • (i) The cost of the offline stage of the method is over 400 times higher than the baseline, which introduces a huge burden if the number of retrieval samples becomes large; it is definitely something we cannot ignore.
  • (ii) The authors make the point that the method 'offer[s] a lighter-weight deployment option (0.13 second) when needed', but they do not mention that this is actually 650 times the cost of the baseline. This again brings significantly more cost when there are many samples.
  • (iii) The authors mention things that are irrelevant to their point, which is misleading. For example, they cite efficient serving frameworks, but these would improve all architectures equally. The significantly higher cost the proposed method introduces is essentially unchanged.

The architecture the authors propose is very complicated compared with the original CLIP/SigLIP architectures, but only gives a marginal boost over the state-of-the-art BLIP-2. The reason the authors provided in the rebuttal is not convincing to me: although BLIP-2 is fine-tuned on COCO, it is still zero-shot on Flickr, and it is on the leaderboard of Papers with Code. I suspect that the authors hid the comparison with BLIP-2 because the improvement is much smaller than over the other backbones. It is not honest to make the state-of-the-art claim in the submission.

Additionally, can the authors guarantee that the LLMs they used are zero-shot on COCO and Flickr? This raises another significant concern that the entire method the authors propose does not make a fair comparison with the previous models.

Essentially, I don’t think the paper is making a proper contribution to the text-to-visual retrieval community.

Given the above reasons, I think this paper is not strong enough and not ready for ICML, and I therefore vote for strong rejection of the paper.

Final Decision

This paper presents a novel method named Visual Abstraction (VISA), a plug-and-play approach designed to enhance text-to-visual retrieval. In particular, this approach transforms visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. Finally, the approach adopts a question-answering process to enhance the text descriptions with the specific granularity information requested by the user.

Reviewers agreed that this paper presents a novel approach for test-time text-visual retrieval. The paper is well written and easy to follow. Reviewers raised some questions about baselines, image-to-text experiments, and efficiency, most of which have been well addressed by the detailed responses from the authors. One reviewer remained concerned about efficiency and the comparison with BLIP-2. However, efficiency is not claimed as a contribution of this work, and there is often a trade-off between inference-time cost and performance.

The authors are strongly encouraged to incorporate the new results and discussions to their camera-ready version. For instance, the authors may consider adding the results of BLIP-2 to the main table in shaded gray (similar to the InternVL paper that also points out the improved performance of BLIP-2 on Flickr30K due to finetuning on COCO).