We thank you for the positive review and constructive comments.

The main weakness of the work is the superficial ablation of the VESL captioning pipeline introduced.

Thank you for your constructive comments, and we will consider better examples in our figures. We agree with the reviewer that image captioning could suffer from hallucination, though it enriches the visual description. The reviewer is also correct that our pipeline filters such cases by relying on the object detector, which is quite effective as it was pre-trained on over thousands of common objects. A central design challenge for our pipeline is balancing the benefits of richer captions with the potential risk of hallucination. We believe our design is robust, considering the effectiveness of current state-of-the-art open-vocabulary object detectors and the inherent resilience of contrastive learning objectives to noisy text annotations in large-scale training.

In Table 2 (rows 3-5 vs. 13-15), we compare with the baseline that does not use the image captioner but the AltText (e.g., left of Figure 3). It could be hard to extract useful object phrases from AltText for the open-vocab detector.

For the hallucination issue, we have investigated our captioning pipeline and compared it with other recent works to show its superiority in terms of the hallucination. We compared it to public models: LLaVA-1.5, Shikra, MiniGPT-4, and InstructBLIP on CHAIR scores [1] (lower values refer to less hallucination). CHAIR_i measures the fraction of hallucinated object instances and CHAIR_s calculates the fraction of sentences containing at least one hallucinated object. The results are summarized below:

Captioner	CHAIR_i	CHAIR_s
InstructBLIP	14.5	30.0
MiniGPT-4	8.2	24.2
Shikra	7.0	22.0
LLaVA-1.5	6.2	20.6
Ours	5.9	19.6

While the long captions from our captioning pipeline may still unavoidably introduce hallucinations even though our captioning pipeline has less hallucinations compared to other models, it can be further mitigated since we only consider confident objects agreed by the detector (L296 left). We also remove very generic words and stopwords, as noted in the code in Listing 1 in the appendix. We believe having more accurate object labels is the key to the improvements in Table 2 for our pipeline, as evidenced by the statistics 11.6 regions per image identified by the pipeline (only 5.1 for the baseline) we report in Table 1 (L275 right).

lines 260-262 (right column): are the pairs ignored or are the gradients on ignored?

For filtering region-text conflicts in Section 3.4, the region-text pairs are ignored in Equation 2. That is, these elements are “masked” and will not be considered in the contrastive loss matrix. We will make this clearer in the final version.

Remarks on CLIP

We apologize for any confusion regarding this statement. Our reference to “CLIP” in the context of MLLMs was intended to denote the broader family of language-supervised methods, including both CLIP (as a representative model) and SigLIP. Specifically, we were citing the first row block in Table 12 of Section D (Tong, 2024), which demonstrates that these methods outperform others, such as self-supervised approaches, in MLLMs. We will make this clearer and revise both the introduction and related works as you suggested.

Typos

Thank you for pointing out the typos, and we have fixed them in the revised manuscript.

[1] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T. and Saenko, K., 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156.