PaperHub
Overall: 7.8 / 10
Poster, 4 reviewers
Ratings: 4, 3, 5, 4 (min 3, max 5, std 0.7)
ICML 2025

Contrastive Localized Language-Image Pre-Training

OpenReview | PDF
Submitted: 2025-01-23 | Updated: 2025-07-24

Abstract

Keywords
CLIP, MLLM, Foundation Models

Reviews and Discussion

Review (Rating: 4)

This paper explores a data-driven approach to enhance the regional representation capabilities of CLIP. The authors designed a data annotation pipeline to expand regional-level annotations and developed a training architecture featuring a Prompter. This architecture enables more effective utilization of the annotated data for fine-grained training. Experimental results, particularly those obtained under certain MLLM settings, demonstrate the advantages of the proposed method.

Questions for the Authors

Please refer to the 'Weaknesses' part.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes

Theoretical Claims

No proofs for theoretical claims.

Experimental Design and Analysis

The experimental designs and analyses in the paper have several sound aspects, but also some areas that could be further explored. Please refer to the 'Weaknesses' part for more details.

Supplementary Material

No supplementary material.

Relation to Prior Literature

Compared with previous work, the main contributions of this paper are: 1) construction of a large-scale region-text dataset; 2) design of a fine-grained pre-training paradigm; 3) exploration of scalability in MLLM scenarios.

Essential References Not Discussed

Please refer to the 'Weaknesses' part for more details.

Other Strengths and Weaknesses

I believe this is a high-quality paper, and I greatly appreciate the authors' contributions in terms of the data pipeline, training architecture, and MLLM evaluation. However, I have the following concerns:

  1. UMG-CLIP [ECCV 2024] focuses on similar issues as this paper. Nevertheless, there is a lack of comparison with it in the authors' discussion. I expect to see more in-depth discussions, including: a) Data annotation process. UMG-CLIP first uses an open-vocabulary detector to predict bounding boxes and then generates descriptions for each box. In contrast, this paper first identifies entities and then predicts bounding boxes for each entity (somewhat similar to RegionCLIP). The advantages and disadvantages of these two pipelines need to be explored. In particular, I noticed that UMG-CLIP claims to have good accuracy for bounding boxes (Table 14 in their paper), and its region-level descriptions seem to be more detailed (Figure 3 in their paper). b) Training architecture. The main difference between this paper and UMG-CLIP appears to be the replacement of ROI-ALIGN with a prompter. The authors claim that this is because the inaccurate pseudo-annotations limit the effectiveness of ROI-ALIGN. If this is the case, could ROI-ALIGN be considered for use when ground-truth annotations are available? Additionally, the paper lacks an evaluation of annotation accuracy. Does this imply that the bounding boxes generated by this paper are of poor quality, and the training architecture is a compromise due to the low-quality data?
  2. I look forward to seeing more validation results, such as experimental verifications in open-world detection (similar to RegionCLIP) or segmentation scenarios (similar to UMG-CLIP).

Other Comments or Suggestions

Please refer to the 'Weaknesses' part.

Author Response

We thank you for the positive review and constructive comments.

UMG-CLIP

Thank you for pointing out this reference, which we briefly compare against at L363 (left). We agree that there is some technical similarity between our CLOC and UMG-CLIP, but the goals and positioning of the two works differ in the following respects.

UMG-CLIP is designed primarily for vision-centric tasks such as detection and segmentation, whereas our focus is on MLLMs (L33, right in the Introduction). Unlike open-vocabulary vision tasks, which involve a relatively small set of classes, MLLM VQA tasks require more extensive language understanding and thus demand large-scale pre-training data. This distinction also influences the design of our prompter architecture, which incorporates attention layers tailored to the downstream use cases of the LLM decoder (further discussed in L256, left).

Regarding the annotation pipeline, the primary difference lies in scale. UMG-CLIP fine-tunes a pre-trained CLIP model for dense vision tasks on a 41M-image dataset, whereas our approach pre-trains from scratch on up to 2B images. Additionally, unlike UMG-CLIP and RegionCLIP, which first predict bounding boxes and then run a captioner on each box, our VESL pipeline (Section 4) does not scale with the number of boxes, making it significantly more efficient for data annotation. Notably, annotating billions of images for our experiments required more than 500 GPUs for over a week (L757, right).

Last but not least, we see the UMG-CLIP fine-tuning approach as complementary to our CLOC pre-training rather than conflicting with it.
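
To make the efficiency point in the annotation-pipeline paragraph above concrete, here is a minimal sketch of a VESL-style labeling pass. The captioner, phrase extractor, and open-vocabulary detector are hypothetical callables standing in for the actual components; this is not the paper's API.

```python
# Hypothetical helper names (caption_image, extract_object_phrases, detect_phrases)
# stand in for the captioner, the entity/phrase extraction step, and the
# open-vocabulary detector; none of them are the paper's actual components.

def vesl_label_image(image, caption_image, extract_object_phrases, detect_phrases,
                     score_threshold=0.5):
    """Sketch of a VESL-style pass: one captioner call and one detector call
    per image, so the annotation cost does not grow with the number of boxes."""
    caption = caption_image(image)                 # visually-enriched caption
    phrases = extract_object_phrases(caption)      # entity / noun-phrase candidates
    detections = detect_phrases(image, phrases)    # single open-vocab detector call
    region_text_pairs = [
        (det["box"], det["phrase"])
        for det in detections
        if det["score"] >= score_threshold         # keep confident regions only
    ]
    return region_text_pairs
```

In contrast, a box-first pipeline would run the captioner once per detected box, so its cost scales linearly with the number of regions per image.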

Open-world detection

We thank you for the advice on evaluating our encoder on open-vocab dense vision tasks. First, we want to emphasize that our original motivation for this work is MLLM tasks with localization use cases, such as conversational referring and grounding (e.g., Table 4), rather than dense vision tasks. In Table 2, we report competitive results on zero-shot object recognition and retrieval given bounding boxes.

To further address your concern, for open-vocabulary detection, we provide additional zero-shot evaluation results on COCO Detection (minival), ODinW (test-dev), and LVIS-Det (minival). When comparing GLIP [1] and CLOC, we observe that CLOC consistently achieves better results than GLIP across all backbone categories (T / B / L), suggesting that CLOC offers advantages in localization and object detection performance. Notably, GLIP employs DyHead—a strong decoder/head module—on top of the encoder, whereas our ablation study uses only two simple heads for classification and regression. This further supports that the encoder representation in CLOC is indeed superior. See the table below for detailed results.

| Model | ViT | COCO-Det (minival) | ODinW (test) | LVIS-Det (minival) |
| --- | --- | --- | --- | --- |
| GLIP-T | ViT-T/16 | 46.6 | 46.5 | 26.0 |
| GLIP-L | ViT-L/14 | 49.8 | 52.1 | 37.3 |
| CLOC-B | ViT-B/16 | 47.3 | 48.4 | 29.6 |
| CLOC-L | ViT-L/14 | 50.8 | 53.6 | 38.1 |

[1] Grounded Language-Image Pre-training, CVPR 2022.

Reviewer Comment

Thanks for your response. I will improve my rating.

Review (Rating: 3)

This work introduces a dynamic attention mechanism, inspired by SAM, to aggregate regional image features and perform contrastive learning at both the image-text and region-text levels. The approach is novel in the context of visual pretraining.

Questions for the Authors

Will you release the pretrained model (including the Prompter) to the public?

Claims and Evidence

The experiments (Table 2, 3) demonstrate strong results in region-aware visual pretraining compared to CLIP. However, it remains unclear whether the performance gains stem from the use of more or cleaner data or from the pretraining schema itself.

Methods and Evaluation Criteria

The methods seem reasonable and elegant to me. However, the evaluation should be more comprehensive, including comparisons with other location-aware pretraining methods (e.g., LocCa, RegionCLIP) on the same benchmarks, such as RefCOCOs.

Theoretical Claims

I checked, please see "Other Strengths And Weaknesses"

Experimental Design and Analysis

Can this work be seamlessly extended to zero-shot object detection?

Supplementary Material

Yes, I scanned through the dataset.

Relation to Prior Literature

Visual encoder pretraining

Essential References Not Discussed

The coverage is good, but training efficiency needs more discussion.

Other Strengths and Weaknesses

Weakness:

  1. A broader comparison with additional methods, as mentioned above.
  2. More discussion and comparisons on training efficiency (e.g., with CLIP, LocCa) would be valuable.
  3. The general issue of attribute binding, inherited from CLIP, is not well addressed.

Minor:

  1. For Eq. 2, please pay attention to the superscripts in the formula, especially since m' is not defined.
  2. For Eq. 4, please clearly define L_{CLOC}.

Other Comments or Suggestions

See above

Author Response

We thank you for the positive review and constructive comments.

Unclear whether the performance gains stem from the use of more or cleaner data or from the pretraining schema itself.

In Table 2, we provide detailed ablations of our proposed ingredients on top of the CLIP model we trained ourselves, including the prompter design and the training labels generated by our pipeline. Note that the CLIP model (row 2) compared in the experiments was trained on the same image data as CLOC for a fair comparison.

Can this work be seamlessly extended to zero-shot object detection?

Yes, CLOC can be extended to zero-shot object detection. We provide additional evaluation results on COCO Detection (minival), ODinW (test-dev), and LVIS-Det (minival). When comparing GLIP [1] and CLOC, we observe that CLOC consistently achieves better results than GLIP across all backbone categories (T / B / L), suggesting that CLOC offers advantages in localization and object detection performance. Notably, GLIP employs DyHead—a strong decoder/head module—on top of the encoder, whereas our ablation study uses only two simple heads for classification and regression. This further supports that the encoder representation in CLOC is indeed superior. See the table below for detailed results.

| Model | ViT | COCO-Det (minival) | ODinW (test) | LVIS-Det (minival) |
| --- | --- | --- | --- | --- |
| GLIP-T | ViT-T/16 | 46.6 | 46.5 | 26.0 |
| GLIP-L | ViT-L/14 | 49.8 | 52.1 | 37.3 |
| CLOC-B | ViT-B/16 | 47.3 | 48.4 | 29.6 |
| CLOC-L | ViT-L/14 | 50.8 | 53.6 | 38.1 |
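
The "two simple heads" mentioned above could look roughly like the following probe on top of frozen region embeddings. This is our assumption of the setup, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TwoSimpleHeads(nn.Module):
    """Rough sketch of a two-head detection probe: open-vocabulary classification
    via text-embedding similarity plus box regression (assumed setup)."""
    def __init__(self, dim: int):
        super().__init__()
        self.cls_proj = nn.Linear(dim, dim)                 # classification head
        self.box_head = nn.Sequential(nn.Linear(dim, dim),  # regression head
                                      nn.ReLU(),
                                      nn.Linear(dim, 4))

    def forward(self, region_feats: torch.Tensor, class_text_emb: torch.Tensor):
        # region_feats:   (N, D) region embeddings from the frozen encoder
        # class_text_emb: (C, D) class-name embeddings from the text encoder
        logits = self.cls_proj(region_feats) @ class_text_emb.t()  # (N, C) class scores
        boxes = self.box_head(region_feats)                        # (N, 4) box regression
        return logits, boxes
```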

A broader comparison with additional methods

Thank you for the suggestion. We agree that a broader comparison with more encoders would be valuable. However, many previous models (LocCa, RegionCLIP) are trained with quite different data, labels, training budgets, and architectures, which makes it hard to draw a fair direct comparison, and some of them are not open-sourced. Therefore, we limited the scope of our paper to the CLIP method and carefully ablated it under settings (e.g., training images, number of steps) that match as closely as possible.

More discussion and comparisons on training efficiency (e.g., with CLIP, LocCa)

We provide discussion in the “Training cost” paragraph (L789 left). Compared to CLIP, the extra cost is small for the object-level contrastive loss and the prompter — we observe about 10% more GPU time. Notably, the lightweight prompter operates on the image embedding that is shared across all the prompts within an image. The main overhead is to compute the image embedding through the ViT, which does not scale with the number of prompts. Compared to LocCa, our CLOC is much more lightweight since LocCa needs a full encoder-decoder transformer for autoregressive next-token prediction.
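A minimal sketch of why the per-prompt cost stays small, assuming a single SAM-style cross-attention layer for the prompter. The names and exact layout below are our guesses, not the paper's code.

```python
import torch
import torch.nn as nn

class PrompterSketch(nn.Module):
    """Sketch of a box-prompted pooling layer (assumed design): the image tokens
    are produced once by the ViT and reused for every prompt, so only this small
    cross-attention runs per prompt."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)                      # embed (x1, y1, x2, y2)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor, boxes: torch.Tensor):
        # image_tokens: (B, L, D) shared across all prompts of the same image
        # boxes:        (B, K, 4) normalized box prompts
        queries = self.box_embed(boxes)                         # (B, K, D) prompt queries
        region_emb, _ = self.attn(queries, image_tokens, image_tokens)
        return region_emb                                       # (B, K, D) region embeddings
```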

Attribute binding issue of CLIP

Indeed, in this paper we do not directly address the attribute-binding shortcoming of the original CLIP. However, with our promptable embedding design (Figure 2 and Section 3.2), we believe CLOC offers an alternative: users can interact with our encoder to obtain a fine-grained embedding for a prompt of interest by specifying a box location or an object description.

Minor fixes for Eq2 and Eq4

Thanks for pointing them out. We will revise them accordingly.

Will you release the pre-trained model (including the Prompter) to the public?

Yes, we aim to release the pre-trained model, and are actively working on that.

[1] Grounded Language-Image Pre-training, CVPR 2022.

Review (Rating: 5)

The submission introduces a new pre-training method called Contrastive Localized Language-Image Pre-training (CLOC). The pre-training method extends CLIP pre-training with additional losses based on the outputs of a new "Prompter" module. This new module consists of a light-weight transformer layer that enhances CLIP image embeddings for regional losses (similarity to bounding box, and grounding of region description).

For training CLOC, the paper also introduces a new captioning pipeline termed Visually-Enriched and Spatially-Localized (VESL). This pipeline first generates detailed image captions and then uses a text-conditioned zero-shot detection model to generate bounding boxes for sub-queries of the caption generated by named-entity recognition.

The paper then compares a CLOC model with a CLIP baseline that is trained on the same data using the same hyperparameters but without the CLOC losses. The performance is compared on Ferret-Bench, RefLVIS, RefCOCO, and various VQA benchmarks.

Questions for the Authors

None.

Claims and Evidence

The paper claims that CLOC outperforms traditional CLIP on referring and grounding tasks. This claim is supported by Tables 3 and 4.

The paper also claims that CLOC unlocks zero-shot region-level capabilities. This claim is supported by Table 2.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are valid and make sense to evaluate the proposed method.

Theoretical Claims

There are no theoretical claims in the paper.

Experimental Design and Analysis

The experimental designs are appropriate to confirm the two claims mentioned above.

The authors first introduce their own reproduction of CLIP and compare it to the original OpenAI CLIP. Then they ablate various design decisions of their proposed method (CLOC / VESL) and report zero-shot performance on various image and region tasks.

The baseline CLIP and the improved CLOC models are then compared in different benchmarks to confirm the claims about improved capabilities with respect to referring and grounding tasks (Ferret-Bench, RefLVIS, RefCOCO, Flickr), and also improvements on some image-level multimodal benchmarks (Table 5).

Supplementary Material

I have read the supplementary material. I recommend at least reading Section D, which answered a number of questions I had after reading the paper.

Relation to Prior Literature

The pre-training recipe (CLOC) is mainly anchored on the original CLIP paper (Radford, 2021). This is done on the context of MLLMs that use CLIP as a vision encoder, and here the paper refers to (Tong, 2024). The idea of the light-weight Prompter module is introduced by referring to (Kirillov, 2023). The data annotation pipeline that is used to train the CLOC model is discussed in the context of previous works such as (Minderer, 2024) and (Kirillov, 2023).

Essential References Not Discussed

I think all essential references are mentioned.

Other Strengths and Weaknesses

The paper is well written and illustrated. The formulation of the added module and losses is very clear, and the paper does a great job of walking the reader through the process. I also enjoyed how the ablations are first presented and then discussed by comparing individual rows in the main text.

The main weakness of the work is the superficial ablation of the VESL captioning pipeline introduced. I do not find the selected examples in Figure A very convincing, and it's easy to imagine that captioning hallucinations might create problems (although they might be filtered by the object detection model), and that the strict named-entity recognition loses a lot of the interesting information (note that the other figures are a bit misleading in this respect, e.g. a description like "a stunning ocean view" would never be extracted by NER). From the numbers presented in Table 2, I'm a bit puzzled to see such different effects of rows 3-5 vs. 13-15.
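
To illustrate the point about strict NER, compare named-entity recognition with simple noun chunking in an off-the-shelf spaCy pipeline. This is illustrative only; the paper's actual extraction step may differ, and the printed outputs are what such models typically produce rather than guaranteed results.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A stunning ocean view at sunset with a red sailboat near the pier.")

# Strict NER typically finds no named entities in a purely descriptive sentence:
print([ent.text for ent in doc.ents])
# Noun chunking typically keeps the descriptive phrases, e.g.
# ['A stunning ocean view', 'sunset', 'a red sailboat', 'the pier']:
print([chunk.text for chunk in doc.noun_chunks])
```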

Other Comments or Suggestions

Missing clarity:

  1. lines 260-262 (right column): are the pairs ignored or are the gradients on f_T ignored?

Typos:

  1. line 117: "Another less and arguably more"
  2. line 124: "and are more computation overhead"
  3. line 276: "annotates it"
  4. line 294 (right column): "We implement the in JAX"
  5. line 429 (right column): "in the foresee of"

Various remarks:

  1. "CLIP has become arguably the default choice of vision backbone for multimodal large language models (MLLMs) (Liu et al., 2023; McKinzie et al., 2024) due to its superior prior knowledge in aligning vision and language (Tong et al., 2024)." – Looking at (Tong, 2024) Section D, it's not clear to me how this reference would give evidence to the claim that CLIP is the default choice due to its superior prior knowledge (e.g. vs. SigLIP). Consider backing up this claim more clearly, or re-formulating (both in "Introduction" and "Related Work" sections).

Ethics Review Concerns

No concerns.

Author Response

We thank you for the positive review and constructive comments.

The main weakness of the work is the superficial ablation of the VESL captioning pipeline introduced.

Thank you for your constructive comments; we will consider better examples in our figures. We agree with the reviewer that image captioning could suffer from hallucination, though it enriches the visual description. The reviewer is also correct that our pipeline filters such cases by relying on the object detector, which is quite effective as it was pre-trained on thousands of common objects. A central design challenge for our pipeline is balancing the benefits of richer captions against the potential risk of hallucination. We believe our design is robust, considering the effectiveness of current state-of-the-art open-vocabulary object detectors and the inherent resilience of contrastive learning objectives to noisy text annotations in large-scale training.

In Table 2 (rows 3-5 vs. 13-15), we compare against the baseline that uses AltText instead of the image captioner (e.g., left of Figure 3). It can be hard to extract useful object phrases from AltText for the open-vocab detector.

Regarding the hallucination issue, we have evaluated our captioning pipeline against other recent works. We compare it with public models (LLaVA-1.5, Shikra, MiniGPT-4, and InstructBLIP) using CHAIR scores [1], where lower values indicate less hallucination. CHAIR_i measures the fraction of hallucinated object instances, and CHAIR_s measures the fraction of sentences containing at least one hallucinated object. The results are summarized below:

| Captioner | CHAIR_i | CHAIR_s |
| --- | --- | --- |
| InstructBLIP | 14.5 | 30.0 |
| MiniGPT-4 | 8.2 | 24.2 |
| Shikra | 7.0 | 22.0 |
| LLaVA-1.5 | 6.2 | 20.6 |
| Ours | 5.9 | 19.6 |

While the long captions from our pipeline may still unavoidably contain some hallucinations, even though it hallucinates less than the other models above, this can be further mitigated because we only keep confident objects agreed upon by the detector (L296 left). We also remove very generic words and stopwords, as noted in the code in Listing 1 in the appendix. We believe more accurate object labels are the key to the improvements of our pipeline in Table 2, as evidenced by the 11.6 regions per image identified by our pipeline (vs. only 5.1 for the baseline) reported in Table 1 (L275 right).
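
For reference, the CHAIR scores above can be computed roughly as follows. This is only a sketch of the metric from Rohrbach et al. [1]; extracting the mentioned objects and ground-truth object sets is assumed to happen beforehand.

```python
def chair_scores(mentioned_objects, gt_objects):
    """Sketch of CHAIR: CHAIR_i is the fraction of hallucinated object mentions,
    CHAIR_s is the fraction of captions with at least one hallucinated object.
    mentioned_objects: list of sets of objects mentioned in each caption
    gt_objects:        list of sets of objects actually present in each image"""
    hallucinated, total_mentions, captions_with_halluc = 0, 0, 0
    for mentioned, present in zip(mentioned_objects, gt_objects):
        bad = {obj for obj in mentioned if obj not in present}
        hallucinated += len(bad)
        total_mentions += len(mentioned)
        captions_with_halluc += int(bool(bad))
    chair_i = hallucinated / max(total_mentions, 1)          # instance-level
    chair_s = captions_with_halluc / max(len(mentioned_objects), 1)  # sentence-level
    return chair_i, chair_s
```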

lines 260-262 (right column): are the pairs ignored or are the gradients on f_T ignored?

For filtering region-text conflicts in Section 3.4, the region-text pairs are ignored in Equation 2. That is, these elements are “masked” and will not be considered in the contrastive loss matrix. We will make this clearer in the final version.
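
A minimal sketch of such masking, under the assumption of a standard symmetric InfoNCE over matched region-text pairs. This is our reading of the response, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_region_contrastive_loss(region_emb, text_emb, valid, temperature=0.07):
    """Conflicting region-text pairs are dropped from the similarity matrix so
    they never act as negatives. Assumes the diagonal (matched pairs) stays valid.
    region_emb, text_emb: (N, D) L2-normalized embeddings of matched pairs
    valid:                (N, N) bool mask, False for conflicting pairs"""
    logits = (region_emb @ text_emb.t()) / temperature
    logits = logits.masked_fill(~valid, float("-inf"))   # mask conflicting entries
    targets = torch.arange(region_emb.size(0), device=region_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```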

Remarks on CLIP

We apologize for any confusion regarding this statement. Our reference to “CLIP” in the context of MLLMs was intended to denote the broader family of language-supervised methods, including both CLIP (as a representative model) and SigLIP. Specifically, we were citing the first row block in Table 12 of Section D (Tong, 2024), which demonstrates that these methods outperform others, such as self-supervised approaches, in MLLMs. We will make this clearer and revise both the introduction and related works as you suggested.

Typos

Thank you for pointing out the typos, and we have fixed them in the revised manuscript.

[1] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T. and Saenko, K., 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156.

Review (Rating: 4)

This paper proposes Contrastive Localized Language-Image Pre-training (CLOC), an approach extending CLIP-style image-text contrastive learning to also incorporate region-level alignment. The authors introduce a lightweight "Prompter" module that can transform global image embeddings into region-aware representations given bounding boxes. They further design a large-scale pseudo-labeling pipeline (VESL) to generate region-text annotations, resulting in a dataset of 2B images. Through extensive experiments across classification, retrieval, and multimodal reasoning tasks, the method demonstrates solid improvements over standard CLIP, particularly in fine-grained vision-language scenarios such as referring expression comprehension and region-based VQA.

Questions for the Authors

  • The average text length per caption is 2.1, which is much shorter than that of RefCOCOg or WiT. Could this affect the expressive power of the image embedding?
  • Is it possible to include some direct comparison of CLOC to dedicated open-vocabulary detection models (e.g., in terms of box AP metrics)?
  • Can you provide more detail on how the bounding boxes are sampled in the training?
  • Did the authors consider other designs of the prompter?

Claims and Evidence

  • Claim: CLOC enhances fine-grained visual understanding in downstream tasks that require identifying or referring to specific image regions.

    Evidence: Experiments on region-level classification and retrieval benchmarks (e.g., COCO, GRIT) and on MLLM tasks (Ferret, LLaVA) show that CLOC consistently outperforms the CLIP baseline in tasks needing spatial grounding.

Methods and Evaluation Criteria

Yes. The proposed methods make sense.

Theoretical Claims

No theoretical claims are made.

Experimental Design and Analysis

Yes. I have checked the experimental designs.

Supplementary Material

Yes. I have read all the supplementary material.

Relation to Prior Literature

This work is an extension of CLIP that incorporates region-level alignment.

Essential References Not Discussed

The author should discuss works such as [*1].

[*1] Wan, Bo, et al. "Locca: Visual pretraining with location-aware captioners." Advances in Neural Information Processing Systems 37 (2024): 116355-116387.

Other Strengths and Weaknesses

Strengths

  • The prompter module is lightweight, adding only minimal overhead compared to the baseline CLIP.
  • The performance looks promising: the tables and figures show consistent improvements on fine-grained tasks, region-level retrieval, and large multimodal model reasoning.
  • The data creation pipeline is scalable and might make a great contribution to the community.

Weaknesses

  • The quality and diversity of region-text pairs depend heavily on the open-vocabulary detector and captioning pipeline—if these pipelines introduce bias or errors, the final model inherits them.

Other Comments or Suggestions

  • It would be interesting to see if text-based region prompts (e.g., referencing “the person on the left”) work well out of the box at inference without bounding boxes.

Author Response

We thank you for the positive review and constructive comments.

The quality and diversity of region-text pairs depend heavily on the open-vocabulary detector and captioning pipeline.

Thank you for pointing out this important aspect. We agree that the quality of the open-vocabulary detector and captioner is important. Our pipeline is built upon recent advances in these models (L275, right). We think the proposed framework is promising since, as better detectors and captioners are introduced, the pipeline can readily benefit from their improvements, such as reduced bias and errors. We briefly discussed this in L782 (left column) of the Appendix, and will further emphasize the pipeline's dependency on these pseudo-labeling models in the final version.

To further address the reviewer’s concern, we use the hallucination metric (CHAIR score, where lower is better) to assess the quality of the synthetic captions. As shown below, our captioner demonstrates high quality compared to other models.

| Captioner | CHAIR_i | CHAIR_s |
| --- | --- | --- |
| InstructBLIP | 14.5 | 30.0 |
| MiniGPT-4 | 8.2 | 24.2 |
| Shikra | 7.0 | 22.0 |
| LLaVA-1.5 | 6.2 | 20.6 |
| Ours | 5.9 | 19.6 |

Text-based region prompts

In our experiments, we observed a reasonably low L1 distance of 0.02 between the predicted boxes and ground-truth boxes when the Prompter receives a text region description as input, indicating that text-based region prompts performed well out of the box. In the revision, we will include qualitative visualization examples to illustrate this. A more in-depth investigation is left for future work.
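
A figure like the 0.02 L1 distance above could be computed roughly as follows, assuming a per-coordinate mean absolute error on boxes normalized to [0, 1]. This is our reading of the metric, not the authors' exact protocol.

```python
def mean_box_l1(pred_boxes, gt_boxes):
    """Mean per-coordinate L1 distance between predicted and ground-truth boxes,
    with coordinates assumed to be normalized to [0, 1] (xyxy format)."""
    total, n = 0.0, 0
    for pred, gt in zip(pred_boxes, gt_boxes):
        total += sum(abs(p - g) for p, g in zip(pred, gt)) / 4.0
        n += 1
    return total / max(n, 1)
```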

Region caption is short

The reviewer is correct that the region-level captions are much shorter than the image-level captions. However, we want to clarify that in Equation 4, we still retain the original image-level CLIP loss, ensuring that the image embedding quality remains on par with the original CLIP, as evidenced by the “Image tasks” results in Table 2.
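
For readers cross-referencing the responses, a plausible shape of this combined objective is sketched below. It is a reconstruction from the discussion here, not the paper's exact Eq. 2/4 notation; r_i and t_i denote a prompted region embedding and its region-caption embedding, τ a temperature, and λ a weighting hyperparameter.

```latex
% Plausible form of the combined objective (reconstruction, not the paper's notation):
% the image-level CLIP loss is retained and a region-level contrastive term is added.
\mathcal{L}_{\mathrm{CLOC}}
  = \mathcal{L}_{\mathrm{CLIP}}
  + \lambda \, \mathcal{L}_{\mathrm{region}},
\qquad
\mathcal{L}_{\mathrm{region}}
  = -\frac{1}{N} \sum_{i=1}^{N}
      \log \frac{\exp\!\left(\langle r_i, t_i \rangle / \tau\right)}
                {\sum_{j=1}^{N} \exp\!\left(\langle r_i, t_j \rangle / \tau\right)}
```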

Comparison to open-vocab detection models

First, in our paper, we included a comparison in Footnote 1 (L379, left), demonstrating that on the region classification task (predicting class names given a bounding box), our approach achieves over 70% mAcc on COCO, significantly outperforming the 47% reported in previous work.

For comparison with open-vocabulary detection models, we also provide zero-shot evaluation results on COCO Detection (minival), ODinW (test-dev), and LVIS-Det (minival). When comparing GLIP [1] and CLOC, we observe that CLOC consistently achieves better results than GLIP across all backbone categories (T / B / L), suggesting that CLOC offers advantages in localization and object detection performance. Notably, GLIP employs DyHead—a strong decoder/head module—on top of the encoder, whereas our ablation study uses only two simple heads for classification and regression. This further supports that the encoder representation in CLOC is indeed superior. See the table below for detailed results.

| Model | ViT | COCO-Det (minival) | ODinW (test) | LVIS-Det (minival) |
| --- | --- | --- | --- | --- |
| GLIP-T | ViT-T/16 | 46.6 | 46.5 | 26.0 |
| GLIP-L | ViT-L/14 | 49.8 | 52.1 | 37.3 |
| CLOC-B | ViT-B/16 | 47.3 | 48.4 | 29.6 |
| CLOC-L | ViT-L/14 | 50.8 | 53.6 | 38.1 |

How bounding boxes are sampled during training

During training, we simply sample 4 boxes per image at random (padded if fewer are available).
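
A minimal sketch of such sampling, assuming normalized xyxy boxes and a validity mask for padded slots; the details are our assumptions, not the paper's code.

```python
import random

def sample_region_prompts(boxes, k=4, pad_box=(0.0, 0.0, 0.0, 0.0)):
    """Pick k boxes at random per image, padding with dummy boxes and a validity
    mask when fewer than k annotated regions are available."""
    sampled = random.sample(list(boxes), k) if len(boxes) >= k else list(boxes)
    valid = [1] * len(sampled)
    while len(sampled) < k:
        sampled.append(pad_box)   # padded slots can be masked out of the loss
        valid.append(0)
    return sampled, valid
```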

Designs of the prompter

In our experiments, we mainly consider a prompter that takes a bounding box or a single text embedding as the prompt. We compared it with a baseline RoIAlign implementation in Table 2 (rows 4, 9, 14) and confirmed that the proposed prompter is the better design (discussed in L360 right and Section 3.4). For other designs, we consider the following promising future work: (1) different types of prompts, such as points or masks; (2) multi-prompt or compositional prompts for higher-level prompting. We include more discussion in the L770 "Future directions" paragraph.
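
For comparison, the RoIAlign baseline could be implemented roughly as follows with torchvision. This is a sketch of our understanding of the rows 4/9/14 baseline, not the authors' code.

```python
import torch
from torchvision.ops import roi_align

def roi_align_region_embed(feature_map, boxes_per_image, output_size=1):
    """Pool the spatial feature map inside each box instead of attending over
    image tokens with a prompter.
    feature_map:     (B, D, H, W) spatial features from the image encoder
    boxes_per_image: list of B tensors of shape (K_i, 4) in feature-map coords"""
    pooled = roi_align(feature_map, boxes_per_image, output_size)  # (sum K_i, D, s, s)
    return pooled.flatten(2).mean(-1)                              # (sum K_i, D)
```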

LocCa reference

We thank the reviewer for this suggestion. LocCa is indeed a relevant work which we have cited and discussed in Section 2 (L100, right). However, LocCa differs significantly from our approach in two important ways: (1) it employs a full encoder-decoder transformer architecture, thus being substantially less efficient, especially for large-scale training; (2) LocCa embeddings do not directly facilitate zero-shot retrieval or classification tasks as our embeddings do (Table 2). The focus of our method remains specifically on improving CLIP-based localization capabilities, and we will further clarify this distinction in our final revision.

[1] Grounded Language-Image Pre-training, CVPR 2022.

Reviewer Comment

Thanks for the rebuttal. Most of my concerns are solved. I will update my rating.

Final Decision

This paper proposes Contrastive Localized Language-Image Pre-training (CLOC), a CLIP-like pretraining method that enhances region-level vision-language alignment using a new lightweight Prompter module and a large-scale region-text annotation pipeline. It improves fine-grained recognition and retrieval tasks, such as referring and grounding tasks, outperforming standard CLIP.

After the author-reviewer discussion, the reviewers reached a positive consensus (1 strong accept, 2 accept, 1 weak accept). Their positive comments are mostly based on the strong and solid experimental results.

I also agree with the reviewers. Overall, I recommend "Accept" for this paper.