PaperHub
5.5 / 10
Poster · 4 reviewers
Ratings: 3, 3, 4, 2 (min 2, max 4, std 0.7)
ICML 2025

FG-CLIP: Fine-Grained Visual and Textual Alignment

Submitted: 2025-01-14 · Updated: 2025-07-24


Keywords
Vision-Language Model · Contrastive Learning · Image-Text Dataset

Reviews and Discussion

Review (Rating: 3)

This paper proposes FG-CLIP, a region-level contrastive learning model for fine-grained image representation. Through training on large-scale synthetic data, the model achieves strong performance compared to previous methods on fine-grained region-level tasks such as fine-grained understanding, OVD, image-text retrieval, and image classification, serving as a promising vision foundation model compared to CLIP.

Questions For Authors

  1. Alpha-CLIP should be considered for comparison.

  2. I personally believe rewritten captions from LVLMs like CogVLM and detection through YOLO-world will restrict the diversity of the data, especially since the model is also benchmarked on data that is either synthesized (Share4V) or not fine-grained enough (COCO, ImageNet, RefCOCO...). Thus I believe:

  • A quantitative analysis of data diversity should be provided.
  • More fine-grained datasets such as LVIS and V3Det should be considered to support the claims of the paper.
  3. If this paper wants to claim that FG-CLIP serves as a better visual encoder for LVLMs, testing merely on GQA and POPE is not enough. Results on more LVLM benchmarks should be included.

Claims And Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods And Evaluation Criteria

The evaluation criteria generally make sense, but more fine-grained datasets such as OV-LVIS and V3Det should be considered.

Theoretical Claims

The theoretical claims are correct, as they mostly come from a previous method (FineCLIP).

Experimental Designs Or Analyses

The experimental designs are valid.

Supplementary Material

Yes

Relation To Broader Scientific Literature

I believe the technical novelty is somewhat limited, as the region-level contrastive learning and hard negative samples are direct adoptions of existing methods. The main contribution is a large-scale synthetic region-text paired dataset.

Essential References Not Discussed

The key contributions of the model and loss come from FineCLIP, published at NeurIPS 2024. The authors should discuss this directly in the methods section.

Experimental results on region-level classification should include Alpha-CLIP (CVPR 2024) for comparison.

Other Strengths And Weaknesses

Major strength: Strong performance is achieved compared to previous methods.

Major weakness: The methods of regional contrastive learning and hard negative sampling are direct adoptions from FineCLIP and ALBEF, which makes the model contribution marginal. The major contribution is the dataset. As the dataset is also collected and synthesized without new technical contributions, the novelty of this paper is somewhat limited.

Other Comments Or Suggestions

No

Author Response

1. Response to Question 1 in Essential References Not Discussed

Thanks for your comments. We discuss the differences between FG-CLIP and FineCLIP and provide experimental results in the first response to Reviewer 6zz8; we refer you to that response for more details.

2. Response to Question 2 in Essential References Not Discussed and Question 1 in Questions For Authors

Thanks for your suggestion to include a comparison with Alpha-CLIP on region-level classification. We test Alpha-CLIP using its "clip_b16_grit1m+mim_fultune_4xe" weights across three testing configurations: Alpha Map, RoIAlign, and Crop&Resize. The Alpha Map method creates a mask from the label box and combines this mask with the entire image, which aligns with Alpha-CLIP's training objective. The RoIAlign method is the testing approach used in our FG-CLIP. The Crop&Resize method crops the image region corresponding to the box and resizes it to the resolution required by the ViT. The table at https://anonymous.4open.science/r/ICML_RE-3CF6/compare_with_alphaclip.md shows that FG-CLIP consistently outperforms Alpha-CLIP across all configurations. We will add these results to our manuscript.
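To make the Crop&Resize configuration concrete, here is a minimal sketch of that evaluation protocol using the standard Hugging Face CLIP API as a stand-in; the model name, box format, and prompt template are illustrative assumptions, not Alpha-CLIP's or FG-CLIP's actual interface.

```python
# A minimal sketch of Crop&Resize region classification with a generic CLIP model.
# Model name, box format, and prompts are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def classify_region(image: Image.Image, box, class_names):
    """Crop the labeled box, resize to the ViT input size, and score it
    against text prompts for each candidate class."""
    x1, y1, x2, y2 = box                       # box in pixel coordinates
    region = image.crop((x1, y1, x2, y2))      # Crop step
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=region,
                       return_tensors="pt", padding=True)  # Resize handled by the processor
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]        # similarity to each class prompt
    return class_names[int(logits.argmax())]
```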

3. Response to Major weakness

Thank you for your insightful comment regarding the methods of regional contrastive learning and hard negative sampling, as well as the novelty and contribution of our dataset.

3.1 Regional Contrastive Learning and Hard Negative Sampling

We have discussed the differences between FG-CLIP and FineCLIP in an earlier response. Here, we highlight the distinctions between the hard negative sampling of FG-CLIP and that of ALBEF.

ALBEF proposes a strategy to sample hard negatives for its ITM task. Specifically, ALBEF uses softmax-normalized image-to-text and text-to-image similarities to find in-batch negatives, selecting one negative text/image per mini-batch. In contrast, FG-CLIP uses a novel pipeline to create challenging fine-grained negative samples. As described in lines 236 to 251 of the right column of the manuscript, we modify the attributes of bounding box descriptions while keeping the object names unchanged, generating 10 negative samples for each positive sample. This process produces subtle variations where objects may appear similar but differ in specific details.
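As a minimal illustration of this attribute-swapping idea, the sketch below perturbs attribute words while leaving object nouns untouched; the attribute vocabulary and helper names are assumptions, and the actual pipeline may rely on a generation model rather than a fixed word list.

```python
# A rule-based sketch of creating "hard" fine-grained negatives: swap attribute
# words (e.g., colors, materials) while keeping the object noun intact.
# The attribute vocabulary and helper names are illustrative assumptions.
import random

ATTRIBUTE_POOL = {
    "color": ["red", "blue", "green", "black", "white", "yellow"],
    "material": ["wooden", "metal", "plastic", "leather"],
}

def make_hard_negatives(region_caption: str, num_negatives: int = 10):
    """Return captions that differ from the input only in attribute words."""
    words = region_caption.split()
    attr_positions = [
        (i, kind)
        for i, w in enumerate(words)
        for kind, vocab in ATTRIBUTE_POOL.items()
        if w.lower() in vocab
    ]
    negatives, attempts = [], 0
    while len(negatives) < num_negatives and attr_positions and attempts < 100:
        attempts += 1
        i, kind = random.choice(attr_positions)
        candidates = [a for a in ATTRIBUTE_POOL[kind] if a != words[i].lower()]
        new_words = list(words)
        new_words[i] = random.choice(candidates)   # object noun stays unchanged
        negative = " ".join(new_words)
        if negative not in negatives:
            negatives.append(negative)
    return negatives

# e.g. make_hard_negatives("a man in red clothes") -> ["a man in blue clothes", ...]
```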

3.2 Contribution of the Dataset

Our work leverages two high-quality datasets. In the first stage, we utilize an extensive dataset of 1.6 billion long caption-image pairs to capture global-level semantic details. In the second stage, we employ a carefully curated dataset of 12 million images with 40 million corresponding bounding boxes and captions, specifically designed to provide fine-grained annotations. We also generate 10 million challenging fine-grained negative samples, improving the model's ability to distinguish subtle differences. These datasets enable the model to achieve superior performance on various benchmarks. In summary, we have validated the entire synthesis pipeline used to create these datasets, including innovative techniques such as generating challenging fine-grained negative samples, which can serve as a reference for others. We plan to make this dataset public to support further research in visual grounding and fine-grained understanding.

4. Response to Question 2 in Questions For Authors

To quantitatively assess the diversity of our dataset, we compare it with other fine-grained datasets such as LVIS and V3Det. We extract and aggregate category labels from captions generated through steps involving CogVLM and YOLO-world. The following table compares the number of images and unique category labels across datasets. Notably, even when sampling an equivalent number of images (243k), our dataset yields more unique category labels than V3Det, indicating higher diversity. We visualize the category labels of the sampled data (equivalent to 243k images) using t-SNE plots at https://anonymous.4open.science/r/ICML_RE-3CF6/data_tsne_pic.png. The visualization also shows that our dataset has a more diverse set of category labels at the same image scale. As the dataset scales up to 12M images, the diversity in category labels and captions increases significantly.

| Dataset | Images | Captions | Category Labels |
| --- | --- | --- | --- |
| LVIS | 164k | 1.27M | 1.2k |
| V3Det | 243k | 1.75M | 13k |
| Ours (sampling 243k) | 243k | 815k | 21k |
| Ours | 12M | 40M | 128k |

To support the claims of our paper, we evaluate our model on LVIS. The results in the following table show that FG-CLIP achieves state-of-the-art performance.

| Method | Top-1 | Top-5 |
| --- | --- | --- |
| CLIP | 24.79 | 46.63 |
| EVA | 14.36 | 29.11 |
| FineCLIP | 23.29 | 44.17 |
| FG-CLIP | 28.55 | 52.60 |

5. Response to Question 3 in Questions For Authors

We conduct experiments on other LVLM benchmarks; the results are available at https://anonymous.4open.science/r/ICML_RE-3CF6/add_result_mm_compare.md. They show that LLaVA with FG-CLIP achieves better performance.

Reviewer Comment

Thank you for your detailed response. Although the technical novelty of the paper is somewhat limited, the thorough experiments presented in the rebuttal are quite impressive and have addressed most of my concerns. I would like to raise my overall recommendation to a 3 and suggest that the detailed results and discussions from the rebuttal be incorporated into the revised paper.

Author Comment

We are very glad to have addressed most of the concerns and are deeply thankful for your acknowledgment of our rebuttal. This improvement is largely due to the valuable suggestions from you and other reviewers. We will include the content from the rebuttal in the final version.

Review (Rating: 3)

Based on the observation that CLIP struggles with fine-grained understanding tasks, this paper proposes:

  1. a larger dataset including abundant images, bounding boxes, and captions;
  2. the incorporation of long captions, short captions, and hard negative strategies to enhance CLIP's ability during training.

The extensive experiments demonstrate the effectiveness of the proposed method.

Questions For Authors

Please see above weaknesses and comments.

Claims And Evidence

Yes. The paper claims that the original CLIP model cannot cope well with fine-grained understanding tasks, and this claim is supported by the experimental results in Tables 1 to 6.

Methods And Evaluation Criteria

Yes, the proposed method and dataset make sense for the problem.

Theoretical Claims

This paper doesn't include any theoretical claims.

Experimental Designs Or Analyses

Yes, I have checked the experiment section. The experimental designs are sound for evaluating the fine-grained understanding ability of CLIP models (including the original CLIP, EVA-CLIP, etc., and the proposed FG-CLIP).

Supplementary Material

I have reviewed the supplementary material, including the visual grounding dataset visualization, positive and negative description examples, attention visualizations, and a comparison table on the FG benchmark.

Relation To Broader Scientific Literature

  1. This paper enhances the fine-grained understanding ability of CLIP models.
  2. A larger dataset is contributed to the community.

Essential References Not Discussed

No, all related works are discussed in the paper.

Other Strengths And Weaknesses

Strengths:

  1. A well-written paper to follow.
  2. Effective module design.

Weaknesses:

  1. Limited ablation and discussion of negative sampling. How many hard negative samples are used per image-text pair?
  2. More visualizations of hard negative samples are needed.

Other Comments Or Suggestions

This paper constructs a high-quality visual grounding dataset with 12 million images and corresponding bounding boxes and captions. It would be better to compare this dataset with previous ones such as LAION and COCO in a table.

Author Response

1. Response to Weakness 1

Thanks for pointing out this problem. We perform ablation studies on the number of hard negative samples, testing configurations with 1, 5, and 10 hard negative samples per positive sample. Our experiments show that 10 hard negative samples per image-text pair yield the best performance. We agree that this ablation study adds significant value to our manuscript and will include it in the final version.

| Fine-Grained Understanding | Hard | Medium | Easy |
| --- | --- | --- | --- |
| FG-CLIP (1 hard negative sample) | 44.49 | 65.87 | 67.82 |
| FG-CLIP (5 hard negative samples) | 45.86 | 66.78 | 67.21 |
| FG-CLIP (10 hard negative samples) | 46.40 | 67.15 | 68.59 |

2. Response to Weakness 2

Thank you for your comment regarding the need for more visualizations of hard negative samples. We appreciate the opportunity to provide additional insight into how our method, FG-CLIP, benefits from hard negative sampling. We provide more visualizations at https://anonymous.4open.science/r/ICML_RE-3CF6/fgshow.png. Specifically, we extract the dense image features and visualize the similarity matrix to qualitatively analyze the impact of hard negative sampling. As illustrated in the figures, FG-CLIP captures the relevant regions more accurately after hard negative sampling. For example, in the first row, the phrase "Man in red clothes" is accurately localized with the hard negative loss, whereas without it, the model struggles to capture the correct region.
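A rough sketch of how such a patch-text similarity map can be computed from dense image features is given below; the tensor shapes, grid size, and variable names are assumptions rather than the exact implementation.

```python
# A rough sketch of visualizing patch-text similarity, assuming L2-normalized
# dense patch features from the image tower and a text embedding.
# Shapes and names are illustrative, not the authors' exact code.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def similarity_map(dense_feats: torch.Tensor, text_emb: torch.Tensor, grid: int):
    """dense_feats: [num_patches, dim], text_emb: [dim]; both L2-normalized."""
    sim = dense_feats @ text_emb                              # cosine similarity per patch
    sim = sim.reshape(grid, grid)                             # back to the spatial patch grid
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)  # normalize for display
    return sim

# Example with random features: a 14x14 patch grid (ViT-B/16 at 224x224 input).
feats = F.normalize(torch.randn(196, 512), dim=-1)
text = F.normalize(torch.randn(512), dim=-1)
heatmap = similarity_map(feats, text, grid=14)
plt.imshow(heatmap.numpy(), cmap="jet"); plt.axis("off"); plt.show()
```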

3. Response to questions in Other Comments Or Suggestions

Thank you for your valuable suggestion regarding the comparison of our dataset with previous ones such as LAION, COCO, and others. We agree that a detailed discussion in the form of a table would better highlight the unique strengths and contributions of our dataset.

In addition to our visual grounding dataset of 12 million images with corresponding bounding boxes and captions, we have also incorporated an additional 1.6 billion image-text pairs in the first stage of training. The dataset in the first stage is generated using a large multimodal model to produce higher-quality, fine-grained long captions for capturing global-level semantic details.

We have compared our dataset with several related datasets, as shown in the table below. Overall, our dataset stands out in terms of scale and quality, particularly in its fine-grained annotations and challenging negative samples. Here are the key points of comparison:

  • Scale: Our dataset contains the largest number of images, bounding boxes, and captions among all the datasets except LAION. While LAION has the largest number of captions (2B), the quality of these captions is often noisy and inconsistent. Our supplementary dataset adds 1.6B high-quality, fine-grained long captions, enhancing the overall utility of our data.
  • Bounding Boxes: Among the widely used datasets, only COCO provides bounding box annotations. However, our dataset surpasses COCO by an order of magnitude, with 40M bounding boxes compared to COCO's 1.5M.
  • Hard Fine-Grained Negative Samples: A distinctive feature of our dataset is the inclusion of 10M hard fine-grained negative samples. These samples help the model differentiate subtle differences in semantically similar pairs, thereby improving its performance across various downstream tasks.
| Dataset | Images | Image captions | Bounding boxes | Region captions | Hard fine-grained negative samples |
| --- | --- | --- | --- | --- | --- |
| LAION-2B | 2B | 2B | 0 | 0 | 0 |
| Flickr30k | 30K | 150K | 0 | 0 | 0 |
| CC3M | 3M | 3M | 0 | 0 | 0 |
| COCO | 330K | 330K | 1.5M | 1.5M | 0 |
| Ours in stage 1 | 1.6B | 1.6B+1.6B | 0 | 0 | 0 |
| Ours in stage 2 | 12M | 12M+12M | 40M | 40M | 10M |

Reviewer Comment

My concerns are addressed by the authors' rebuttal. Hence, I recommend acceptance.

Author Comment

We are very glad to have resolved the concerns you raised, and we sincerely appreciate your recommendation to 'accept' our work. We will incorporate the new content based on your suggestions into the latest version of our paper.

Review (Rating: 4)

The proposed method introduces Fine-Grained CLIP (FG-CLIP) for enhancing CLIP's fine-grained understanding capabilities. The authors propose three components to address this challenge: first, they generate 1.6 billion long caption-image pairs for capturing global-level semantic details; second, they construct a high-quality dataset with 12 million images and 40 million region-specific bounding boxes along with detailed captions; third, they introduce 10 million hard fine-grained negative samples to help the model learn to distinguish between subtle semantic differences. The training occurs in two stages: the first stage focuses on global contrastive learning using the long caption-image pairs, and the second stage incorporates regional contrastive learning and hard negative samples. The proposed model extends CLIP with position embeddings that can handle longer text (up to 248 tokens vs. the original 77) and uses RoIAlign to extract region-specific features from images. The paper further showcases results on various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks.
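As a minimal sketch of the RoIAlign step described here, the snippet below pools one feature per box from a ViT patch grid; the patch stride, tensor shapes, and function names are assumptions, not the paper's exact architecture.

```python
# A minimal sketch of pooling region features from a ViT patch grid with RoIAlign.
# The 16-pixel patch stride, shapes, and names are illustrative assumptions.
import torch
from torchvision.ops import roi_align

def region_features(patch_tokens: torch.Tensor, boxes: torch.Tensor, image_size: int = 224):
    """
    patch_tokens: [B, num_patches, dim] dense ViT outputs (CLS token removed).
    boxes: [K, 5] rows of (batch_index, x1, y1, x2, y2) in image pixel coordinates.
    Returns one pooled feature per box: [K, dim].
    """
    B, N, D = patch_tokens.shape
    grid = int(N ** 0.5)                                   # e.g. 14 for 224 / 16
    fmap = patch_tokens.transpose(1, 2).reshape(B, D, grid, grid)
    pooled = roi_align(fmap, boxes, output_size=(1, 1),
                       spatial_scale=grid / image_size,    # pixel coords -> feature-map cells
                       aligned=True)
    return pooled.flatten(1)                               # [K, D]

# Example: 2 images with 196 patches each, 3 boxes total across the batch.
tokens = torch.randn(2, 196, 768)
boxes = torch.tensor([[0, 10., 20., 120., 200.],
                      [0, 50., 60., 90., 100.],
                      [1, 0., 0., 224., 224.]])
feats = region_features(tokens, boxes)   # -> shape [3, 768]
```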

Update after rebuttal

The authors have answered all the concerns mentioned in the weaknesses and experimental design sections. Furthermore, additional visualizations and results on datasets like OpenImages and LVIS strengthen their claims. They also propose a lower-resource training setup and plan to release a distilled model, improving accessibility. I increase my score to accept.

Questions For Authors

  • Is the proposed method better because of the design or just because of more data? Could you test this by training other models, such as FineCLIP, on the newly proposed dataset?
  • Why doesn't regional contrastive learning improve results much on hard and medium fine-grained tasks in Table 6? This seems unexpected.
  • Can you show visual examples of how your hard negative sampling helps tell apart similar items? Some visualizations would make your point clearer.
  • Have you tested your approach on other datasets like OpenImages or NUS-WIDE? This would show your method works broadly.
  • Your training needs a lot of computing power (160×910B NPUs and 8×NVIDIA H800 GPUs). Could researchers with fewer resources still use your approach? Are there more efficient ways to get similar results?

Claims And Evidence

The claims made in the submission are supported by clear and convincing evidence across multiple experiments. The paper's claims regarding improved fine-grained understanding are substantiated in Table 1 and Appendix C.2, while claims about enhanced bounding box classification capabilities are validated in Table 2. The assertions about superior long caption image-text retrieval and open-vocabulary object detection are demonstrated through comprehensive results in Table 3. Furthermore, the claims about FG-CLIP's effectiveness when used as a backbone for Large Multimodal Models for attribute analysis, object localization, and reducing output hallucination are thoroughly supported by Tables 4 and 5. The paper also provides visualization evidence in Appendix C.1, further reinforcing these claims. Overall, most claims presented in the introduction are backed by quantitative and qualitative results, providing a strong empirical foundation for the paper's contributions.

There is one concern, however, further discussed in "Experimental Designs Or Analyses": namely, whether the individual contributions of the new data vs. the new model were substantiated.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for the problem at hand. The authors properly identify limitations in existing CLIP models regarding fine-grained understanding and propose targeted solutions.

Theoretical Claims

There are no theoretical claims in the paper; the contribution is empirical methodology and dataset curation. Equations 1-5 are standard loss functions.

Experimental Designs Or Analyses

  • There is, however, a major question of fairness in comparison, since the current method is trained on data different from that of previous methods. It would be more convincing if prior methods, such as FineCLIP, had been trained on this new dataset to determine whether the improvements stem primarily from the data quality or from the model architecture itself.
  • The paper doesn't clearly show if using more training data is the main reason for better results, especially since FG-CLIP's local and global feature approach is similar to FineCLIP.
  • Testing other recent methods on the proposed dataset would make the comparisons more complete.
  • More examples showing how the model describes different parts of images would better prove its fine-grained understanding.
  • The results in Table 6 raise questions because adding regional contrastive learning doesn't improve performance much on “hard” and “medium” fine-grained tasks, which needs more explanation.

Supplementary Material

I reviewed the supplementary material, which includes examples of the curated visual grounding data (Appendix A), positive and negative descriptions related to image regions (Appendix B), and visualization comparisons (Appendix C).

Relation To Broader Scientific Literature

The work follows previous research on vision-language models such as FineCLIP and RegionCLIP, focusing specifically on region-level representation techniques. Its key contribution relates to region-based reasoning and cross-modal alignment methods previously explored in multimodal learning research.

Essential References Not Discussed

I think all the relevant literature has been discussed and referenced.

Other Strengths And Weaknesses

Strengths:

  • Achieves state-of-the-art performance on multiple downstream tasks, including fine-grained understanding, object detection, and image-text retrieval.
  • Extensive dataset curation with 1.6 billion long caption-image pairs, 40 million region-specific annotations, and 10 million hard negative samples provides a valuable contribution (beyond the model).
  • Comprehensive ablation studies demonstrate the individual contributions of each proposed component (global contrastive learning, regional alignment, and hard negative samples).

Weakness:

  • While Table 6 shows quantitative improvements from hard negative sampling, there is limited qualitative analysis (such as t-SNE visualizations) demonstrating how subtle differences are actually differentiated.
  • The paper shows impressive performance improvements, but it is not entirely clear if these come from the larger dataset or the new model design. While FG-CLIP offers interesting ways of handling local and global features, more detailed studies would help us understand precisely what is driving these performance gains.
  • OpenImages or specialized fine-grained datasets such as NUS-WIDE that contain naturally fine-grained categories would strengthen the evaluation.

Other Comments Or Suggestions

No typos found.

Author Response

We sincerely thank the reviewer for your constructive comments. We address your concerns below.

1. Response to Questions 1-3 in Experimental Designs Or Analyses and Weakness 2

Thank you for your comments regarding the source of the improvement and the fairness of the comparison. The improvement of FG-CLIP stems from both the model architecture and the proposed new dataset. To support this, we describe the differences between FineCLIP and FG-CLIP. Moreover, we train FineCLIP on our proposed dataset to conduct a fair comparison.

1.1 Differences between FineCLIP and our FG-CLIP.

  • A large-scale dataset with diverse captions can enhance FG-CLIP's fine-grained understanding capabilities. Specifically, our FG-CLIP integrates short captions with long captions, while FineCLIP only utilizes short captions during global contrastive learning. Moreover, we use a large-scale dataset (i.e., 1.6B+12M) rather than the small dataset of FineCLIP (i.e., 2.5M).
  • FG-CLIP discards the self-distillation strategy used in FineCLIP, which introduces significant additional computational overhead. For example, when using a single GPU with a batch size of 32 samples, incorporating the self-distillation strategy increases memory usage from 25GB to 75GB during training and reduces FPS from 25 to 9.8. This is mainly due to additional feature extraction for each bounding box in the input images.
  • Hard fine-grained negative sample learning helps FG-CLIP distinguish subtle differences in semantically similar pairs.

1.2 We then train FineCLIP on our proposed dataset. Due to time constraints, we perform this experiment on the 12M dataset instead of the larger 1.6B+12M setup. From the table, the substantial improvements (Row 1 -> Row 2 and Row 2 -> Row 3) highlight that both our proposed dataset and our model architecture contribute significantly to FG-CLIP.

| Method | Data Source | COCO-Box-Top-1 | COCO-Retrieval-I2T | COCO-Retrieval-T2I |
| --- | --- | --- | --- | --- |
| FineCLIP | FineCLIP (CC2.5M) | 50.7 | 54.4 | 40.2 |
| FineCLIP | FG-CLIP (12M) | 53.5 | 59.6 | 46.2 |
| FG-CLIP (Ours) | FG-CLIP (12M) | 56.1 | 65.9 | 47.1 |

2. Response to Question 4 in Experimental Designs Or Analyses

Thank you for this insightful question. The qualitative results in https://anonymous.4open.science/r/ICML_RE-3CF6/partshow_pic.png show that FG-CLIP is able to capture different parts of images that are strongly related to the input text, demonstrating its fine-grained understanding capabilities. We follow the same experimental settings as Appendix C.1.

3. Response to Question 5 in Experimental Designs Or Analyses

Thank you for pointing this out. In the "hard" fine-grained task, only one attribute within the caption is replaced; similarly, two attributes are replaced within the caption in the "medium" task. However, regional contrastive learning primarily enhances region-text alignment in FG-CLIP and, on its own, may struggle to distinguish captions with so few replaced attributes. To this end, we propose hard negative sample learning to further improve the overall performance of FG-CLIP. We will make this clearer in the manuscript.
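For readers who want the mechanics spelled out, a minimal sketch of such a hard-negative objective is shown below: each region is scored against its positive caption and its 10 attribute-modified negatives, and a cross-entropy loss ranks the positive first. The temperature and exact formulation are assumptions based on this description.

```python
# A minimal sketch of hard fine-grained negative learning: each region feature is
# scored against 1 positive caption and 10 attribute-modified negatives.
# The temperature and loss form are assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F

def hard_negative_loss(region_feats, pos_text, neg_text, temperature=0.07):
    """
    region_feats: [K, dim]      visual features of K regions.
    pos_text:     [K, dim]      embedding of each region's matching (positive) caption.
    neg_text:     [K, 10, dim]  embeddings of the 10 hard negative captions per region.
    """
    v = F.normalize(region_feats, dim=-1).unsqueeze(1)           # [K, 1, dim]
    texts = torch.cat([pos_text.unsqueeze(1), neg_text], dim=1)  # [K, 11, dim]
    texts = F.normalize(texts, dim=-1)
    logits = (v * texts).sum(-1) / temperature                   # [K, 11] similarities
    targets = torch.zeros(logits.size(0), dtype=torch.long,
                          device=logits.device)                  # positive sits at index 0
    return F.cross_entropy(logits, targets)
```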

4. Response to Weakness 1

Thank you for raising this point. We follow the same experimental settings as Appendix C.1 and provide further qualitative results at https://anonymous.4open.science/r/ICML_RE-3CF6/fgshow.png. After performing hard negative sampling, FG-CLIP captures the regions more accurately. For example, the region highlighted for "Man in red clothes" in the first row is captured significantly better with the hard negative loss than without it.

5. Response to Weakness 3

Thank you for this suggestion. We compare several baselines and FG-CLIP on the validation set of OpenImages. We conduct bounding box classification and follow the settings of COCO. The results further demonstrate the effectiveness of FG-CLIP’s fine-grained capabilities.

| Method | Top-1 | Top-5 |
| --- | --- | --- |
| CLIP | 18.02 | 40.90 |
| EVA | 8.83 | 20.90 |
| FineCLIP | 18.10 | 42.16 |
| FG-CLIP | 20.60 | 47.43 |

6. Response to Questions For Authors

Thank you for these insightful questions. For the first four questions, we refer you to our earlier responses. Here, we focus on addressing the 5th question regarding computational resources.

For researchers with fewer computing resources, we suggest discarding the first stage and directly training FG-CLIP on the dataset of 12 million images in the second stage. We conduct this experiment using 4×NVIDIA A100 GPUs, and the training process takes approximately 14 hours. Our experimental results in https://anonymous.4open.science/r/ICML_RE-3CF6/result_different_data.md indicate that this approach achieves slightly lower performance compared to the original settings but remains effective.

Another possible method is to distill a model based on our pre-trained weights, which significantly reduces the computational burden. We plan to release detailed guidelines and tools for implementing this distillation process in future work, making our approach more accessible to researchers with limited resources.

Reviewer Comment

The authors have answered all the concerns mentioned in the weaknesses and experimental design sections. Furthermore, additional visualizations and results on datasets like OpenImages and LVIS strengthen their claims. They also propose a lower-resource training setup and plan to release a distilled model, improving accessibility. I would increase my score to accept.

Author Comment

We are very pleased that our rebuttal has largely addressed your concerns regarding the weaknesses and experimental design sections. We are also deeply grateful for the professional questions you raised, which have significantly improved the quality of our paper. In future versions, we will incorporate the content from our rebuttal into the manuscript.

Review (Rating: 2)

This paper proposes to improve the fine-grained ability of CLIP by introducing additional high-quality data and designing specific loss functions for training. As for the data, the original CLIP only uses short global caption data, while this work introduces long global captions, region-level captions, and region-level negative captions for training. For model training, this work introduces a two-stage training framework, with the first stage focusing on global-caption training and the second stage using all data. Experiments on several fine-grained downstream tasks demonstrate the effectiveness of the proposed method.

Questions For Authors

  1. In Line 216, how did you sample the K regions in a batch? Are they from the same image? If so, how do you sample K regions when the number of regions in the current image is less than K?
  2. In Line 218, how did you segment the full captions into small phrases and establish the connections with the visual boxes?

Claims And Evidence

The proposed ideas are verified by sufficient ablation studies with the newly introduced data. Table 6 demonstrates the effectiveness of global contrastive learning, regional contrastive learning, and hard fine-grained negative sample learning.

=======================

After carefully reading the rebuttal and the other reviews, I think more in-depth discussions of hard negatives should be included in the main text. The challenges mentioned in the rebuttal could possibly be addressed by some carefully designed methods. I also agree with Reviewer 1aL6 that the novelty is somewhat limited, as both fine-grained CLIP and hard-negative ideas have been explored before. But, like the other reviewers, I acknowledge that the manuscript is solid work. So my final decision is borderline.

Methods And Evaluation Criteria

Yes, the proposed method is evaluated on fine-grained tasks, e.g., fine-grained understanding, bounding box classification, long caption image-text retrieval, and open-vocabulary object detection. It also maintains strong performance on coarse-grained tasks, e.g., short caption Image-Text Retrieval, and Zero-shot Image Classification. The evaluation metrics follow the standard practices for each task.

Theoretical Claims

N/A

Experimental Designs Or Analyses

Overall, this work makes non-trivial contributions by introducing better training data with fine-grained details, which improves performance on fine-grained tasks. The training is conducted at two levels: the global image level and the local region level. Besides, hard negatives are also introduced during training.

However, this work does not sufficiently discuss all possible combinations of training losses. For example, both long and short captions are introduced for global image-level training. Do we also need to consider long and short captions for each region-level box? Hard negatives are only introduced in local region-level training. Do we also need to consider hard negatives during global image-level training? Should we consider hard negatives for both long and short captions?

There could be many loss terms if all the situations above are considered. Can the authors provide a more in-depth analysis of why they use the loss function proposed in the paper?

Supplementary Material

Yes, I read all sections of the supplementary material. One question about Table 7: the proposed FG-CLIP can still give very high scores to hard negatives in many cases. Can the authors provide more insight into this?

Relation To Broader Scientific Literature

The work is built upon CLIP and enriches its fine-grained understanding ability with newly introduced captions containing fine-grained details.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

N/A

Other Comments Or Suggestions

N/A

Ethical Review Issues

N/A

Author Response

Thank you very much for your positive feedback and recognizing the non-trivial contributions of our work. In response to your specific questions, we provide detailed explanations below, aiming to clarify any concerns.

1. Response to questions in Experimental Designs Or Analyses

Thanks for your insightful comments and suggestions. Your points about considering long and short captions at both global image level and region-level boxes, as well as the introduction of hard negatives during global image level training, are indeed critical aspects that deserve further elaboration.

As mentioned in our manuscript (line 264), the captions used for region-level training are derived from the global long caption using the SpaCy tool. Specifically, we extract region-specific descriptions from the global long caption to generate a single type of text description for each region-level box. This method is efficient and already contains detailed information about the objects within those regions. Given that this approach adequately describes the objects, it provides sufficient context for regional contrastive learning without the need for additional caption types. Therefore, we only utilized this single type of caption for Regional Contrastive Learning. This streamlined approach not only enhances computational efficiency but also ensures that the model focuses on high-quality, relevant textual descriptions, leading to improved performance.

Regarding the consideration of hard negatives during global image level training, we acknowledge that this could potentially enhance the robustness of the model. However, there are practical challenges associated with implementing this approach:

  • Limitations with Long Captions: We introduce the process of creating challenging fine-grained negative samples in Section 3.2. Specifically, we modify attributes of bounding box descriptions while keeping the object names unchanged. For global long captions, which describe multiple regions within an image, generating meaningful hard negatives by modifying attribute words for each region becomes complex. The resulting negative samples might deviate too far from the original caption, losing their effectiveness as "hard" negatives. Consequently, they may no longer serve as hard negative examples for training.
  • Limitations with Short Captions: As illustrated in Figure 1 of our manuscript, global short captions often lack detailed descriptions of individual objects or regions. This makes it difficult to create meaningful hard negatives through attribute modification, as the short captions may not contain enough fine-grained information to make such modifications impactful.

2. Response to questions in Supplementary Material

This is indeed an interesting phenomenon and can be attributed to several factors.

The fine-grained hard negatives we created involve modifying only a few attribute words in the captions. In some cases, these modifications may still result in captions that are quite similar to the original descriptions of the corresponding image regions. This similarity can cause FG-CLIP to give relatively high scores to these hard negatives because the model perceives them as still being relevant to the image regions. Additionally, in certain scenarios, the modified attributes might correspond to minor changes within the image region, which do not significantly alter the overall visual content. Consequently, the model continues to assign high scores due to the minimal perceptual difference between the original and modified captions.

3. Response to Question 1 in Questions For Authors

Thank you for your question regarding how we sample K regions in a batch (Line 216). To clarify:

K is not a fixed number but rather represents the total number of valid bounding boxes (bbox) across all images within a batch. This means that K dynamically adjusts based on the actual number of available regions in the batch. This approach ensures flexibility and adaptability in handling batches with varying numbers of regions without introducing artificial constraints.

We acknowledge that this explanation may have been unclear in the initial submission. To avoid any potential misunderstandings, we will provide a more detailed clarification in the final version of our manuscript.
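A minimal sketch of a regional contrastive loss with such a dynamic K is given below; the symmetric cross-entropy form, the temperature, and the variable names are assumptions based on the description above.

```python
# A minimal sketch of a regional contrastive loss where K is simply the total
# number of valid boxes gathered across the whole batch, as described above.
# The symmetric-CE form, temperature, and names are assumptions.
import torch
import torch.nn.functional as F

def regional_contrastive_loss(region_feats, region_text_feats, temperature=0.07):
    """
    region_feats:      [K, dim] pooled visual features for all valid boxes in the batch.
    region_text_feats: [K, dim] embeddings of the matching region captions.
    K varies per batch; no padding to a fixed number of regions is needed.
    """
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(region_text_feats, dim=-1)
    logits = v @ t.T / temperature                    # [K, K] region-to-text similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # match each region to its caption
    loss_t2v = F.cross_entropy(logits.T, targets)     # and each caption to its region
    return 0.5 * (loss_v2t + loss_t2v)
```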

4. Response to Question 2 in Questions For Authors

We introduce this process in lines 263 to 272 of our manuscript. Specifically, we utilize SpaCy to parse the captions and extract referring expressions. These extracted expressions are then fed into a detection model to generate corresponding bounding boxes. Non-maximum suppression is applied to eliminate overlapping bounding boxes, retaining only those with predicted confidence scores higher than 0.4. This method allows us to effectively link textual referring expressions with their corresponding visual elements, facilitating more accurate and contextually relevant training for our model.
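A rough sketch of this caption-to-box pipeline is shown below; `detect_boxes` is a hypothetical placeholder for an open-vocabulary detector such as YOLO-World, and its interface, as well as the NMS IoU threshold, is an assumption (only the 0.4 confidence threshold comes from the response above).

```python
# A rough sketch of the caption-to-box pipeline described above: parse referring
# expressions with spaCy, query a detector for each expression, then filter by
# confidence and NMS. `detect_boxes` is a hypothetical placeholder for an
# open-vocabulary detector; its interface is an assumption.
import spacy
import torch
from torchvision.ops import nms

nlp = spacy.load("en_core_web_sm")

def caption_to_regions(image, long_caption, detect_boxes, conf_thresh=0.4, iou_thresh=0.5):
    expressions = [chunk.text for chunk in nlp(long_caption).noun_chunks]
    regions = []
    for expr in expressions:
        boxes, scores = detect_boxes(image, expr)        # [N, 4] boxes, [N] scores
        keep = scores > conf_thresh                      # drop low-confidence detections
        boxes, scores = boxes[keep], scores[keep]
        if boxes.numel() == 0:
            continue
        kept = nms(boxes, scores, iou_thresh)            # remove overlapping boxes
        regions += [(expr, boxes[i].tolist(), float(scores[i])) for i in kept]
    return regions
```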

Final Decision

The submission received relatively positive final ratings (1 Accept, 2 Weak Accept, and 1 Weak Reject). All three positive reviewers mentioned that almost all of their concerns have been addressed. The sole negative reviewer did not respond during the discussion period. Thus, we think there are no remaining concerns about this submission and recommend Weak Accept if there is room in the program.