PaperHub
Rating: 5.3/10 · Rejected · 4 reviewers
Scores: 5, 5, 3, 8 (min 3, max 8, std 1.8)
Confidence: 4.0 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 3.3
ICLR 2025

What If We Recaption Billions of Web Images with LLaMA-3?

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
image-text datasets; synthetic captions

Reviews and Discussion

Review (Rating: 5)

This paper addresses the challenge of noisy web-crawled image-text pairs, which can undermine the performance of vision-language models. The authors propose a straightforward pipeline that uses a fine-tuned, open-source LLaMA-3-8B language model to recaption around 1.3 billion images in the DataComp-1B dataset. By improving textual descriptions, their approach, termed Recap-DataComp-1B, provides a cleaner, more semantically aligned dataset that enhances vision-language model training. Empirical results show performance improvement: for discriminative models like CLIP, zero-shot performance in cross-modal retrieval tasks improves, while for generative models such as text-to-image Diffusion Transformers, image generation aligns more closely with user instructions.

Strengths

  • Effective recaptioning pipeline improves both discriminative and generative model performance.

  • It shows that the open-source model LLaMA-3 enables accessible advancements in dataset quality.

  • Demonstrated benefits in real-world applications, notably in zero-shot retrieval and text-to-image generation tasks.

Weaknesses

  • The work does not propose new ideas or methods and focuses on empirical verification, yet it does not seem to provide in-depth or insightful experimental analysis, an examination of the limitations (e.g., hallucination) and strengths of recaptioning, or sufficiently valuable practices that would benefit the community. Most of the conclusions (e.g., that recaptioning helps image generation, as in DALL-E 3) are already known or have been explored in prior work.

  • Is training with recaptions harmful to other aspects, e.g., knowledge? For example, “Western Kingbird” is recaptioned into a detailed appearance-oriented description. After training on these recaptions, would the model still know the concept “Western Kingbird”? Some evaluations on knowledge-intensive VQA may be needed.

  • Using LongCLIP-Large to evaluate semantic alignment between long captions and images may be improper, since this model may prefer long sentences over short ones.

  • The cost of recaptioning needs to be detailed.

  • The section names, e.g., CLIP in Section 5, are not informative enough.

  • Some challenging benchmarks should be considered, e.g., Winoground [a] for retrieval and T2I-CompBench [b] for generation.

[a] Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5238-5248.

[b] Huang, K., Sun, K., Xie, E., Li, Z., and Liu, X. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems 36 (2023), pp. 78723-78747.

Questions

  • In Sec 6, why use the raw CLIP text encoder instead of an encoder trained with the recaptioned text?

  • Compared with retrieval, generation seems to prefer recaptions, since the best performance occurs at mix ratio p = 0.00 in Table 7. What is the reason?

  • Some experimental details are unclear. For instance, what does “re-caption” denote? Using re-captioned prompts for training, inference, or both?

Comment

Q5: Section name.

A5: Thanks for your suggestion; we have updated the section names in the main text.


Q6: Challenging tasks evaluation.

A6: Thank you for highlighting these benchmarks. We agree that complex evaluation tasks can better assess a model's capabilities. Following your suggestion, we have conducted evaluations on Winoground.

Model | Image Acc | Text Acc | Average
Recap-L/14-336 | 17.5 | 40.7 | 15.0
DataComp-L/14 | 11.0 | 29.0 | 7.2
ViT-L-16-SigLIP-384 | 15.3 | 39.8 | 13.0

We can observe that our model consistently and substantially outperforms the others.


Q7: Why use the raw CLIP text encoder in Section 6?

A7: Firstly, we would like to clarify that using the raw CLIP text encoder provides a consistent and fair reference point that the research community relies on. We also performed additional experiments switching to our Recap-CLIP text encoder in T2I training: while there is no quantitative metric improvement, as human observers we can qualitatively see that the generated images are slightly better, especially when rendering text.


Q8: Why generation prefers p=0?

A8: Thanks for the question. Firstly, we would like to clarify that p=0.1 achieves the best overall performance across metrics. Compared to CLIP training, which is discriminative, T2I generation typically prefers text with more detailed descriptions; such text helps obtain a more informative text latent, which in turn helps instruct the overall image generation.


Q9: What does “re-caption” denote? Using re-captioned prompts for training or inference or both?

A9: Sorry for the confusion. We have clarified our setting in the main text. Please refer to Section 6 evaluation.


Reference

[1] Fan, Lijie, et al. "Improving CLIP training with language rewrites." Advances in Neural Information Processing Systems 36 (2024).

Comment

Dear Reviewer mTNK,

We sincerely appreciate your review. We have carefully considered each of your questions and provided detailed responses in the rebuttal. Please let us know if you have any further questions or concerns.

Thanks!

Comment

Thanks for your response. I intend to keep my original score considering the following points.

  1. Novelty and contribution. I raised this concern in Q1, and the authors highlighted the construction of a large-scale recaptioning dataset in the rebuttal. However, I think such a job can be done by most researchers with sufficient computing resources, i.e., using an open-source LLM (e.g., LLaMA-3) to synthesize detailed captions and then carrying out experiments based on existing methods (e.g., CLIP and Stable Diffusion). Besides, I did not see any further insightful or in-depth analysis during the rebuttal phase.

  2. Potential drawbacks in knowledge-intensive domains. Regarding Q2, the authors' response did not resolve my concern, because the benchmarks used in the provided experiments, including ImageNet, Flickr30K, and MS-COCO, belong to general domains (e.g., daily-life scenarios), not the knowledge-intensive domains I expected.

These are the two points I am most concerned about, but regretfully the authors did not provide responses valuable enough to motivate me to raise my score.

Comment

Dear Reviewer XLPB

Thank you for your feedback. We would like to provide further clarification below:

  1. First, we would like to clarify that we intentionally kept all components --- such as the recaptioning pipeline and the model training protocol --- as simple as possible, and exactly followed existing evaluation protocols (such as which datasets should be used for evaluation). The purpose is to maximally demonstrate that, with these synthetic captions at scale, even the most straightforward and naive setting already yields strong performance. We believe this is the most convincing way to present the strong potential of this pipeline and the released dataset, and to motivate more future research in this direction.

  2. Secondly, we would like to stress that our work provides an important chance for the general public to comprehensively understand the benefits of recaption (by the advanced MLLM) at scale, which, to our best knowledge, is not provided by any prior works. By releasing our dataset to the community, we hope to motivate extensive follow-up research that will firmly establish this field, like exploring how to leverage synthetic captions to enhance model training and improving existing recaptioning pipelines to obtain higher-quality synthetic captions.

We hope these additional responses/clarifications can address your concerns. Authors of 7751

Comment

Thanks for the appreciation of our work. We address your concerns below:

Q1: Novelty and in-depth analysis.

A1: Thank you for raising this concern. We would like to stress that our primary focus is on creating Recap-DataComp-1B, which, to the best of our knowledge, is the first publicly available image-text dataset with synthetic captions scaled to the billion level using LLaMA-3. We believe this represents a novel and significant contribution to the multimodal research community. While the concept of recaptioning is not new, scaling it to this magnitude with advanced LLMs has not been seen before. More importantly, this large-scale dataset enables the first public, extensive investigations into training CLIP and T2I diffusion models with high-quality synthetic captions. For example, our results comprehensively demonstrate that Recap-DataComp-1B significantly enhances cross-modal tasks, long-context understanding, and text-to-image generation.

Furthermore, as shown in our added rebuttal experiments, our dataset consistently improves the performance of other CLIP-family models, like LaCLIP [1]. It can also serve as an improved pre-training dataset for LLaVA-family models.

Based on this evidence, we believe that Recap-DataComp-1B is a novel and important contribution to the community, with the strong potential to provide significant benefits to future multimodal research.


Q2: Is the training with recaptioning harmful to other aspects, e.g., knowledge?

A2: Thank you for raising this concern. We conduct an ablation study on a 30M subset of data with recaptions generated conditioned on the original captions, which is expected to preserve more of the knowledge in the original caption than our non-conditioned prompt. From the table below, we can see this strategy improves performance across all metrics, suggesting that designing a better way to preserve the original captions' knowledge (e.g., via prompt engineering) could further enhance our framework. We leave its systematic and comprehensive exploration to future work.

Model | Condition on Original Caption | Mix Ratio | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
L/16 | - | 1.0 | 66.1 | 48.6 | 65.3 | 30.2 | 41.7
L/16 | No | 0.6 | 67.5 | 61.1 | 77.8 | 39.5 | 54.0
L/16 | Yes | 0.6 | 68.2 | 62.0 | 78.7 | 40.1 | 55.5

Q3: LongCLIP-Large prefers long captions.

A3: Thank you for raising this concern. We would like to clarify that our primary goal is to measure the alignment between captions and images, both short and long captions. Most publicly available CLIP models, such as OpenAI-CLIP, are optimized for shorter captions, with typical text encoders trained on inputs of up to 77 tokens and often effectively using fewer than 20 tokens.

As a side note, as reported in Section 4.2 of the main paper, we comprehensively report alignment results not only from LongCLIP-Large but also from OpenAI-CLIP, GPT-4, and human evaluations. LongCLIP, GPT-4, and human evaluators consistently prefer our recaptions over the original captions.


Q4: Recaptioning cost.

A4: We benchmarked the inference speed of LLaVA-1.5-LLaMA3-8B on a TPU-V4 256, achieving a throughput of 382 images per second. At this speed, generating captions for the entire Recap-DataComp-1B dataset would require approximately 29 days with these computational resources. We will include the inference time in the next version.
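
As a quick sanity check of this figure, the arithmetic below reproduces the estimate (a rough back-of-the-envelope sketch; it assumes the ~941M successfully downloaded images mentioned elsewhere in the rebuttal rather than the nominal 1.3B):

```python
# Rough check of the quoted recaptioning cost (illustrative only).
# Assumption: ~941M valid images are captioned at 382 images/second on a TPU-V4 256 slice.
images = 940_891_257                 # downloaded-image count quoted in the rebuttal to Reviewer 47Fv
throughput = 382                     # images per second
days = images / throughput / 86_400  # 86,400 seconds per day
print(f"{days:.1f} days")            # ~28.5 days, consistent with the ~29 days reported
```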

Review (Rating: 5)

Web-crawled image-text pairs are noisy and usually not descriptive enough. This paper studies recaptioning DataComp-1B to obtain detailed and descriptive captions for images, and shows that the recaptions can benefit downstream vision-language tasks such as retrieval and text-to-image generation. The authors first fine-tune a LLaMA-3-based LLaVA-1.5, then use it as an image captioner to generate long, detailed captions for images in DataComp-1B. They analyze the caption-length distributions before and after this recaptioning step in DataComp-1B, showing that the recaptions are longer. They also use LongCLIP and GPT4V to validate that image-text alignment and caption fluency are improved by this recaptioning step. Then, they use the recaptioned DataComp-1B, named Recap-DataComp-1B, to train CLIP models (Recap-CLIP) from scratch. Specifically, they mix the original captions and recaptions in DataComp-1B to find the best way to use the recaptions. In the experimental section, they evaluate Recap-CLIP on downstream tasks including zero-shot ImageNet classification and text-to-image/image-to-text retrieval. They find the detailed captions benefit the retrieval tasks but negatively influence the zero-shot classification task. They also conduct experiments on long-caption image-text retrieval benchmarks. To further validate the quality of the recaptions, they train T2I models using the recaptions and find the detailed recaptions can help T2I generation.

Strengths

  1. The downstream tasks are comprehensive. The authors study the capabilities of recaptioning through different downstream tasks, including zero-shot classification, retrieval, and text-to-image generation. On most of the benchmarks, the recaptioning procedure brings positive effects.
  2. The authors conduct experiments on different model sizes, showing the consistent improvements in retrieval tasks.
  3. The recaptioned DataComp-1B can potentially help the development of VLMs in the future.

Weaknesses

  1. Table 3 shows the ImageNet-1K zero-shot classification results and COCO/Flickr30k retrieval results. The authors should also provide the results for mixed ratio p = 0, i.e., the performance of a model trained using only recaptions. Obviously, the recaptions bring negative effects in the zero-shot classification task. The authors discuss that a possible reason may be the absence of specific named entities. For example, in Figure 1, the example "Western Kingbird" is a valid bird class for the image, but in the detailed caption it is described as "A small, gray and yellow bird with a black beak and black eyes...". Have the authors considered generating the detailed captions based on the original caption as well? If the original caption is also provided in the input, with a proper prompt, maybe the specific named entities can be kept in the recaptions.
  2. In Table 6, it would be better if the authors could compare with MetaCLIP as well.

Questions

Could the authors provide the total inference time on recaptioning the DataComp-1B?

Comment

Thanks for the appreciation of our work. We address your concerns below:

Q1: Pure Recaption CLIP Training and Condition prompt.

A1: Thank you for raising this concern. First, we present the results of 100% synthetic caption training below:

Model | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
B/16 | 36.0 | 53.5 | 74.1 | 34.1 | 53.0

This result presents an interesting phenomenon: training exclusively with synthetic captions hurts CLIP performance significantly. More interestingly, as shown in Table 3 of the main paper, mixing a small percentage of original captions (e.g., 10%) can effectively address this performance issue. This result suggests that original captions still play a key role in CLIP training, which may deserve deeper investigation in future research.
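
For clarity, the mixed-caption training discussed here can be pictured as a per-sample random choice between the two caption sources; a minimal sketch is given below (the field names and the exact sampling scheme are illustrative assumptions, not the paper's training code):

```python
import random

def pick_caption(sample, p=0.1):
    """Choose the caption used for one training example.

    p is the mix ratio in the sense of Table 3: the probability of keeping the
    original web caption; with probability 1 - p the LLaMA-3 recaption is used.
    p = 1.0 reduces to training on original captions only, p = 0.0 to recaptions only.
    """
    if random.random() < p:
        return sample["original_caption"]  # hypothetical field name
    return sample["recaption"]             # hypothetical field name
```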

Additionally, we clarify that while recent works like LaCLIP [1] and VeCLIP [2] demonstrate synthetic captions can enhance CLIP training, to our knowledge, no prior work has trained models exclusively with synthetic captions. This makes our trial the first public attempt, highlighting the challenges of such an approach.

Regarding the absence of specific named entities, we conduct an ablation study on a 30M subset of data with recaptions generated conditioned on the original captions (examples provided in Appendix E). The results presented in the table show that the improvement is limited. Also, when training solely on conditional synthetic captions, the performance still significantly degrades, highlighting the interesting challenges of training CLIP models exclusively on synthetic captions.

Model | Condition on Original Caption | Mix Ratio | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
L/16 | - | 1.0 | 66.1 | 48.6 | 65.3 | 30.2 | 41.7
L/16 | No | 0.6 | 67.5 | 61.1 | 77.8 | 39.5 | 54.0
L/16 | Yes | 0.6 | 68.2 | 62.0 | 78.7 | 40.1 | 55.5
L/16 | Yes | 0.0 | 34.7 | 43.3 | 66.6 | 28.5 | 44.0

Q2: Compare with MetaCLIP.

A2: Thank you for pointing out this related work. We compared MetaCLIP-L/14 with our Recap-L/14, as shown in the table below. Under the same model size, our Recap-L/14 achieves superior performance across all benchmarks. Notably, on retrieval tasks, our model shows an average improvement of 5.4%. We will include this detailed comparison in the next version of our paper.

Model | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
Recap-L/14 | 79.3 | 79.5 | 94.1 | 53.7 | 72.0
MetaCLIP-L/14 | 79.2 | 76.4 | 90.1 | 47.1 | 64.4

Q3: Total inference time.

A3: We benchmark the inference speed of LLaVA-1.5-LLaMA3-8B on a TPU-V4 256, achieving a throughput of 382 images per second. At this speed, generating captions for the entire Recap-DataComp-1B dataset would require ~29 days with these computational resources. We will include the inference time in the next version.


Reference

[1] Fan, Lijie, et al. "Improving CLIP training with language rewrites." Advances in Neural Information Processing Systems 36 (2024).

[2] Lai, Zhengfeng, et al. "VeCLIP: Improving CLIP training via visual-enriched captions." European Conference on Computer Vision. Springer, Cham, 2025.

Comment

Dear Reviewer TPv5,

We sincerely appreciate your review. We have carefully considered each of your questions and provided detailed responses in the rebuttal. Please let us know if you have any further questions or concerns.

Thanks!

Comment

Dear Reviewer TPv5,

As the discussion deadline is approaching, we would like to check if our rebuttal successfully addresses your concerns. Also, feel free to let us know if you have more questions.

Thanks Authors

Review (Rating: 3)

This paper presents image recaptioning on web-crawled data empowered by LLaMA-3. The proposed LLaVA-1.5-LLaMA3-8B replaces the LLM in LLaVA with LLaMA-3 and is trained following LLaVA-1.5's setting. To enhance the caption quality in DataComp-1B, it adopts LLaVA-1.5-LLaMA3-8B for recaptioning. Extensive experiments and analyses are conducted, including the characteristics of the recaptions, the mixed usage of original captions and recaptions, the efficacy on recognition and generation, and the impact across different sizes of architectures.

Strengths

  • The computation on LLaMA-3-powered LLaVA training and recaptioning on 1B data is tremendous. The data generation and evaluation are at scale.
  • The modified and trained LLaVA-1.5-LLaMA3-8B model achieves comparable performance with the LLaVA-1.5-13B model across benchmarks for various tasks, including recognition, spatial awareness, OCR, and so on.
  • Quantitative analyses regarding caption features and semantic quality are conducted. Recap-DataComp-1B has longer, more diverse, and more aligned captions than the original ones.
  • The recaptioned data helps models with different sizes of text encoders.
  • The recaptioned data enhances the image generation quality.

Weaknesses

  • The main innovation lies in the recaptioning process, which has been adopted in data synthesis for better quality [1,2] and representation learning [3]. It is especially similar to [4], image captioning on DataComp, but provides fewer insights. Although the scale is at the billion level, the technical contribution is limited.
  • The prompt will impact the quality of image captioning. In particular, the paper's main contribution is image recaptioning. It would be better if they explored more than one prompt. These experiments do not have to be at the 1B scale. It could be on a reasonable-scale subset.
  • Even though the proposed LLaVA-1.5-LLaMA3-8B is better than previous LLaVA models on the MMMU and MM-Vet benchmarks (understanding/reasoning benchmarks), its superior image captioning ability and the necessity of LLaMA-3 remain unverified. Additionally, there are no comparisons with recaptioned data from other captioning models. It is doubtful whether the data improvement comes from the captioning itself or actually benefits from LLaMA-3.
  • It seems that the recaptioned dataset leads to much worse performance on ImageNet-1K and similar performance on the retrieval tasks. Table 3 illustrates model performance when training on different portions of original captions. Though the pure recaption result (p=0) is not reported, inferring from the trend, it is likely to be worse than the pure original-caption-trained performance (p=1). The efficacy of recaptions is questionable.

[1] Improving Image Generation with Better Captions.
[2] A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation.
[3] LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?
[4] Improving Multimodal Datasets with Image Captioning.

Questions

  • How is the current prompt chosen? How will different prompts for captioning influence the data quality?
  • Compared to other captioning models on the image captioning benchmark, how much better is the LLaVA-1.5-LLaMA3-8B? Compared to the original LLaVA recaptioned (or other model recaptioned) datacomp image-text pairs, what is the advantage of LLaVA-1.5-LLaMA3-8B?
  • In Table 3, what are the results when training on exclusive recaptions (p=0)? If the performance is worse than p=1, what are the possible reasons?
Comment

Thanks for raising many helpful comments. We address your concerns below:

Q1: Novelty.

A1: We would like to stress that our primary focus is on creating Recap-DataComp-1B, which, to the best of our knowledge, is the first publicly available image-text dataset with synthetic captions scaled to the billion level using LLaMA-3. We believe this represents a novel and significant contribution to the multimodal research community. While the concept of recaptioning is not new, scaling it to this magnitude with advanced LLMs is not seen before. More importantly, this large-scale dataset enables the first public, extensive investigations into training CLIP and T2I diffusion models with high-quality synthetic captions. For example, our results comprehensively demonstrate that Recap-DataComp-1B significantly enhances cross-modal tasks, long-context understanding, and text-to-image generation.

Furthermore, as shown in our added rebuttal experiments, our dataset consistently improves the performance of other CLIP-family models, like LaCLIP [1]. It can also serve as an improved pre-training dataset for LLaVA-family models. Based on this evidence, we believe that Recap-DataComp-1B is a novel and important contribution to the community, with the strong potential to provide significant benefits to future multimodal research.


Q2 / Q5: Prompt Ablation - How was the prompt chosen?

A2: Thanks for raising this concern. Following your suggestion, we conducted an additional ablation on a subset of 30 million image-text pairs. Specifically, we tested four types of prompts to evaluate their impact on the performance of CLIP models trained on the resulting captions. Examples of recaptions are included in Appendix E, and detailed descriptions of these prompts are provided below:


Prompts for Image Captioning

1. Original Prompt

Please generate a detailed caption of this image. Please be as descriptive as possible.

2. Concise Prompt

Please generate a short and clear explanation of the image. Please be as concise as possible.

3. Condition Prompt

Please generate a detailed caption of this image. Please be as descriptive as possible based on original caption: [ORIGINAL DATACOMP CAPTION].

4. Diverse Prompts Pool

  • Describe the image concisely as short as possible.
  • Provide a brief description of the given image.
  • Offer a succinct explanation of the picture presented.
  • Summarize the visual content of the image.
  • Give a short and clear explanation of the subsequent image.
  • Share a concise interpretation of the image provided.
  • Present a compact description of the photo’s key features.
  • Relay a brief, clear account of the picture shown.
  • Render a clear and concise summary of the photo.
  • Write a terse but informative summary of the picture.
  • Create a compact narrative representing the image presented.
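
A minimal sketch of how one prompt from such a pool might be sampled per image is shown below (the `captioner.generate` interface is a hypothetical placeholder used only to illustrate the sampling step):

```python
import random

DIVERSE_PROMPT_POOL = [
    "Describe the image concisely as short as possible.",
    "Provide a brief description of the given image.",
    "Offer a succinct explanation of the picture presented.",
    # ... remaining prompts from the pool listed above
]

def recaption_with_diverse_prompts(image, captioner):
    # Draw a different instruction for each image so caption phrasing varies.
    prompt = random.choice(DIVERSE_PROMPT_POOL)
    return captioner.generate(image, prompt)  # hypothetical captioner API
```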

Our findings, presented in the table below, show that the choice of prompt has a clear influence on CLIP model performance. Notably, these newly designed prompts outperform the baseline, with the "diverse prompt" approach yielding the best result.

Model | Prompt Type | Mix Ratio | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
L/16 | - | 1.0 | 66.1 | 48.6 | 65.3 | 30.2 | 41.7
L/16 | Original-prompt | 0.6 | 67.5 | 61.1 | 77.8 | 39.5 | 54.0
L/16 | Concise-prompt | 0.6 | 67.8 | 61.5 | 80.6 | 40.5 | 57.1
L/16 | Diverse-prompt | 0.6 | 68.2 | 63.7 | 81.4 | 42.3 | 57.3
L/16 | Condition-prompt | 0.6 | 68.2 | 62.0 | 78.7 | 40.1 | 55.5

We emphasize that, even with the most vanilla prompt setup, our recaptioned dataset has already shown superior performance in retrieval tasks compared to models trained on in-house datasets such as WebLI-5B. With the new results above, we are now more confident that our work is valuable for enabling future research to build higher-quality datasets by leveraging the latest advancements, such as new prompt engineering techniques or stronger LLMs.

We will add these results and discussions in the future work section.

Comment

Q3 / Q6: Caption model ablation. Where does the data improvement come from? What is the advantage of LLaVA-1.5-LLaMA3-8B?

A3: Our primary goal is to strike a good balance between efficiency and effectiveness. Our current choice of captioning model offers generation of decent quality at an affordable cost. It outperforms the vanilla LLaVA-1.5-7B on MMMU and MM-Vet by 3.9 and 2.6 points while running at a similar speed (395 img/s vs. 382 img/s). Also, our extensive experiments on CLIP training and T2I generation models highlight the benefit of our Recap-DataComp-1B dataset (Tables 4/6/7 in the paper), indicating the good quality of the generated captions.

On the other hand, it is a conceptually straightforward way to increase the recaption model's capability by scaling. Thus, we also compare our LLaVA-1.5-LLaMA3-8B with the publicly available OpenLLaVA-next-LLaMA3-8B model and a larger LMM, LLaVA-next-Gemma2-27B, trained on the same codebase [1]. Using identical hardware (TPU V4-256), the total estimated inference time over the whole DataComp-1B dataset for LLaVA-1.5-LLaMA3-8B, OpenLLaVA-NeXT-LLaMA3-8B, and LLaVA-NeXT-Gemma2-27B is 29, 77, and 201 days, respectively. We chose LLaVA-1.5-LLaMA3-8B to keep the total inference cost manageable.

Furthermore, we present an intriguing observation that higher LMM performance does not necessarily translate to better captioning quality. The OpenLLaVA-NeXT-LLaMA3-8B and LLaVA-NeXT-Gemma2-27B models achieve MM-Vet scores of 40.5 and 46.6, compared to the 36.5 MM-Vet score of LLaVA-1.5-LLaMA3-8B. However, as shown in the following table, the CLIP model trained with 30M LLaVA-1.5-LLaMA3-8B-recaptioned DataComp-1B samples consistently achieves better performance across classification and retrieval benchmarks.

Note that the above preliminary analysis does not indicate that scaling the captioning model does not work. We believe it is entirely possible to observe a performance boost from caption-model scaling given more advanced synthetic-caption training techniques. Our argument is that LLaVA-1.5-LLaMA3-8B generates decent captions even compared to those larger models while running significantly faster, and it has already been shown to benefit many image-text learning tasks in our experiments. Therefore, we opted for LLaVA-1.5-LLaMA3-8B to generate synthetic captions for the whole 1B dataset.

Model | Caption Model | Caption Speed | Mix Ratio | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
L/16 | - | - | 1.0 | 66.1 | 48.6 | 65.3 | 30.2 | 41.7
L/16 | LLaVA-1.5-LLaMA3-8B | 382 img/s | 0.6 | 67.5 | 61.1 | 77.8 | 39.5 | 54.0
L/16 | LLaVA-next-LLaMA3-8B | 142 img/s | 0.6 | 66.8 | 60.2 | 77.4 | 37.5 | 52.8
L/16 | LLaVA-next-Gemma2-27B | 54 img/s | 0.6 | 67.1 | 58.9 | 74.6 | 37.0 | 51.8

Q4 / Q7: Efficacy of recaptions and missing results for p=0.

A4: Thank you for raising this concern. We address the question regarding pure recaption CLIP training in our common response and kindly ask you to let us know if it resolves your concerns.

Additionally, we conducted further experiments by training CLIP solely on recaptions. We use the 30M subset mentioned in Q2. The results presented in the table show that simply modifying prompts, even conditioning on original captions, does not resolve the challenges. These findings highlight that achieving optimal performance with purely synthetic captions remains an interesting and open challenge.

Model | Prompt Type | Mix Ratio | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
L/16 | - | 1.0 | 66.1 | 48.6 | 65.3 | 30.2 | 41.7
L/16 | Original-prompt | 0.0 | 27.8 | 37.8 | 59.5 | 25.0 | 41.5
L/16 | Concise-prompt | 0.0 | 28.1 | 35.6 | 59.9 | 23.7 | 44.0
L/16 | Diverse-prompt | 0.0 | 31.2 | 44.8 | 65.0 | 30.8 | 46.5
L/16 | Condition-prompt | 0.0 | 34.7 | 43.3 | 66.6 | 28.5 | 44.0

Reference

[1] Lin, Chen, and Xing Long. "Open-LLaVA-NeXT: An open-source implementation of the LLaVA-NeXT series for facilitating the large multi-modal model community." GitHub: xiaoachen98/Open-LLaVA-NeXT (2024).

Comment

Dear Reviewer XLPB,

We sincerely appreciate your review. We have carefully considered each of your questions and provided detailed responses in the rebuttal. Please let us know if you have any further questions or concerns.

Thanks!

Comment

I appreciate the authors' response and additional experiments.

The paper has strengths, particularly the massive effort to generate captions at the billion scale. However, two major concerns remain unresolved after the rebuttal.

  • Novelty and Recaptioning. The authors state that although there are existing papers doing recaptioning at the 1B scale [1], their main novelty lies in "the first publicly available image-text dataset with synthetic captions scaled to the billion level using LLaMA-3". In this case, it would be better to delve deeper into the properties of the synthetic captions, such as similarity, diversity, and coverage. Simply using a more advanced backbone model, or public availability alone, cannot provide enough insight for future works. Moreover, given the fact that purely relying on their synthetic captions hurts performance (rebuttal experiment), it would be more convincing if the paper shared findings about what types of synthetic captions are more effective.
  • Improvement and Recaptioning. It is well established that recaptioning can help. The paper proposes a new architecture integrated with LLaMA-3. However, the authors did not directly address the correlation between captioning quality and performance improvement, even during the rebuttal. The two benchmarks they used to evaluate their model, MMMU and MM-Vet (understanding/reasoning benchmarks), are irrelevant to captioning. In other words, it is likely that, without the massive computation on the LLaMA-3 model, applying other powerful captioning models to DataComp-1B could achieve good performance. It is necessary to compare the captioning ability of the proposed model with other captioning models, or to compare the performance of training on their synthetic captions against training on other models' synthetic captions.

Overall, while the paper took a good first step, it has several pivotal logical issues. I encourage the authors to keep working on it and look forward to their revision.

[1] Improving Multimodal Datasets with Image Captioning.

Comment

Dear Reviewer XLPB

Thank you for your thoughtful feedback and for recognizing the strengths of our paper, particularly our efforts in generating captions at the billion-scale. We appreciate the opportunity to address your concerns and provide further clarification.

  1. We respectfully disagree that "It is well-established that recaptioning can help". While there is a general understanding that recaptioning can be beneficial --- as noted in some blogs or brief technical reports from companies with closed-weight models --- the research community actually has a limited grasp of this topic, e.g., we lack detailed knowledge about how recaptioning helps and to what extent, especially concerning the training of image-text foundation models. Our work aims to significantly bridge this gap --- by releasing our dataset to the community, we hope to motivate extensive follow-up research that will firmly establish this field, like exploring how to leverage synthetic captions to enhance model training and improving existing recaptioning pipelines to obtain higher-quality synthetic captions.

  2. We would like to clarify that we intentionally kept all components—such as the recaptioning pipeline and the model training protocol—as simple as possible. The purpose is to maximally demonstrate that, with these synthetic captions at scale, even the most straightforward and naive setting already yields strong performance. We believe this is the most convincing way to present the strong potential of this pipeline and the released dataset, and to motivate more future research in this direction.

  3. Lastly, we genuinely appreciate your comments, which we find very interesting and worth exploring. However, rather than interpreting these points as weaknesses of our paper, we prefer to view them as successful inspirations from our work for future research directions. We strongly believe that each of these topics alone could contribute to a strong paper --- for example, see one of our follow-up works on leveraging synthetic captions to train SOTA CLIP models ( https://drive.google.com/file/d/1zKxArcescErVbNvgALAEmM6DgDYlFOdZ/view?usp=sharing ) --- and should not be integrated into this paper, as that would complicate the main narrative.

We hope these additional responses/clarifications can address your concerns. Authors of 7751

Review (Rating: 8)

The paper proposes a version of DataComp-1B wherein images are re-captioned by a LLaVA fine-tune of Llama3-8B. The paper trains CLIP models on this dataset demonstrating improvements in image retrieval and regressions on classification. The authors also demonstrate applications of the dataset for text2image generation.

Strengths

S1. Well written and clear.

S2. Well positioned in the literature.

S3. Extremely non-trivial engineering effort, which I consider a strength.

S4. Very useful dataset contribution (both in terms of quantity of data and scale), opening many future avenues in synthetic data and in studying the empirical phenomena associated with training on this kind of data.

S5. Simple approach allowing the authors to study empirical phenomena, which is the heart of the science in this paper.

S6. Further validation of the LLaVA recipe and its usefulness for creating VLMs.

Weaknesses

W1. What is "Word distributions of our re-captions and the original captions." in Figure 1? Could you clarify this in the caption. Might also be too much information for this figure and you can consider moving that.

W2. Consider including some numerical outcomes of the dataset in the abstract/intro. e.g., training a CLIP model on [ours] shows a XX percentage point improvement when compared to training on [baseline] on [benchmark].

W3. Missing citation. MINT-1T. Awadalla, et al. 2024.

W4. More evaluation. Could be helpful to add some more evals, for example the average score from the DataComp paper for the CLIP models: https://github.com/mlfoundations/datacomp?tab=readme-ov-file#evaluation. Are there any salient trends in specific datasets and domains? For example, I imagine there could be boosts or hits in non-ImageNet tasks, which seem important to document.

W5. Can the proposed dataset be used in the LLaVA training mix, replacing the LAION fraction, to get a better LLaVA model (conceptually, as a data flywheel).

W6. Given that this is a large download, it could be good to provide statistics for the community about how many images were actually downloaded––to give the community an idea about the link rot affecting DataComp.

W7. The method itself is not so different from that of other papers (e.g., Nguyen et al. Improving multimodal datasets with image captioning. 2023). However, given that the contributions of this paper are framed properly as a dataset and empirical results, I consider this a very minor weakness. In fact, if other reviewers complain heavily about novelty, I am willing to go to bat for this paper as I feel this criticism is misguided and overlooks the core contributions.

Questions

Please see Weaknesses. Questions are included there with the corresponding "Weakness".

Comment

Thanks for strongly supporting our paper. We address your concerns below:


Q1: Word Distribution

A1: Thanks for your question. In Figure 1, "Word distributions" refers to the frequency distributions of words appearing in the captions from both datasets. Specifically, it illustrates how often different words are used in our re-captioned data compared to the original captions. We also include detailed clarification in Section 4.1.
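
As an illustration of what such a distribution is, a simple word-frequency count over each caption set could look like the sketch below (illustrative only, not the paper's analysis code):

```python
from collections import Counter

def word_distribution(captions):
    """Frequency of each lower-cased word across a list of captions."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts

# Compare the most common words in the two caption sets, e.g.:
# word_distribution(original_captions).most_common(50)
# word_distribution(recaptions).most_common(50)
```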


Q2 / Q3: Writing and Related Work

A2 / A3: Thank you for the suggestion. We have updated our abstract and included this related work.


Q4: More Evaluation

A4: Thank you for the valuable suggestion. We conducted additional evaluations of our CLIP-L/14 model using the Datacomp evaluation tool [1] and compared it to the same model size (L/14) released in [1]. The average scores are presented in the table below. Our model achieves superior performance on 23 out of 38 datasets, particularly excelling in distribution shift and retrieval tasks. These results will be included in our next version.

Model | Dataset | Avg. on 38 datasets | IN1K | IN1K dist. shifts | VTAB | Retrieval
ViT-L/14 | DataComp | 66.3 | 79.2 | 67.9 | 67.4 | 60.8
ViT-L/14 | Recap-DataComp | 66.6 | 79.3 | 71.1 | 65.9 | 64.8

Q5: LLaVA Training Mix

A5: Thank you for the insightful suggestion. Following your advice, we replaced the pre-training dataset of LLaVA 1.5, which originally consists of 558K concept-balanced image-caption pairs from LAION, CC-3M, and SBU, with randomly selected 558K samples from our Recap-DataComp-1B dataset. To ensure a fair comparison, we only updated the pre-training data while retaining the same pre-training recipe. The fine-tuning stage is exactly the same as LLaVA-1.5. As shown in the table below:

Pre-train Dataset | Tunable Module | Training Steps | TextVQA | MME | VQA-V2 | MM-Vet
LLaVA-LCS-558K | Projector | 2K | 59.1 | 1489/277 | 77.9 | 34.4
Recap-DataComp-1B (Recaption Only) | Projector | 2K | 60.1 | 1523/260 | 78.6 | 35.1

We observed that pre-training LLaVA on our synthetic dataset results in a non-trivial performance boost, highlighting the quality and scalability of our dataset. Notably, these results were achieved with random samples from Recap-DataComp-1B without any data curation, suggesting that further improvements are possible with a data filtering process.


Q6: Dataset Statistics

A6: Our download date for this version is November 2023, with a total of 940,891,257 valid images successfully downloaded after de-duplication and validity checks. The download success rate for valid images stands at approximately 67.8%. Detailed statistics will be provided in the next version.


Q7: Prior Work (Nguyen et al.)

A7: Thanks for correctly understanding the major contribution of this paper! While highly related to prior recaption works, our focus is on creating such a dataset at scale using an advanced LLM and conducting a comprehensive study on how this dataset benefits the training of various image-text models.

Reference

[1] Gadre, Samir Yitzhak, et al. "Datacomp: In search of the next generation of multimodal datasets." Advances in Neural Information Processing Systems 36 (2024).

Comment

Thanks for running the requested experiments! The results indeed look very nice! I maintain my rating for now and strongly suggest that ACs look into the details of this paper.

Comment

We are glad to see your concerns are well addressed by our rebuttal!

Thanks again for your strong support!!

Comment

We thank all reviewers for their thoughtful feedback, which will help us improve the quality of this paper. We are encouraged by the recognition of our contributions:

Dataset Contribution

  • Reviewer 47Fv: "Very useful dataset contribution in terms of scale and quality."
  • Reviewer TPv5: "Recaptioned DataComp-1B can advance VLM development."
  • Reviewer mTNK: "Recap-DataComp-1B provides a cleaner, more aligned dataset for training."

Recaptioning Pipeline

  • Reviewer XLPB: "LLaVA-1.5-LLaMA3-8B achieves comparable performance to LLaVA-1.5-13B."
  • Reviewer TPv5: "Consistent retrieval task improvements across model sizes."
  • Reviewer mTNK: "Pipeline improves discriminative and generative model performance."

Validation and Impact

  • Reviewer 47Fv: "Further validation of LLaVA’s utility for VLMs."
  • Reviewer TPv5: "Comprehensive tasks demonstrate the benefits of recaptioning."
  • Reviewer mTNK: "Clear improvements in zero-shot retrieval and text-to-image generation."

Engineering Effort

  • Reviewer 47Fv: "Non-trivial engineering effort is a key strength."
  • Reviewer XLPB: "Tremendous computation effort on LLaMA-3 recaptioning."

We are motivated by this feedback and will address concerns to further enhance our work. Here we address commonly raised concerns:


Q1: Novelty (Reviewer XLPB, mTNK)

A1: We would like to stress that our primary focus is on creating Recap-DataComp-1B, which, to the best of our knowledge, is the first publicly available image-text dataset with synthetic captions scaled to the billion level using LLaMA-3. We believe this represents a novel and significant contribution to the multimodal research community. While the concept of recaptioning is not new, scaling it to this magnitude with advanced LLMs has not been seen before. More importantly, this large-scale dataset enables the first public, extensive investigations into training CLIP and T2I diffusion models with high-quality synthetic captions. For example, our results comprehensively demonstrate that Recap-DataComp-1B significantly enhances cross-modal tasks, long-context understanding, and text-to-image generation.

Furthermore, as shown in our added rebuttal experiments, our dataset consistently improves the performance of other CLIP-family models, like LaCLIP [1]. It can also serve as an improved pre-training dataset for LLaVA-family models.

Based on this evidence, we believe that Recap-DataComp-1B is a novel and important contribution to the community, with the strong potential to provide significant benefits to future multimodal research.


Comment

Q2: Pure Recaption for CLIP Training (Reviewer XLPB, TPv5, and mTNK)

A2: We appreciate the reviewers raising concerns about training on purely synthetic captions. First, we present the results of 100% synthetic caption training below:

Model | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
B/16 | 36.0 | 53.5 | 74.1 | 34.1 | 53.0

This result presents an interesting phenomenon: training exclusively with synthetic captions hurts CLIP performance significantly. More interestingly, as shown in Table 3 of the main paper, mixing a small percentage of original captions (e.g., 10%) can effectively address this performance issue. This result suggests that original captions still play a key role in CLIP training, which may deserve deeper investigation in future research.

Additionally, we clarify that while recent works like LaCLIP [1] and VeCLIP [2] demonstrate synthetic captions can enhance CLIP training, to our knowledge, no prior work has trained models exclusively with synthetic captions. This makes our trial the first public attempt, highlighting the challenges of such an approach.

To address the concern about relatively low usage of recaptions in our experiments, we conducted additional evaluations with enhanced training methods like LaCLIP-MT, which leverages multi-positive contrastive loss to train with 100% synthetic recaptions and 100% original captions. The results are presented below:

Model | Method | Usage of Our Recaptions | IN1K | Flickr T2I | Flickr I2T | COCO T2I | COCO I2T
B/16 | Baseline | 0% | 70.5 | 64.4 | 84.1 | 38.9 | 59.5
B/16 | Mix-training (ours) | 20% | 69.8 | 67.1 | 86.7 | 42.7 | 62.8
B/16 | LaCLIP-MT | 100% | 65.8 | 71.6 | 89.9 | 44.9 | 65.4

Note:

  • LaCLIP-MT: This is the Multi-Text version of LaCLIP as described in Section 5 of the original paper. In implementing the multi-positive loss, we always include one original caption and one synthetic caption.
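
For readers unfamiliar with the multi-positive setup, the sketch below shows one simplified way to average a symmetric CLIP loss over the two caption sources (an assumption-laden illustration of the idea, not the exact LaCLIP-MT objective; embeddings are assumed L2-normalized):

```python
import torch
import torch.nn.functional as F

def multi_positive_clip_loss(img_emb, txt_orig, txt_syn, temperature=0.07):
    """Symmetric CLIP loss averaged over two positive captions per image.

    img_emb, txt_orig, txt_syn: (N, D) L2-normalized embeddings, where txt_orig
    encodes the original web captions and txt_syn the synthetic recaptions.
    """
    def clip_loss(img, txt):
        logits = img @ txt.t() / temperature
        labels = torch.arange(img.size(0), device=img.device)
        # image-to-text and text-to-image cross-entropy
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # Treat both the original and the synthetic caption as positives for each image.
    return 0.5 * (clip_loss(img_emb, txt_orig) + clip_loss(img_emb, txt_syn))
```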

As shown, utilizing 100% synthetic recaptions with LaCLIP-MT leads to substantial improvements in retrieval benchmarks, demonstrating the utility of our dataset. While there is a slight decrease in ImageNet-1K zero-shot classification accuracy, this is consistent with observations in prior work [2,3] and highlights the need for further research into effective training techniques with fully synthetic captions.

Additionally, as suggested by Reviewer 47Fv (Q5), we replaced the pre-training dataset of LLaVA 1.5 [4], which originally consists of 558K concept-balanced image-caption pairs from LAION, CC-3M, and SBU, with randomly selected 558K samples from Recap-DataComp-1B. To ensure a fair comparison, we only updated the pre-training data while retaining the same pre-training recipe. Fine-tuning was performed as in LLaVA-1.5. The results are:

Pre-train Dataset | Tunable Module | Training Steps | TextVQA | MME | VQA-V2 | MM-Vet
LLaVA-LCS-558K | Projector | 2K | 59.1 | 1489/277 | 77.9 | 34.4
Recap-DataComp-1B (Recaption Only) | Projector | 2K | 60.1 | 1523/260 | 78.6 | 35.1

We observed a non-trivial performance boost when pre-training on our synthetic dataset, indicating its potential as a valuable resource. These results were achieved with random samples from Recap-DataComp-1B without data curation, suggesting further improvements are possible with a filtering process.

We hope these new results address the reviewers' concerns and increase confidence in the quality and value of our proposed dataset.

Note: All manuscript modifications are highlighted in blue. We are happy to address further questions or concerns during the rebuttal period.

Reference

[1] Fan, Lijie, et al. "Improving CLIP training with language rewrites." Advances in Neural Information Processing Systems 36 (2024).
[2] Lai, Zhengfeng, et al. "VeCLIP: Improving CLIP training via visual-enriched captions." European Conference on Computer Vision. Springer, Cham, 2025.
[3] Zhang, Beichen, et al. "Long-CLIP: Unlocking the long-text capability of CLIP." European Conference on Computer Vision. Springer, Cham, 2025.
[4] Liu, Haotian, et al. "Improved baselines with visual instruction tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

AC Meta-Review

This paper shows the empirical impact of the re-captioned DataComp-1B dataset, using a LLaVA-1.5 fine-tuned from Llama3-8B. Using the re-captioned DataComp-1B, this paper achieves a better CLIP model in terms of ImageNet classification and cross-modal retrieval. In addition, this paper shows that by choosing a proper mixing ratio between the re-captioned and original captions, text-to-image generation (DiT) achieves better FID and CLIP scores.

This paper received mixed opinions, mostly negative ones. Several concerns were raised by the reviewers. I list the main concerns and my opinion on each item below.

  • Lack of novelty (XLPB, mTNK): I partially agree that the use of re-generated captions for training is not a novel contribution. However, the main contribution of this paper is more empirical; as the title asks, we do not know "What if We Recaption Billions of Web Images with LLaMA-3?" This paper investigates this question, and if the re-captioned dataset and the models trained on it showed important empirical findings, I would consider that sufficient empirical contribution. However, as explained in my following opinions, I think the empirical findings are somewhat too weak to be published as an ICLR paper as is.
  • Lack of comparison benchmarks (mTNK): I partially agree with this comment. The authors provided an additional result based on the 38 tasks of DataComp. In those experiments, we can observe that the CLIP model trained on the re-captioned DataComp-1B performs worse in 14 VTAB tasks with a non-negligible gap (67.4 vs. 65.9). This shows that the re-captioning does not perform better than the original captions, which is contradictory to the main argument of this submission.
  • It is unclear whether the main contribution comes from LLaMA-3 or from re-captioning (XLPB, mTNK): Similar to the first item, this could be handled if there were enough empirical findings. I think comparing all possible LLMs or captioners for this task is not feasible. However, it is also true that the current design choices (e.g., the captioning model, its base LLM, and the generation prompt) are not rigorously studied. I understand that this costs a lot (it already takes 29 days with a TPU-V4 256), but a more systematic approach may be needed. The additional results provided by the authors during the rebuttal may partially address this concern, but the revised paper does not include these results.

Overall, although this paper tackles a very interesting question, "What if We Recaption Billions of Web Images with LLaMA-3?", the answer provided by this paper looks somewhat weak. Namely, it is unclear what we can expect from the re-captioned dataset and the model trained on it, because (1) to get a better result, many design choices are needed, especially for the mixing parameter p, (2) the newly trained model does not overwhelm the existing model (it shows worse performance on VTAB), and (3) the overall re-captioning process is very expensive (~29 days with a TPU-V4 256). Considering the expensive prediction cost of re-captioning and the parameter search cost (a different p requires a different CLIP training run), the improvement and the findings are somewhat weak.

Additional Comments from the Reviewer Discussion

During the discussion phase, the authors clarified the contributions of their submission and provided additional experimental results (an ablation study on a 30M subset of data with recaptions generated conditioned on the original captions, a comparison with MetaCLIP, CLIP trained with different captioning prompts and captioners, and additional benchmarks such as the DataComp evaluation suite).

However, most of the reviewers did not change their initial opinions. I partially agree with them (the current submission may need more empirical findings and insights) and partially disagree with them (on the point that using existing building blocks is not novel). My detailed opinions on the rejection are described in the meta-review.

Final Decision

Reject