PaperHub
Rating: 7.5 / 10 (Spotlight, 4 reviewers)
Individual ratings: 8, 8, 8, 6 (min 6, max 8, std 0.9)
Confidence: 4.0
Correctness: 3.0
Contribution: 3.3
Presentation: 3.3
TL;DR

We introduce OmniCorpus, the first 10 billion-level image-text interleaved dataset with diverse sources, providing a robust foundation for future multimodal model research.

Abstract

Keywords
Image-text interleaved dataset

Reviews and Discussion

Review
Rating: 8

This manuscript introduces OmniCorpus, a massive multimodal dataset consisting of 10 billion-level images interleaved with text. This dataset is designed to support the development of multimodal large language models by providing a more diverse and larger scale of image-text data compared to existing datasets. The contributions of this manuscript include the introduction of the largest multimodal dataset, a set of tools and algorithms for data processing, and extensive experiments that validate the dataset's quality and effectiveness. The authors conducted experiments to explore the effectiveness of image-text interleaved data for few-shot capabilities and language model maintenance. They also compared OmniCorpus with other datasets and found that their dataset outperforms others in terms of quality and diversity.

Strengths

Large Scale: The dataset boasts an unprecedented scale of 8.6 billion images and 1.696 trillion text tokens, making it the largest multimodal dataset available.

Diversity: OmniCorpus includes data from a wide range of sources, including both English and non-English websites, as well as video-centric platforms, which enhances the diversity of the dataset.

Usability: The dataset has been validated through comprehensive analysis and experiments, demonstrating its quality, usability, and effectiveness.

Writing: This manuscript is well-written, with clear motivation and solid experimental discussions.

Weaknesses

Bias: The paper acknowledges potential biases in the dataset but does not provide a detailed analysis of these biases.

Filtering Mechanisms: The current filtering process may not be sufficient to ensure high-quality data.

Questions

Will all the data processing code and the data be open-sourced?

Comment

Q1: The paper acknowledges potential biases in the dataset but does not provide a detailed analysis of these biases.

A1: Collecting large-scale image-text data from the internet is widely regarded as an effective yet imperfect approach for scaling up large vision-language models. While we have identified and filtered high-risk sources of bias, we acknowledge that potential biases still exist in the dataset. Similar to other web-crawled datasets (e.g., OBELICS[1], LAION[2], and MMC4[3]), the source data inherently contains systemic social biases (e.g., subpopulation depiction and racism), which are difficult to completely eliminate through automated processing.

To analyze these potential biases, we computed the normalized Pointwise Mutual Information (nPMI) metric on 500k documents using the data-measurements-tool. The nPMI quantifies the association between terms, categories, or attributes, revealing potential biases by highlighting statistically significant relationships. For example, if certain word pairs have high nPMI scores, this suggests they frequently co-occur in the data, potentially reflecting stereotypes or biases.
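For concreteness, the sketch below shows one way to compute document-level nPMI scores between a target term (e.g., "she") and co-occurring words. The counting scheme and function are illustrative assumptions, not the exact implementation of the data-measurements-tool.

import math
from collections import Counter

def npmi_with_target(documents: list[str], target: str) -> dict:
    # Document-level co-occurrence: a word "occurs" if it appears at least once in a document.
    n_docs = len(documents)
    term_freq, pair_freq = Counter(), Counter()
    for doc in documents:
        vocab = set(doc.lower().split())
        term_freq.update(vocab)
        if target in vocab:
            pair_freq.update(w for w in vocab if w != target)
    scores = {}
    for w, n_tw in pair_freq.items():
        p_t = term_freq[target] / n_docs
        p_w = term_freq[w] / n_docs
        p_tw = n_tw / n_docs
        if p_tw >= 1.0:  # avoid a zero normalizer when a pair occurs in every document
            continue
        pmi = math.log(p_tw / (p_t * p_w))
        scores[w] = pmi / (-math.log(p_tw))  # nPMI lies in [-1, 1]
    return scores  # high-scoring words reveal topics strongly associated with the target term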

The results are uploaded to the supplementary material (bias_analysis folder; we ran the calculation twice, see the npmi_1 and npmi_2 folders). Given the extensive analysis results, we present one bias analysis focused on gender-related content. We will make the detailed analysis results and analysis methodology available in our open-source repository, allowing users to review and consider them before use. Below, we focus on nPMI scores for female-associated terms (“she” and “woman”; see combined-she-woman.md) and male-associated terms (“he” and “man”; see combined-he-man.md). The results indicate that male-associated terms are more frequently linked with topics like the military, politics, and racing, while female-associated terms are often related to animals and domestic life.

We encourage further research into bias analysis and mitigation strategies for internet-sourced image-text datasets. Additionally, we are committed to continuously maintaining the dataset and updating it with improved solutions as they become available.

Q2: Filtering Mechanisms: The current filtering process may not be sufficient to ensure high-quality data.

A2: We have spared no effort to ensure data quality. As demonstrated by the comparison results in Table 5, the quality of our documents is higher than that of the counterparts. As described in Appendix C.4 of the revised manuscript, we strive to improve data quality by using a stricter filtering process than previous large-scale multimodal corpora. Human-feedback filtering is currently the most effective method for significantly improving data quality. The rules were iteratively refined to ensure that most unexpected content is filtered while the false positive rate is minimized. Hence, the data quality is ensured through substantial manual processing. Additionally, users can flexibly enhance data quality further by filtering on metadata according to their specific requirements.

Q3: Will all the data processing code and the data be open-sourced?

A3: Yes, as described in the '1 Introduction' and 'A.2 Release And Maintaining' sections. We follow common practices of dataset research, such as OBELICS[1], to release our work. We upload all the processed documents to public data hosting platforms. In addition to releasing the data, we will uphold transparency in data collection and the reproducibility of model results. The developed human-feedback filtering functions and enhanced main-body extraction tools will be made available. The code for interleaved image-text pre-training with OmniCorpus, along with scripts for few-shot evaluation, will also be provided in the GitHub repository.

Reference:

[1] Obelics: An open web-scale filtered dataset of interleaved image-text documents

[2] LAION-5B: An open large-scale dataset for training next generation image-text models

[3] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Review
Rating: 8

This paper introduces OmniCorpus, a 10-billion-level open-source image-text interleaved dataset. It also proposes an efficient data engine and conducts comprehensive analysis and experiments.

Strengths

  1. The largest open-source multimodal dataset to date. It pushes the boundaries of scale and diversity by encompassing 8.6 billion images interleaved with 1,696 billion text tokens.
  2. A comprehensive set of tools and algorithms, including a streaming data format that unifies multimodal data from various sources, an efficient and scalable data engine capable of processing large-scale data, and human-feedback filters to ensure high-quality data.
  3. Comprehensive analysis and experiments.

Weaknesses

No obvious shortcomings observed.

Questions

No obvious shortcomings observed.

Comment

Thank you for your feedback. We appreciate your recognition of our work.

Review
Rating: 8

The paper introduces a novel multi-modal and bilingual corpus at the billions scale with an image-text interleaved format. The dataset has three subsets (CC, CW, YT). The data processing and filtering steps for each subset are carefully designed and clearly explained. Extensive experiments are performed to assess both the quality and the value of the dataset for multi-modal large language modeling tasks. Several pre-training and fine-tuning ablations demonstrate the value of the proposed dataset.

Strengths

  • Largest open-source multi-modal dataset to date (8.6B images interleaved with text)
  • Describes a detailed framework to collect and curate large multi-modal datasets at scale.
  • Extensive ablations showing the value of the dataset with interleaved text format for few-shot and other multi-modal understanding tasks.

Weaknesses

  • It is not 100% clear if one can replicate the same data collection process and produce datasets of similar quality. The models used for filtering content, the thresholds, and potentially other important details seem to be missing.
  • Table metrics and abbreviations need to be explained for clarity.

Questions

A few quick notes for authors to improve the paper:

  1. Table metrics and abbreviations should be clearly explained in the text where applicable.
  2. minihash -> MinHash. We also need to understand what hash functions are used to better understand how the dedup is performed for repeatability.
  3. Table 5 - Shall the authors do a deep dive for the few shot evaluations with the proposed pre-training dataset (especially for COCO dataset)? We need deeper understanding of why the few shot metric improvements are so high. Examples would also help especially for those the baseline models fail but the model pre-trained with OmniCorpus performs better.
  4. Do the authors plan to release the code used for dataset generation?

Ethics Concerns

There is no mention of whether EU data is included or not (or whether it goes through a different treatment). If EU data is included, we need to make sure that the GDPR rules are followed. Not a legal expert here, but flagging this potential issue for visibility.

Comment

Q5: Do the authors plan to release the code used for dataset generation?

A5: Yes, as described in the '1 Introduction' section and 'A.2 Release And Maintaining'. In addition to releasing all the processed data, we will release the code for processing English and Chinese documents, such as the developed human-feedback filtering functions and enhanced main-body extraction tools.

Q6: Concern on GDPR.

A6: We have made extensive efforts to ensure the compliance and legality of our dataset collection process.

First, our data sources are carefully selected to minimize risks. OmniCorpus-CC is based on Common Crawl, a widely used resource in datasets like OBELICS[10] and LAION[1], which collects publicly accessible web content with inherently low risks of including sensitive personal information. OmniCorpus-CW comprises Chinese internet data sourced entirely outside the European Union, eliminating concerns related to GDPR and fully complying with relevant laws of any country and terms of use. OmniCorpus-YT is derived from existing open video datasets, whose compliance has been validated in prior works.

Furthermore, we prioritize privacy by actively removing sensitive content, including personal identifiers, phone numbers, bank account details, email addresses, social media handles, and any content with opt-out signals. As described in the datasheet in Appendix A.1.7, we will delete specific samples identified as sensitive.

As a result, our dataset does not include EU data that raises GDPR concerns. Additionally, we will actively maintain the dataset to further enhance privacy and safety according to subsequent research and community feedback.

Reference:

[1] LAION-5B: An open large-scale dataset for training next generation image-text models

[2] Bert: Pre-training of deep bidirectional transformers for language understanding

[3] WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

[4] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

[5] Towards VQA Models That Can Read

[6] Microsoft COCO Captions: Data Collection and Evaluation Server

[7] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

[8] On the resemblance and containment of documents

[9] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

[10] Obelics: An open web-scale filtered dataset of interleaved image-text documents

Comment

Q1: Concern on missing details and thresholds of the data collection process.

A1: In the revised manuscript, we've included the missing details and thresholds:

(1) We provide the source and thresholds of the models used for filtering in Section 3.1. We filtered out low-quality images using the LAION-Aesthetic Predictor[1] with a threshold of 3.7 and the LAION-NSFW Detector[1] with a threshold of 0.8. For detailed text filtering, we use several BERT[2] classifiers of WanJuan-CC[3] to score advertisement content, political content, toxic content, NSFW material, and document fluency.

(2) We add a subsection named "preliminary text filtering" in Appendix C.3 of the revised manuscript, including the detailed description and thresholds of the filtering functions. For more details, please refer to the revised manuscript.

We have added as many details as possible in the revised appendix. We summarize the descriptions here: Section 3 introduces the five key stages of the overall pipeline, the key improvements of our pipeline, the procedure of human-feedback filtering, and the streaming data format. Appendix C.3 presents the descriptions and thresholds of the "preliminary text filtering" stage. Appendix C.2 presents the description and false positive rate of the human-feedback filters in the "detailed text filtering" stage. More details will also be available in our open-source repository.
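Returning to the thresholds in (1): a minimal illustrative sketch of the image-level filtering is given below. The helper function is hypothetical, only the two thresholds are taken from the answer above, and the reading that a higher NSFW score means a higher predicted NSFW probability is an assumption.

AESTHETIC_THRESHOLD = 3.7   # LAION-Aesthetic Predictor threshold reported above
NSFW_THRESHOLD = 0.8        # LAION-NSFW Detector threshold reported above

def keep_image(aesthetic_score: float, nsfw_score: float) -> bool:
    # Keep an image only if it is aesthetic enough and unlikely to be NSFW.
    return aesthetic_score >= AESTHETIC_THRESHOLD and nsfw_score < NSFW_THRESHOLD

The document-level text classifiers (advertisement, politics, toxicity, NSFW, fluency) follow the same score-against-threshold pattern, with the exact rules and thresholds given in Appendix C.3.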

Q2: Table metrics and abbreviations should be clearly explained in the text where applicable.

A2: Thanks for the suggestions. We have clarified the metrics and abbreviations in the revised manuscript.

Specifically, in Table 1, "#" denotes "The number of". (modified in the caption of Table 1)

In Table 2-4, "Avg. MLLM acc." means the mean value of the scores on OKVQA[4], TextVQA[5], COCO[6], and Flickr30k[7]. (added to the "Evaluation" paragraph of Section 5.2)
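Stated explicitly, this definition amounts to:

Avg. MLLM acc. = (OKVQA score + TextVQA score + COCO score + Flickr30k score) / 4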

In Table 2-5, "#Shot" means the number of in-context examples. (added to the "Evaluation" paragraph of Section 5.2)

Q3: minihash -> MinHash. We also need to understand what hash functions used to better understand how the dedup is performed for repeatability.

A3: Thanks, we'll fix this typo in the manuscript.

We used MinHash[8] to compare the text content and removed duplicate documents with a threshold of 0.8, which discarded approximately 90% of duplicates. We also computed perceptual hash (pHash) and difference hash (dHash) values for the images and removed images that appeared more than 10 times across the dataset.
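A minimal sketch of this two-stage deduplication, assuming the datasketch and imagehash libraries; the number of hash permutations and the helper functions are assumptions, while the 0.8 Jaccard threshold and the 10-occurrence limit come from the answer above.

from collections import Counter
from datasketch import MinHash, MinHashLSH
from PIL import Image
import imagehash

NUM_PERM = 128  # number of MinHash permutations; an assumption, not reported in the paper
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
image_hash_counts = Counter()

def is_duplicate_text(doc_id: str, text: str) -> bool:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    if lsh.query(m):        # a previously indexed document is a near-duplicate
        return True
    lsh.insert(doc_id, m)
    return False

def is_overused_image(path: str, max_occurrences: int = 10) -> bool:
    img = Image.open(path)
    key = (str(imagehash.phash(img)), str(imagehash.dhash(img)))
    image_hash_counts[key] += 1
    return image_hash_counts[key] > max_occurrences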

Q4: A deep dive for the few shot evaluations.

A4: We upload comparison examples of our model and OpenFlamingo-9B[9] on the 4-shot COCO[6] and OKVQA[4] datasets to the supplementary material (few_shot_examples.pdf).

These few-shot examples demonstrate that our model produces fewer hallucinations and more detailed descriptions in the image captioning task. Moreover, it provides more accurate and concise answers in the visual question-answering task. This validates that our dataset enhances the model’s contextual learning ability and text generation quality, which can be attributed to the higher quality and stronger contextual relevance of our documents.

Comment

The authors have addressed my questions with sufficient level of detail. Although I did not change the overall rating, I increased the confidence level of the prior assessment.

Review
Rating: 6

The paper presents OmniCorpus, a large multimodal (text and vision) and multilingual (English and Chinese) dataset containing billions of images interleaved with trillions of tokens. The paper explains how the data was obtained, filtered, and formatted, and presents several experiments conducted on the dataset (i.e. training a VLM on the dataset from existing publicly-available vision encoders and text decoders).

Strengths

  • The paper presents OmniCorpus, a large multimodal (text and vision) and multilingual (English and Chinese) dataset containing hundreds of millions of documents (billions of images, and trillions of tokens). This is by far the largest publicly available dataset that I know of, which can increase the amount of data available to conduct research on Vision-Language Models.
  • The dataset has been carefully deduplicated and filtered to prevent NSFW content, personal information, offensive content, etc.

Weaknesses

  • As with many other datasets using crawled data from the Internet, it's not clear if 1) the authors of the paper themselves followed the terms of use of the sources of the data, and more importantly (from the user's perspective) if 2) the use of the downloaded data by the users (people training VLMs) may be subject to different terms of use / restrictions that are not directly stated anywhere, and may depend on different jurisdictions (e.g. can researchers legally use this data to conduct research? both academic and industry researchers? can they release the models trained on this dataset?). The authors acknowledge this in the "ethical discussion" in appendix A3 (and other parts of appendix A): "it is impractical to obtain explicit consent from all content creators". I personally agree with this statement, but I think it should be mentioned in the main paper.
  • I would appreciate a table similar to Table 5, but comparing the authors' model trained only on LAION (for instance) and on OmniCorpus-CC, varying the total number of tokens (e.g. text tokens + image tokens after encoding). This would be a proxy for measuring the "quality" of both datasets, defined as "downstream accuracy that a token from the dataset provides". If the quality of the dataset proves to be relatively high, my soundness score would increase.
  • It seems that the authors worked on improving the support of languages beyond Chinese and English (e.g. line 237: "we [...] enhanced its capability to handle Chinese, Japanese and Arabic documents"), however they decided to include only Chinese and English documents in the end. This is a lost opportunity (and amount of work) to have a truly multilingual (and not bilingual) dataset.
  • As with all the other publicly available massive datasets, only the image URLs are provided, which may hinder the reproducibility of experiments conducted using the dataset over time.

Questions

  • How was the set of "Chinese Websites" decided?
  • I assume that the frequencies that appear in Section 4 were obtained by manual inspection of the 200 randomly sampled documents by the authors, is that correct? Or were they shipped to external evaluators?
Comment

Q3: Lost opportunity for more languages.

A3: In our project, human filtering is essential for quality assurance. Our data processing workers are currently only able to read and process Chinese and English, two of the most widely used languages. Hence, we process bilingual content for our dataset. The paragraph in question describes improvements to main body extraction, which indeed allows for more effective handling of multilingual documents. We hope to encourage researchers fluent in additional languages to further extend the filtering functions in future work. For this reason, we retained these extraction improvements to support further multilingual expansion.

Q4: About Image URL.

A4: Providing image files would significantly increase the cost of open-sourcing the dataset. Therefore, we follow common practices (e.g., OBELICS[1], LAION[2], and MMC4[3]) by representing images as URLs.

Q5: How was the set of "Chinese Websites" decided?

A5: We carefully selected Chinese websites from multiple legitimate and publicly accessible sources including platforms with clear Creative Commons agreements or similar open-content policies, public news media platforms, open Chinese article websites, and so on (e.g., Chinese Wikipedia, the Chinese Basic Corpus, and Chinese news platforms). We strictly avoided data with explicit restrictions on usage. This includes content containing personal information or privacy-related data (e.g., social media platforms like Weibo or online health question-answering sites) and content with strict copyright protections (e.g., CNKI or commercial databases). Our selection was guided by usage policies, ensuring all collected data will be legally and openly available for research purposes.

Q6: The qualitative assessment is obtained from whom?

A6: The qualitative assessment of the 200 samples was obtained by external evaluators.

Reference:

[1] Obelics: An open web-scale filtered dataset of interleaved image-text documents

[2] LAION-5B: An open large-scale dataset for training next generation image-text models

[3] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

[4] Visual Instruction Tuning

[5] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

[6] Towards VQA Models That Can Read

[7] Microsoft COCO Captions: Data Collection and Evaluation Server

[8] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

[9] CIDEr: Consensus-Based Image Description Evaluation

Comment

Q3-Q6: Thanks for your clarifications. They adequately addressed my concerns, so they are not a limiting factor for increasing my score.

Comment

We sincerely appreciate your valuable and constructive reviews.

In this response, we provide more implementation details and results of an additional experiment to strengthen the soundness.

To make efficient use of the extended discussion period and potentially incorporate more data, we transitioned to a more computationally efficient model architecture, OpenFlamingo-3B[1].

Implementation details: We keep the average total number of tokens per batch consistent across datasets. Specifically, we first calculate the average effective total token count of each dataset on OpenFlamingo-3B (using the function provided below). The inverse ratio of these values determines the proportion of samples per batch (for rounding convenience, we allowed LAION-en-2B[2] to slightly exceed in token count). Finally, we collect 2,048 OmniCorpus-CC documents or 22,016 LAION-en pairs per step and train for 50k steps (equivalent to approximately 102M documents or 1.1B pairs), respectively. The model architecture uses CLIP ViT-L-14[3] and MPT-1B[4] as the backbone. The cross-attention interval is set to 1, and the learning rate is fixed at 1e-4[1]. Only the parameters of the cross-attention modules and the perceiver resampler are trainable, while all text embeddings (including the special tokens "<image>" and "<|endofchunk|>") are frozen. We set the warm-up steps to 2,000. Furthermore, we leverage DeepSpeed ZeRO-1 and BF16 to accelerate training.

def get_openflamingo_num_total_tokens(N_img: int, txt_list: list[str], max_tokens: int, tokenizer):
    num_tokens_single_eoc = len(tokenizer.tokenize("<|endofchunk|>"))
    num_tokens_single_image = len(tokenizer.tokenize("<image>"))
    # each sentence is like "<bos>...<|endofchunk|><eos>"
    num_bos_eos_eoc_tokens = 2 + num_tokens_single_eoc
    num_total_image_related = (
        # the first image only has one <image>
        min(1, N_img) * num_tokens_single_image + 
        # the other images have <|endofchunk|><image>
        max(0, N_img - 1) * (num_tokens_single_eoc + num_tokens_single_image)
    )
    num_total_text = sum([len(tokenizer.tokenize(t)) for t in txt_list])
    return min(max_tokens, num_bos_eos_eoc_tokens + num_total_image_related + num_total_text)
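
As a sketch of how such per-sample counts translate into the batch proportions above (the sample format, dataset iterators, and the rounding step are assumptions, since the averaging script itself is not shown):

def average_tokens(samples, tokenizer, max_tokens: int = 2048) -> float:
    # samples: iterable of (num_images, list_of_text_chunks) tuples -- assumed format
    totals = [get_openflamingo_num_total_tokens(n_img, txts, max_tokens, tokenizer)
              for n_img, txts in samples]
    return sum(totals) / len(totals)

# avg_doc  = average_tokens(omnicorpus_cc_samples, tokenizer)   # hypothetical iterators
# avg_pair = average_tokens(laion_en_samples, tokenizer)
# pairs_per_step ~= 2048 * avg_doc / avg_pair, rounded to a convenient batch size,
# which the authors report as 22,016 LAION-en pairs per 2,048 OmniCorpus-CC documents.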
Comment

Q1: Concern on Terms of Use.

A1: Thanks for your valuable comments regarding the compliance and ethical considerations of our dataset.

In the revised manuscript, we have added clarifications regarding the terms of use to the 'Introduction' section.

The collection of OmniCorpus was conducted with strict adherence to the terms of use (ToU) of the data sources. We followed established practices (e.g., OBELICS[1], LAION[2], and MMC4[3]) to exclude websites that prohibit data usage (e.g., employing the Spawning API to respect consent decisions). Furthermore, we applied additional manual filtering to exclude high-risk domains, especially for Chinese websites. For instance, although some platforms (e.g., online health question-answering sites) are publicly accessible, we excluded them due to potential ethical and privacy risks. Additionally, parts of the data source originate from existing datasets (e.g., we annotate text for videos from established datasets in OmniCorpus-YT) whose compliance has been rigorously validated in prior work.

To address user concerns about compliance, we establish our ToU aligned with commonly accepted standards (such as ToU of OBELICS[1]). Users (whether from academia or industry) are expected to follow the CC-BY ToU and adhere to the ToU of the data sources (which are generally covered by the former). It is also required that any derived datasets or models should disclose the use of OmniCorpus for transparency. While we do not differentiate terms across jurisdictions, we encourage users to adapt the applicability of our dataset under their local legal frameworks.

We appreciate your thoughtful suggestions, which have greatly contributed to improving the clarity and ethical transparency of our work. We will also update the manuscript and dataset repository to ensure that these guidelines are easily accessible to all users. Thank you again for helping us make OmniCorpus more robust and responsible.

Q2: Comparison experiment of LAION data and our data.

A2: Thank you for your suggestion. We conduct an experiment to compare LAION[2] and OmniCorpus-CC while aligning the total number of tokens.

Specifically, we construct a LAION subset whose total number of valid tokens aligns with that of the 1M OmniCorpus-CC subset. To keep the comparison sound, only tokens within the maximum token length of the language model are considered valid. This process yields approximately 2.4M valid image-text pairs after filtering out images with invalid URLs or extremely small sizes (less than 10 pixels). We use the same LLaVA[4] architecture for pretraining on both subsets.
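For reference, a minimal sketch of the two selection rules described above; the field names and the per-side reading of the 10-pixel limit are assumptions.

def count_valid_tokens(caption: str, tokenizer, max_len: int) -> int:
    # Only tokens that fit within the language model's context window count as valid.
    return min(max_len, len(tokenizer.tokenize(caption)))

def keep_pair(url_reachable: bool, width: int, height: int) -> bool:
    # Drop pairs with invalid image URLs or extremely small images (< 10 pixels on a side).
    return url_reachable and min(width, height) >= 10

# Pairs are accumulated until their summed count_valid_tokens(...) matches the total
# valid token count of the 1M-document OmniCorpus-CC subset (about 2.4M pairs in total).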

The results are exhibited in the following table.

#Shot         0      1      2      4
LAION only    39.4   51.1   53.7   55.2
Ours only     28.4   48.3   54.4   58.7

(Each score represents the average of OKVQA[5]&TextVQA[6] accuracies and COCO[7]& Flickr30k[8] CIDEr[9] scores.)

The comparison reveals that each token from OmniCorpus-CC yields superior accuracy in the 2- and 4-shot settings, while each token from the LAION subset performs better in the 0- and 1-shot settings.

The superior performance of the LAION subset in the 0&1-shot settings can be attributed to its much higher image-to-text ratio. Since the LAION subset contains significantly more images compared to ours, it enhances performance in scenarios providing less context.

In contrast, ours outperforms LAION subset in the 2&4-shot settings, highlighting the higher quality and in-context learning potential of our documents. The richer contextual associations in our native interleaved documents allow the model to better understand and utilize the provided examples, resulting in improved performance as the number of shots increases.

We appreciate your suggestion, as it allowed us to demonstrate the strengths and trade-offs of both datasets more clearly. This comparison further underscores the value of OmniCorpus-CC for tasks that rely on in-context learning capabilities.

Comment

Q1: Thank you very much for the clarification. I think that the authors have done everything in their hands to provide the data under a fair ToU.

Q2: The experiment that you ran is not exactly what I asked for.

You took a subset of LAION "whose total valid tokens number aligns with that of the 1M OmniCorpus-CC subset", trained a model in each subset, and then compared the quality of the two models under different few-shot scenarios (including zero-shot).

However, these two datasets are way bigger than 1M. So how does the comparison look if we use 10M-equivalent subsets? What about 100M or 1B? This sort of comparison is very interesting because it tells the potential user which dataset provides more bits per token under different training budgets. And most likely, potential users are interested in the larger data regimes.

I understand that this sort of experiment is quite compute-intensive, and might not be feasible to run it (I'll take that into account for my final decision), but it would highly influence the final score if one could show that OmniCorpus consistently outperforms LAION in terms of quality/token.

Comment

Experimental results: The table below summarizes the comparative results. In most cases, the model trained on OmniCorpus-CC outperforms the one trained on LAION-en-2B[2]. This performance gap is more pronounced compared to the 1M-scale experiment with LLaVA-1.5[5]. We attribute this to differences in model architecture and the foundational models' language capabilities. OpenFlamingo-3B[1], which employs MPT-1B[4], a relatively small-scale model, benefits significantly from the richer textual content in OmniCorpus-CC’s interleaved image-text data. In contrast, LLaVA-1.5-7B[5], built on Vicuna-1.5[6], already possesses strong contextual reasoning capabilities. For LLaVA-1.5, effective alignment in the multimodal fine-tuning stage suffices to yield improvements, even with noisier datasets like LAION-en-2B.

Scale                        Data    f0     f4     f8     c0     c4     c8     o0     o4     o8     t0     t4     t8
500 steps / 250M tokens      LAION   1.85   2.64   3.92   1.22   1.48   1.88   5.44   6.07   7.27   3.09   3.68   3.70
(1m docs / 11m pairs)        Ours    2.66   3.69   4.62   2.53   4.84   5.65   12.40  14.01  14.68  6.60   6.93   7.27
2500 steps / 1.25B tokens    LAION   4.50   12.61  17.26  4.96   14.85  18.19  5.93   4.22   2.87   3.68   3.62   3.66
(5m docs / 55m pairs)        Ours    9.01   13.24  13.06  14.47  29.79  28.67  15.91  16.63  17.15  6.60   6.90   7.05
5k steps / 2.5B tokens       LAION   3.92   15.58  20.07  5.18   20.06  26.46  5.57   5.21   4.20   4.91   4.39   4.88
(10m docs / 110m pairs)      Ours    19.17  28.38  28.90  34.75  52.58  54.76  18.41  19.94  20.76  8.42   8.78   9.08
10k steps / 5B tokens        LAION   7.12   24.76  26.55  9.78   36.14  37.05  6.68   5.87   5.05   6.12   5.89   5.44
(20m docs / 220m pairs)      Ours    26.50  37.68  37.35  39.25  63.39  67.84  21.73  24.41  24.46  12.01  12.63  12.43
15k steps / 7.5B tokens      LAION   5.83   21.97  24.63  10.05  26.51  24.23  6.57   4.63   2.98   7.31   4.83   4.60
(30m docs / 330m pairs)      Ours    26.67  37.35  36.19  38.26  65.28  70.41  23.10  25.13  25.56  14.05  14.16  14.25
20k steps / 10B tokens       LAION   4.78   20.52  26.15  11.84  20.03  21.44  7.40   4.09   2.86   7.28   4.14   3.72
(40m docs / 440m pairs)      Ours    28.68  35.57  40.12  44.02  58.02  67.83  21.02  23.60  25.56  14.35  15.51  15.70
25k steps / 12.5B tokens     LAION   9.31   26.93  27.16  16.25  28.62  24.94  7.93   4.91   4.60   8.91   4.62   4.14
(50m docs / 550m pairs)      Ours    28.40  34.20  37.88  43.93  57.29  68.25  22.66  25.35  26.92  15.61  16.63  16.50
30k steps / 15B tokens       LAION   6.65   23.92  28.69  13.27  7.40   8.24   8.62   4.76   3.86   7.72   4.86   4.07
(60m docs / 660m pairs)      Ours    26.83  37.76  42.04  45.05  64.63  72.08  24.18  26.20  27.59  16.24  17.02  17.09
35k steps / 17.5B tokens     LAION   5.85   24.71  27.08  11.49  16.88  4.75   8.17   5.62   5.72   9.08   4.22   5.12
(70m docs / 770m pairs)      Ours    25.50  33.45  40.06  34.91  50.47  65.15  23.91  27.67  28.72  16.64  17.72  18.06
40k steps / 20B tokens       LAION   10.95  29.32  28.46  16.46  28.96  27.15  7.38   5.33   5.34   9.51   6.81   7.05
(80m docs / 880m pairs)      Ours    24.00  36.51  43.27  38.33  58.67  72.24  24.64  26.97  28.75  16.44  17.80  17.10
45k steps / 22.5B tokens     LAION   5.23   24.85  30.42  12.21  27.72  8.89   5.74   3.31   2.77   7.07   5.63   5.53
(90m docs / 0.99B pairs)     Ours    33.24  35.42  44.07  49.75  47.99  55.10  25.21  27.33  28.97  15.95  18.08  18.41
50k steps / 25B tokens       LAION   9.82   29.33  28.84  7.59   13.42  3.09   4.98   4.74   4.87   8.06   5.11   5.50
(100m docs / 1.1B pairs)     Ours    34.28  39.97  45.79  56.28  58.06  62.38  24.73  27.81  29.70  17.07  19.44  19.88

("f" indicates Flickr30k[7], "c" indicates COCO[8], "o" indicates OK-VQA[9], "t" indicates TextVQA[10].)

Comment

Due to the significantly larger text token count of the nearly 1B-scale OmniCorpus-CC dataset compared to the entire LAION-2B, we were unable to complete the comparison at the 1B scale. For future experiments involving larger datasets and models, we kindly ask for the reviewers' understanding, as we have fully utilized the available resources during the extended discussion period. We have rigorously validated the data quality at multiple reasonable scales and believe that the conclusions drawn from these experiments are extensible to even larger datasets and models.

Reference:

[1] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

[2] LAION-5B: An open large-scale dataset for training next generation image-text models

[3] Learning Transferable Visual Models From Natural Language Supervision

[4] MPT-1B

[5] Improved Baselines with Visual Instruction Tuning

[6] Vicuna 1.5

[7] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

[8] Microsoft COCO Captions: Data Collection and Evaluation Server

[9] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

[10] Towards VQA Models That Can Read

AC Meta-Review

The paper proposes a very large dataset for MLLM research, including video. Its size, diversity and benchmarking form a significant contribution. Some concerns were raised about the ethics of the data collection and some aspects of the experiments, but seem well addressed. Three reviewers are strongly in favor of acceptance, and one weakly.

Additional Comments on Reviewer Discussion

Reviewers engaged in discussion with the authors to a sufficient extent

Final Decision

Accept (Spotlight)