Meta CLIP 2: A Worldwide Scaling Recipe
We generalize CLIP training to worldwide web scale, achieving +0.8% over the English-only counterpart on zero-shot ImageNet classification (no compromise) and SoTA zero-shot multilingual results: 57.4% on CVQA and 50.2% on Babel-ImageNet.
Abstract
Reviews and Discussion
- The authors have advanced research on multilingual CLIP, curated worldwide data, and trained standard CLIP towers, ensuring the models' effectiveness and ease of use.
- They provide a detailed description of their data curation pipeline.
Strengths and Weaknesses
In my opinion, this article is of great significance. In recent years, vision-language models have consistently focused on a few languages spoken by very large numbers of users. Scaling them to a worldwide level can help artificial intelligence move toward globalization and acquire broader world knowledge.
- Findings about the curse of multilinguality (multilingual CLIP models get worse English performance than English-only CLIP).
- A detailed data pipeline; the pseudocode is clear.
- The experiment is well organized.
Weaknesses:
- It is still not well explained why ViT-H breaks the curse of multilinguality while ViT-L suffers from it. Is this wholly because of model size? Is there a threshold for judging whether a model suffers from the curse?
- Due to the curse of multilinguality, the paper mainly focuses on large models. However, ViT-L is still used by many downstream models (such as many VLMs like LLaVA).
Questions
- Is the multilingual data currently collectable by Common Crawl sufficiently representative? I assume that even if we do not discard the non-English parts of the data collected from Common Crawl, the non-English content may still have significant bias. I would like to ask the authors how they view this situation.
- It seems challenging to treat data in different languages equally. The population gaps between different languages are substantial, and the amount of data produced is highly correlated with factors such as population and regional economy. This seems to form a long-tail phenomenon at the multilingual level. I would like to ask the authors how they view the inequality (long tail) of corpora across different languages.
Limitations
- It seems that due to considerations of collection difficulty, policies, and licenses, the authors chose data sources such as Common Crawl (consistent with the multilingual version of LAION). However, I believe there is a large amount of data that is not included in Common Crawl but better reflects the characteristics of the worldwide context. Much data from platforms actually used by users in their own countries remains unadopted.
- I am not sure whether the authors have had the opportunity to communicate with users of other languages. I am concerned that the current metrics may not be sufficient to directly reflect how users of other languages feel about the model. When I conduct data research myself, I feel anxious that the members of our team can speak no more than three languages. I wonder whether the authors can see the changes the proposed model brings in other languages beyond the hard metrics (I think that, especially for long-tail languages, the changes brought by the worldwide recipe should be quite significant).
Justification for Final Rating
I keep the score and confidence. The authors have resolved all questions, including data sources, evaluations, and the details about 'the curse of multilinguality'. Besides, they share their opinions about language bias.
Formatting Issues
none
We sincerely appreciate your thorough feedback and thoughtful comments, including recognizing that 'this article is of great significance,' that 'promoting scaling to a worldwide level can help artificial intelligence move toward globalization and acquire broader world knowledge,' and that 'the experiment is well organized.' Below, we address the profound questions you raised:
“It is still not well explained why ViT-H breaks the curse of multilinguality while ViT-L suffers from it. ... (removed for length limit)”
Yes, it is because the model capacity cannot hold the added information from the non-English training distribution. To predict when and whether a model suffers from the curse, we conducted a “tiny” scaling-law experiment. With a smaller model, we gradually increase the training data size and search for the first size at which the model's learning saturates (i.e., the model trained with English and Worldwide data yields the same performance on ImageNet if not yet saturated, or the curse starts appearing once saturated). We found this size to be ~800M Worldwide pairs (or ~320M English pairs) for ViT-B/16 (~86M parameters). Combined with our previous experiments, where we found ViT-L/14 (~307M) saturates at around 2.5B/1B Worldwide/English pairs, we roughly fit these data points to a straight line for the “scaling law” and obtain a rough estimate of ~8 pairs per parameter to break the curse. This correctly predicts that when we scale the training data to match frontier efforts (i.e., 2B English pairs, and correspondingly 5B Worldwide), ViT-H/14 (~632M) is needed. Note this estimate is very rough (only two data points fit to a line) due to the time constraints of the rebuttal, is specific to our experimental setup, and depends on factors such as data processing, model architecture, and training schedule. All in all, this empirical study explains why the curse of multilinguality impacts smaller models more severely: it reflects a scaling constraint, where the model lacks sufficient capacity to fully absorb and generalize from the multilingual signals present in large-scale training data.
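As a back-of-the-envelope illustration of the fit described above (not additional results), the sketch below puts the two reported saturation points (ViT-B/16 at ~800M Worldwide pairs, ViT-L/14 at ~2.5B Worldwide pairs) on a line and reads off the implied pairs-per-parameter ratio and the capacity implied for a ~5B-pair Worldwide run; the numbers are the ones quoted in this response.

```python
import numpy as np

# Saturation points quoted above: (model parameters, Worldwide pairs at saturation).
params = np.array([86e6, 307e6])   # ViT-B/16, ViT-L/14
pairs = np.array([800e6, 2.5e9])   # Worldwide pairs where learning saturates

# Fit pairs ~= a * params + b through the two points.
a, b = np.polyfit(params, pairs, deg=1)
print(f"slope ~= {a:.1f} Worldwide pairs per parameter")  # roughly 8 pairs/parameter

# Capacity implied for a frontier-scale run (~5B Worldwide pairs).
target_pairs = 5e9
needed_params = (target_pairs - b) / a
print(f"predicted capacity ~= {needed_params / 1e6:.0f}M parameters")  # close to ViT-H/14 (~632M)
```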
“Due to the curse of multilinguality, the paper mainly focuses on large models. ... (removed for length limit)”
Our primary goal is to study the existence of the curse of multilinguality and identify a critical point at which the curse can be broken. Our work serves as a proof of concept using the standard OpenAI CLIP training setup, where English data has already reached a scale of over 2 billion samples. This work offers valuable guidance and new opportunities for future research: (1) it suggests avoiding training smaller models with multilingual data from scratch due to limited model capacity; and (2) it offers a ViT-H/14 model that no longer suffers from the multilinguality curse, enabling alternative approaches such as model distillation to effectively train smaller models (e.g., we quickly put together an experiment to distill a B/32 from the H/14 during this rebuttal, and our initial result shows the distilled B/32 achieves 65.5%+ on ImageNet, compared to 64.7% when training B/32 from scratch). Releasing our model and data algorithms is the 0-to-1 step to unlock all these future directions.
Also note that we adopt the standard OpenAI CLIP training setup in our recipe to explore the critical point so that our findings are generalizable. This training setup is certainly not the most compact use of model capacity, so it is very likely that more efficient learning algorithms and model architectures can be developed from our work, and that the curse can then be broken at smaller model sizes.
“Is the multilingual data currently collectable by Common Crawl sufficiently representative? ... (removed for length limit)”:
One of our major goals is to promote open research, and selecting a CommonCrawl-based data pool enables the community to reproduce and compare results using open models. Similar to our selection of the OpenAI training setup, we acknowledge that our CommonCrawl data pool is not optimal in terms of language and content biases (e.g., the CommonCrawl crawler may be built mostly by developers from the Western world, and the popularity of content in different languages may be related to social or economic factors that are not ideal for models to pick up). We believe this is an important question from which multiple valuable research projects can grow. For example, crawlers or data sources other than CommonCrawl can be used to augment the data pool and control/mitigate the biases introduced by CommonCrawl. However, doing so adds non-trivial complexity in science, engineering, legal compliance, etc., and raises the barriers to open research. We consider our open model and data algorithms a baseline and foundation to spark such research.
“It seems challenging to treat data in different languages equally. ... (removed for length limit)”
This is a great question. During the pre-training stage, the target is to enable models to learn general capabilities and comprehensive knowledge, and our goal here is to present a proof-of-concept recipe for breaking the curse of multilinguality. Thus, we design our algorithm so that the language distribution in the pre-training data is minimally manipulated and reflects the true language distribution observed in the real world (e.g., the ratio of English vs. non-English speakers or occurrences of examples). Although this observation is approximated by web traffic, which is not ideal, we believe such scaling with the web introduces minimal inductive bias, helps resolve other biases (such as the socio-economic ones you pointed out), and enables the pre-trained model to learn general capabilities. See our response to the next question for more related thoughts.
Our data distribution may not be optimal for long-tailed languages, which are affected by inequality, bias, or other factors as you indicated. We believe such skew is better handled during fine-tuning (see the example experiment described below). Meanwhile, a broader challenge is that current benchmarks may not fully capture the bias or reflect the true importance of languages in the long tail (see more discussion in the section "Limitation on Benchmark" in the supplementary materials).
To show the feasibility of handling the inequality and long-tailed languages with fine-tuning, we design an experiment in which we take the pre-trained MetaCLIP worldwide model and further fine-tune it on a dataset emphasizing long-tailed languages (we downsample the top-10 languages in the training set to the same portion as the 11th language; a minimal sketch of this scheme follows the table below). With just 5k steps of fine-tuning, we see the following boost on low-resource languages from CVQA:
| Language | Before | After |
|---|---|---|
| Bulgarian | 57.1 | 58.5 |
| Malay | 62.9 | 67.6 |
| Igbo | 34.0 | 42.0 |
| Oromo | 29.4 | 36.4 |
| Hindi | 66.2 | 70.6 |
| Sinhala | 43.1 | 46.7 |
| Mongolian | 33.3 | 34.3 |
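For concreteness, below is a minimal sketch of the downsampling scheme described above; it assumes per-language pair counts are available in a dictionary, and the names (`counts`, `sampling_weight`) and toy numbers are illustrative rather than part of the released code.

```python
# Cap the top-10 languages at the 11th language's share of the fine-tuning pool.
# `counts` maps language code -> number of image-text pairs (toy numbers for illustration).
counts = {"en": 10_000, "es": 4_000, "fr": 3_500, "de": 3_000, "ru": 2_800,
          "zh": 2_500, "ja": 2_200, "pt": 2_000, "id": 1_800, "vi": 1_600,
          "bg": 1_200, "ms": 900, "ig": 300}

ranked = sorted(counts, key=counts.get, reverse=True)
cap = counts[ranked[10]]  # pair count of the 11th most frequent language

# Per-language keep probability: languages above the cap are downsampled to it.
sampling_weight = {lang: min(1.0, cap / counts[lang]) for lang in counts}
print(sampling_weight)  # e.g., en is kept with probability 0.12, bg and below are fully kept
```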
“I am not sure whether the authors have had the opportunity to communicate with users of other languages. ... (removed for length limit)”
That's a very good point. As noted in our supplementary material, we are concerned that many current benchmarks rely on translating English data into non-English languages, which fails to capture the cultural diversity inherent in native non-English data and thus limits a full evaluation of our model's capabilities. We believe establishing a more comprehensive benchmark for long-tail languages is much needed. While creating such benchmarks is beyond the scope of this work, our pre-training recipe is designed with minimal assumptions (which are usually introduced through hand-crafted or model-based filtering mechanisms) to avoid introducing bias. We also believe our work encourages the community to build more accurate benchmarks for long-tail languages and facilitates development for downstream use cases in different languages by fine-tuning our pre-trained models.
We also appreciate your sharing the anxiety about researcher bias; we had a similar feeling from our previous experience with domain biases (e.g., in designing data curation algorithms, we explored many assumptions to better cover vast and diverse human knowledge, from one extreme of enumerating data sources to diversify the domains in the data, to another where a "no filtering" philosophy was adopted [1]). We learned the bitter lesson that it is often better for researchers to have the least involvement in assumption making, which introduces inductive biases of particular interests, and to leave de-biasing and the intelligence boost to scaling (i.e., the wisdom of the crowd). We found that in many cases, proper scaling itself automatically resolves most biases. Thus, not knowing all languages gives researchers the advantage of avoiding methods designed for particular languages and of focusing effort on scaling. In contrast, building benchmarks requires more domain expertise and is more vulnerable to such biases. That is why we call for the community to invest more in evaluations.
[1] Pouget, Angéline, et al. "No filter: Cultural and socioeconomic diversity in contrastive vision-language models." Advances in Neural Information Processing Systems 37 (2024).
Thanks for the authors' response. I currently have no more question.
We are grateful for your time and consideration in reviewing our rebuttal. Your engagement is truly appreciated.
This paper presents MetaCLIP 2, a comprehensive "recipe" for training Contrastive Language-Image Pre-training (CLIP) models on web-scale, worldwide (i.e., multilingual) data from scratch. The authors identify and address two key challenges in scaling CLIP beyond English:
- The lack of a transparent, scalable data curation method for non-English data.
- The "curse of multilinguality," where adding non-English data often degrades the model's performance on English-centric benchmarks.
The proposed recipe consists of three main innovations:
- Worldwide Metadata: Scaling the metadata used for curation from English-only sources (WordNet, Wikipedia) to their multilingual equivalents, covering over 300 languages.
- Worldwide Curation Algorithm: A language-aware curation process that performs per-language substring matching and balancing. A key contribution here is a method to dynamically determine a balancing threshold (t_lang) for each language to ensure a consistent ratio of head-to-tail concepts across all languages.
- Worldwide Training Framework: A training methodology that scales the number of seen image-text pairs proportionally to the increase in data size from non-English sources and identifies a "minimal viable model capacity" (ViT-H/14) required to overcome the curse of multilinguality.
Through extensive experiments, the authors demonstrate that their full recipe not only mitigates the curse of multilinguality but reverses it, showing that English and non-English data can be mutually beneficial. The resulting MetaCLIP 2 ViT-H/14 model surpasses its English-only counterpart on ImageNet and sets new state-of-the-art results on several challenging multilingual benchmarks, all without relying on private data, machine translation, or model distillation.
Strengths and Weaknesses
Strengths:
- Significance and Timeliness of the Problem: The paper tackles a highly relevant and critical problem. As English-language data for training foundation models approaches exhaustion, developing principled methods for leveraging the vast, multilingual web is essential for future scaling. This work provides a concrete and reproducible path forward.
- Principled and Transparent Methodology: The "recipe" framing is a major strength. Instead of an ad-hoc collection of tricks, the authors present a systematic, step-by-step approach built upon the well-understood MetaCLIP framework. The commitment to using only public data (Common Crawl) and avoiding black-box filters or distillation is a significant contribution to transparency and reproducibility in the field. The method for deriving language-specific balancing thresholds (t_lang) is particularly clever and well-motivated.
- Rigorous and Comprehensive Experimentation: The experimental validation is exceptionally thorough.
- Clarity and Presentation: The paper is very well-written and easy to follow.
Weaknesses:
- The "Curse" is Broken Only at a Very Large Scale: The pivotal finding of the paper is that the curse of multilinguality is an issue of scale. However, this breakthrough is only achieved with a ViT-H/14 model, which is computationally very expensive and inaccessible to most of the academic community. The fact that the curse persists even with a ViT-L/14 model (the largest size in the original CLIP paper) is a significant limitation. While the authors are transparent about this, it does temper the immediate practical applicability of the "recipe" for those with limited resources.
- Complexity of the Full Recipe: While the individual components are well-explained, the full recipe is non-trivial to implement. It requires constructing massive multilingual metadata, running language identification, performing efficient large-scale substring matching, and managing a complex data balancing scheme across hundreds of languages. The best-performing tokenizer also has a very large vocabulary (900k), which adds to the memory and computational overhead. While the authors promise to release code, the barrier to entry for reproducing or adapting this work remains high.
Questions
- The finding that ViT-H/14 is the "inflection point" to break the curse is fascinating. Could you speculate further on the underlying reasons? Is it purely a matter of parameter count for storing diverse linguistic and visual concepts, or might it be related to other architectural properties of larger ViT models? Have you considered if alternative strategies, such as extending the training duration for the ViT-L/14 model, could potentially yield similar benefits?
- The proportional scaling of seen pairs (2.3x) appears to be a crucial ingredient in the recipe. This is justified by maintaining the number of seen English pairs. Was this scaling factor ablated? It would be insightful to know if there's a sensitivity to this hyperparameter—for example, if a 1.5x scaling is insufficient or if a 3.0x scaling offers diminishing returns or even harm.
- Regarding the multilingual tokenizer (Table 3), the 900k-token XLM-V performs best. This is quite large compared to standard tokenizers. Could you comment on the performance drop if a more moderately sized (e.g., ~250k) but still robust multilingual tokenizer was used with the final ViT-H/14 setup? How critical is this massive vocabulary to the overall success of the recipe?
Limitations
yes
Justification for Final Rating
I have carefully considered the paper and the authors' rebuttal. The authors have thoughtfully addressed my concerns. After thorough evaluation, I have raised my recommendation to Accept (5).
Formatting Issues
no
We sincerely appreciate your thorough review and recognition of our work in areas such as 'Significance and Timeliness of the Problem', 'Principled and Transparent Methodology', and 'Rigorous and Comprehensive Experimentation'. Below, we address the insightful questions you raised:
“The "Curse" is Broken Only at a Very Large Scale: The pivotal finding of the paper is that the curse of multilinguality is an issue of scale. ... (removed for length limit)”
Our primary goal is to study the existence of the curse of multilinguality and present a proof of concept showing the curse can be broken with a common and standard setting in the community (e.g., the OpenAI CLIP training setup and 2+ billion English training samples). ViT-H/14 marks the first critical point found to break the curse. We chose this standard setup for fair comparison across existing data algorithms and for the generalizability of our findings. With the critical point we found, this work offers important insights and new opportunities for future research: (1) scaling a multilingual CLIP from scratch with small model capacity is impractical at the existing training scale for English; and (2) the work offers a ViT-H/14 model that no longer suffers from the multilinguality curse, enabling alternative and more practical approaches such as model distillation to effectively train smaller models (e.g., we quickly put together an experiment to distill a B/32 from the H/14 during this rebuttal, and our initial result shows the distilled B/32 achieves 65.5%+ on ImageNet, compared to 64.7% when training B/32 from scratch). Releasing our model and data algorithms is the 0-to-1 step to unlock all these future directions.
We also want to point out that the standard OpenAI CLIP training setup is not the most compact use of model capacity, so it is very likely that more efficient learning algorithms and model architectures can be developed from our work, and the curse may be broken at smaller model sizes.
“Complexity of the Full Recipe: While the individual components are well-explained, the full recipe is non-trivial to implement. ... (removed for length limit)”
This is a great point. The scaling from intra-language (e.g., English only) to inter-language (worldwide) naturally introduces non-trivial complexity. One of our major contributions is to significantly reduce this complexity with a minimal recipe for this scaling and to make it transparent and accessible to the research community, so that the entry barrier for multilingual CLIP is greatly reduced. Specifically, we will release our preprocessed metadata so that the community can avoid re-processing large corpora like Wikipedia; we adopt the Aho–Corasick algorithm to accelerate substring matching, which is a preprocessing computation bottleneck, by over 2000 times, and we open-source our implementation. Furthermore, by including non-English data in CLIP training, we offer a compact solution that trains only one model with a 1.3x increase over its original data volume while achieving competitive performance across 300+ languages. On the other hand, this work is part of the recent community endeavor of building foundation models (e.g., Llama, Qwen, SAM, SigLIP). Developing data for foundation models is intrinsically complicated. For example, several components in our algorithm, such as language identification (LID), are common across other foundation models, even English-only CLIP or LLMs.
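As a rough illustration of the substring-matching step (a sketch, not the released implementation), the snippet below uses the `pyahocorasick` package to match a metadata pool against alt-texts in a single pass per text; the entries and alt-text are toy placeholders.

```python
# pip install pyahocorasick
import ahocorasick

# Toy metadata entries (in practice: millions of multilingual WordNet/Wikipedia entries).
metadata = ["dog", "golden retriever", "tour eiffel", "犬"]

automaton = ahocorasick.Automaton()
for idx, entry in enumerate(metadata):
    automaton.add_word(entry, (idx, entry))
automaton.make_automaton()  # builds the Aho-Corasick trie with failure links once

def match_entries(alt_text: str) -> set:
    """Return all metadata entries occurring as substrings of `alt_text`."""
    return {entry for _, (_, entry) in automaton.iter(alt_text.lower())}

print(match_entries("A golden retriever dog in front of the Tour Eiffel"))
# {'dog', 'golden retriever', 'tour eiffel'}
```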
Exploring the choice of tokenizers (Table 3 in the main paper) and later understanding its limitations (in the appendix) is also important for reducing complexity. We believe distilling smaller tokenizers, which is also discussed in our limitations section in the appendix, can greatly improve compactness while maximally preserving performance. Our release of model weights will greatly accelerate community development on this front.
Lastly, in contrast to the recent trend where big labs turn private (e.g., releasing only model APIs or being opaque about their implementations), we believe our open-source efforts, including models, code, and intermediate artifacts, significantly reduce the inherent complexity of scaling to worldwide data. By making our data and methods available now, we not only enable current research but also envision working with the community hand in hand on other critical challenges, including complexity.
“The finding that ViT-H/14 is the "inflection point" to break the curse is fascinating. ... (removed for length limit)”
Yes, the major root cause of the curse is the parameter count, or model capacity. The more than doubled data volume after extending English-only data to worldwide data contains much more information (as you mentioned, diverse linguistic and visual concepts) than an L/14 can hold. Our finding is based on a rigorous ablation against the English counterpart trained with the standard OpenAI recipe, which is deliberately selected so that our conclusion is generalizable. Extending the duration of the worldwide model training alone deviates from the recipe and breaks the ablation: training with more English seen pairs in worldwide data should be compared with an English-only model that also sees more pairs, which would likely also achieve better performance. However, we do see that if we relax the constraints on the recipe (e.g., data distribution, model architectures, training setup), there should be room for efficiency, such as downsampling the English examples in the data to make room for non-English concepts, model pruning/distillation, more sophisticated learning-rate, training-schedule, or loss designs, or combinations of these. With such a setting, the curse might be broken at a smaller model size (e.g., the ViT-L/14 that is popular in the community), but the curse always exists: model capacity always kicks in to decide how much a model can learn.
Furthermore, there is a trend of a linear scaling law found in a "tiny" experiment we conducted. In the experiment, with smaller models, we gradually increase the training data size and search for the first size at which the model's learning saturates (i.e., the model trained with English and Worldwide data yields the same performance on ImageNet if not yet saturated, or the curse starts appearing once saturated). We found this size to be ~800M Worldwide pairs (or ~320M English pairs) for ViT-B/16 (~86M parameters). Combined with our previous experiments, where we found ViT-L/14 (~307M) saturates at around 2.5B/1B Worldwide/English pairs, we roughly fit these data points to a straight line for the "scaling law" and obtain a rough estimate of ~8 pairs per parameter to break the curse. This correctly predicts that when we scale the training data to match frontier efforts (i.e., 2B English pairs, and correspondingly 5B Worldwide), ViT-H/14 (~632M) is needed. Note this estimate is very rough (only two data points fit to a line) due to the time constraints of the rebuttal, is specific to our experimental setup, and depends on factors such as data processing, model architecture, and training schedule. All in all, this empirical study explains why the curse of multilinguality impacts smaller models more severely: it reflects a scaling constraint, where the model lacks sufficient capacity to fully absorb and generalize from the multilingual signals present in large-scale training data.
“The proportional scaling of seen pairs (2.3x) appears to be a crucial ingredient in the recipe. ... (removed for length limit)”
We did conduct such an ablation. We experimented with both 1.5x and 3.0x scaling for H/14, and observed that 1.5x just breaks the curse (cf. the English-only H/14 at 80.4 on ImageNet) while 3.0x consistently yields further gains:
| Scaling Proportion | ImageNet | Babel-IN | XM3600 | CVQA |
|---|---|---|---|---|
| 1.5x | 80.5 | 50.0 | 63.1 | 56.6 |
| 2.3x | 81.3 | 50.2 | 64.3 | 57.4 |
| 3.0x | 81.4 | 50.4 | 64.5 | 57.9 |
One thing to note is that this ratio is a constant determined by the English vs. non-English ratio in the data pool. We hold it constant to make the worldwide model comparable with the English model. Changing the ratio (as a hyperparameter) downsamples or upsamples English data and thus changes the English baseline. One can still tune it as a hyperparameter, as we showed here, based on downstream use cases to emphasize English or non-English performance; however, this reduces the generalizability that this paper aims to maintain.
“Regarding the multilingual tokenizer (Table 3), the 900k-token XLM-V performs best. ... (removed for length limit)”
This is a great question. We did start with moderately sized tokenizers but found suboptimal results. Since we work at worldwide scale (300+ languages), smaller tokenizers are not able to comprehensively cover all languages to derive representative multilingual text embeddings, and thus the model shows compromised results on a subset of languages. For example, we trained a ViT-H/14 with an mT5 tokenizer (250k tokens), which is used by mSigLIP. The English performance is the same as our ViT-H/14 with the 900k tokenizer, but the model drops in Babel-IN accuracy by 1.4% (48.8%) and on XM3600 by 2% (62.2%), and the drop is not consistent across languages. In fact, as mentioned above, we do see plenty of room to make our recipe more compact given our emphasis on generalizability and the standard OpenAI CLIP setting. So we do expect that one can distill more compact token embeddings and smaller tokenizers from our released model, and we will also release our findings on distilling smaller tokenizers in the final version of this work to show the feasibility of this path.
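For readers who want to inspect the vocabulary difference themselves, here is a small sketch that loads the two tokenizers discussed above from Hugging Face (assuming the `facebook/xlm-v-base` and `google/mt5-base` checkpoints, which are not part of this paper's release) and compares how they segment a non-Latin-script caption.

```python
from transformers import AutoTokenizer

# XLM-V (~900k tokens) vs. mT5 (~250k tokens); checkpoint names are assumptions.
xlmv = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
mt5 = AutoTokenizer.from_pretrained("google/mt5-base")

print("XLM-V vocab size:", xlmv.vocab_size)
print("mT5 vocab size:  ", mt5.vocab_size)

caption = "一只金毛犬坐在埃菲尔铁塔前"  # "a golden retriever sitting in front of the Eiffel Tower"
print("XLM-V tokens:", xlmv.tokenize(caption))
print("mT5 tokens:  ", mt5.tokenize(caption))
```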
Thank you for the detailed response. I truly appreciate the effort you've made in addressing my concerns, and I have no further questions at this time.
Thank you again for your time and for looking into our responses. Your review and suggestions are valuable for our work.
This paper presents a worldwide scaling approach for CLIP training, extending MetaCLIP's methodology from 500k english entries to 27M entries covering 300+ languages. Current CLIP models lack proper curation processes for non-English content or rely on models trained on private data sources. The authors address the curse of multilinguality in vision-language models and propose a curation framework for non-English data.
Strengths and Weaknesses
Strengths
- The paper addresses a relevant and timely problem. Current CLIP models lack proper curation processes for non-English content. The scale of the effort is substantial: expanding from 500k to 27M metadata entries represents significant engineering work that could benefit the community.
- The proposed curation algorithm extends MetaCLIP's head-tail balancing approach to the multilingual setting via a simple yet efficient language-specific threshold derived from English ratios.
- The identification of the curse of multilinguality is a noteworthy empirical finding.
Weaknesses
- The main limitation is the incremental nature of the contribution. While scaling from English to multilingual is valuable, the core methodology closely follows MetaCLIP without significant algorithmic innovations.
- The paper lacks a proper understanding of the curse of multilinguality (either theoretical or via scaling laws). The authors observe the phenomenon but provide limited analysis of why it occurs or how to predict when it breaks. An analysis of embedding spaces or language interference would strengthen the work.
- Although the naming of the paper and the scale of the experiments suggest that the paper is from Meta, it would have been preferable if the authors did not put their company's logo on their method (Figure 2), which is a breach of anonymity. Nevertheless, given the level of contribution, I'm leaning towards acceptance.
Questions
- The curse of multilinguality raises fundamental questions about the design choices of CLIP models. If multilingual training hurts performance at low scales, would language-specific CLIP models outperform a large unified one? This trade-off between specialization and generalization deserves deeper investigation.
- Is the curse of multilinguality also observed for mSigLIP? Or is it specific to the worldwide data proposed by the authors?
- The authors mention that the code and model will be released, but will the worldwide metadata and the entry probabilities/counts also be released?
Limitations
yes
Justification for Final Rating
The authors addressed all my concerns.
Formatting Issues
Not concerned
We sincerely appreciate your comprehensive review and recognition of our work as addressing 'a relevant and timely problem' and highlighting 'The curse of multilinguality raises fundamental questions about the design choices of CLIP models'. Below, we address the insightful questions raised in your comments:
“The main limitation is the incremental nature of the contribution. While scaling from English to multilingual is valuable, ... (removed for length limit)”
Thank you for sharing your thoughts. Our contribution is on scaling CLIP in a very different dimension: from intra-language (English-only CLIP) to inter-language (all languages on the web). The two dimensions are intrinsically orthogonal in underlying assumptions, core research questions, implementation principles and challenges, and applicable methodologies. Existing English-only recipes such as OpenAI CLIP or MetaCLIP are not applicable to the inter-language case. Through identifying the curse of multilinguality, we raise this fundamental question to the multimodal representation learning community and demonstrate that it can be addressed through a minimal yet effective scaling recipe. To the best of our knowledge, this is the first time this issue has been explicitly investigated and resolved within the research community.
We also provide a detailed description of our approach and will release multilingual artifacts, which is the first time in the community, to benefit future academic research and reduce the engineering burden. Such emphasis on detail and transparency stands in contrast to a growing trend where large labs with substantial resources adopt increasingly opaque practices (especially with regard to data). We regard our work and openness not just as a practice, but a meaningful scientific contribution enabling future scaling of CLIP and advocating open research.
"The paper lacks a proper understanding of the curse of multilinguality (either theoretical or via scaling laws). ... (removed for length limit)"
This is a profound question. In this work, we focus on investigating the existence of the curse on popular vision-language benchmarks and offering a minimal viable recipe to break it. We believe formally defining the curse of multilinguality, proving the theory, and designing detection mechanisms can each be impactful research projects, so below we provide intuitive explanations for the existence of the curse and preliminary experiments to encourage future research.
The major root cause of the curse of multilinguality is the lack of model capacity to learn new capabilities (e.g., understanding new concepts, domains, or languages). Thus, a plausible way to detect the existence of the curse, as you indicated, is to examine the model's internal state of "language interference". We therefore designed a "gradient conflict" analysis for this purpose, inspired by PCGrad [1], which was originally proposed for multitask learning and later extended to multilingual settings on XLM [2]. Specifically, we used the XM3600 dataset, which contains image-text pairs in 36 languages, to compute gradients from our model checkpoints and analyze cross-lingual interference. We measured the cosine similarity between gradients from English examples and those from each non-English language, and then averaged these similarities across all non-English languages. Here, the checkpoints are all pre-trained on the Worldwide 29B data, and we take the checkpoints at the 16th (midway) and 32nd (final) epochs.
| Checkpoint | MetaCLIP2-L/14 | MetaCLIP2-H/14 |
|---|---|---|
| Midway (Epoch 16) | 0.508 | 0.688 |
| Final (Epoch 32) | 0.546 | 0.697 |
As we can see, the smaller model (L/14) has lower similarity, i.e., more interference and gradient conflict when the model attempts to learn from different languages, than the larger one (H/14). With more interference, L/14 performs worse on English tasks when trained on multilingual data than on pure English data, since the model spends more of its valuable training steps reducing conflicts throughout training instead of learning semantics from the various languages. In contrast, the larger model (H/14) experiences less gradient conflict even at early training stages, which we believe enables it to focus on learning from both English and non-English data properly, so that knowledge from both sides is integrated and mutually beneficial, thus breaking the curse.
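For readers who want to reproduce this kind of analysis, here is a minimal PyTorch sketch of the gradient cosine-similarity measurement; `model`, `clip_loss`, and the batch variables are placeholders, while the actual analysis uses the pre-trained checkpoints and per-language XM3600 batches.

```python
import torch
import torch.nn.functional as F

def flat_grad(model, loss):
    """Backprop `loss` and return all parameter gradients flattened into one vector."""
    model.zero_grad()
    loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

def gradient_conflict(model, clip_loss, en_batch, non_en_batches):
    """Average cosine similarity between the English gradient and each non-English gradient."""
    g_en = flat_grad(model, clip_loss(model, *en_batch))
    sims = []
    for batch in non_en_batches:  # one batch per non-English language
        g_lang = flat_grad(model, clip_loss(model, *batch))
        sims.append(F.cosine_similarity(g_en, g_lang, dim=0).item())
    return sum(sims) / len(sims)
```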
Compared to gradient conflict, which has the potential to detect the curse, discovering the underlying scaling law is a common approach to predicting when the curse will happen. We conducted a "tiny" scaling-law experiment where, for smaller models, we gradually increase the training data size and search for the first size at which the model's learning saturates (i.e., the model trained with English and Worldwide data yields the same performance on ImageNet if not yet saturated, or the curse starts appearing once saturated). We found this size to be ~800M Worldwide pairs (or ~320M English pairs) for ViT-B/16 (~86M parameters). Combined with our previous experiments, where we found ViT-L/14 (~307M) saturates at around 2.5B/1B Worldwide/English pairs, we roughly fit these data points to a straight line for the "scaling law" and obtain a rough estimate of ~8 pairs per parameter to break the curse. This correctly predicts that when we scale the training data to match frontier efforts (i.e., 2B English pairs, and correspondingly 5B Worldwide), ViT-H/14 (~632M) is needed. Note this estimate is very rough (only two data points fit to a line) due to the time constraints of the rebuttal, is specific to our experimental setup, and depends on factors such as data processing, model architecture, and training schedule. All in all, this empirical study explains why the curse of multilinguality impacts smaller models more severely: it reflects a scaling constraint, where the model lacks sufficient capacity to fully absorb and generalize from the multilingual signals present in large-scale training data.
Due to time constraints, these are preliminary studies aiming at offering insights. With the release of our paper, model and code, we hope to enable the community to conduct more comprehensive investigations into the mechanisms underlying the curse of multilinguality in CLIP-style models.
"Although the naming of the paper and the scale of the experiments suggest that the paper is from Meta, ... (removed for length limit)"
We appreciate your attention to maintain a high standard for paper review. We have removed that accordingly in our current/future preprints.
“The curse of multilinguality raises fundamental questions about the design choices of CLIP models. If multilingual training hurts performance at low scales, would language-specific CLIP models outperform a large unified one? This trade-off between specialization and generalization deserves deeper investigation.”
This is a very insightful question. Existing evidence shows that B/16 and L/16 SigLIP (English only) are worse than SO400M mSigLIP (76.7/80.5 vs. 80.7 on ImageNet, from the SigLIP 1/2 papers), while B/16 SigLIP outperforms B/16 mSigLIP (76.7 vs. 75.1 on ImageNet). This suggests that at small scale (i.e., sizes smaller than the point where models start breaking the curse, in this paper H/14), specialization helps when the compute/model size is fixed (B/16), but a specialized model (L/16) underperforms a bigger, unified one (SO400M). It is exactly one of our contributions to show, with quantitative evidence, that there exists a critical point (again, H/14 in this paper) at or beyond which English and non-English data start to benefit each other and a unified model can outperform a specialized one of the same size, since language-specific models forgo the advantages of cross-lingual optimization. Note that we do not have performance for all model sizes in the above comparison (e.g., L/16 mSigLIP is missing in the original paper) and the available model sizes (i.e., B/16, L/16, SO400M) are discretized. This affects where the exact critical point is, e.g., some models larger than L/16 but smaller than H/14 might break the curse, but we believe our argument remains valid at different critical points.
Also, once a unified model above the critical point (i.e., where the curse is broken) is trained, it can serve as a teacher for distilling smaller unified CLIP models, which are expected to perform better than training from scratch (e.g., we quickly put together an experiment to distill a B/32 from the H/14 during this rebuttal, and our initial result shows the distilled B/32 achieves 65.5%+ on ImageNet, compared to 64.7% when training B/32 from scratch; a sketch of the distillation objective is given below).
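To make the distillation idea concrete, below is a hedged sketch of one common way to distill a small CLIP student from a larger teacher: the standard contrastive loss is mixed with a KL term that matches the teacher's image-text similarity logits. The function names and loss mixture are illustrative assumptions, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def clip_logits(image_emb, text_emb, logit_scale):
    """Image-to-text similarity logits for a batch of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return logit_scale * image_emb @ text_emb.t()

def distill_loss(student_out, teacher_out, temperature=2.0, alpha=0.5):
    """Mix the CLIP contrastive loss with a KL term toward the teacher's soft targets."""
    s_logits = clip_logits(*student_out)            # (image_emb, text_emb, logit_scale)
    with torch.no_grad():
        t_logits = clip_logits(*teacher_out)
    labels = torch.arange(s_logits.size(0), device=s_logits.device)
    contrastive = 0.5 * (F.cross_entropy(s_logits, labels) + F.cross_entropy(s_logits.t(), labels))
    kl = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  F.softmax(t_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return alpha * contrastive + (1 - alpha) * kl
```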
“Is the curse of multilinguality also observed for mSigLIP? Or is it specific to the worldwide data proposed by the authors?”
Yes. As mentioned in the previous question, for B/16, SigLIP (English) and mSigLIP (multilingual) achieve 76.7 and 75.1 accuracy respectively on ImageNet. For SO400M, SigLIP and mSigLIP achieve 82.2 and 80.7 accuracy respectively on ImageNet. Both results show that mSigLIP (trained for multilinguality) yields worse performance in English tasks (ImageNet) than its counterpart (SigLIP).
“The authors mention that the code and model will be released, but will the worldwide metadata and the entry probabilities/counts also be released?”
Yes, we will release the metadata and their counts and probabilities. We believe releasing these intermediate artifacts greatly reduces engineering burden and encourages open research.
[1] Yu et al., Gradient Surgery for Multi-Task Learning, NeurIPS 2020
[2] Wang et al., On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment, EMNLP 2020
Thanks for the authors' response. Most of my concerns have been addressed. I encourage the authors to include these experiments in the final version of the paper.
- Regarding the gradient alignment experiment, this might hint that the larger model learns better language-agnostic representations which results in less harmful gradient updates from different languages.
- "we quickly put together an experiment to distill..." Is the distilled model also breaking the curse of multilinguality?
Thank you for taking the time to review our rebuttal. We will include these experiments in the final version.
- We think your explanation is reasonable and the better learning could be enabled due to larger model capacity.
- While the distillation training is still in progress, current trends show improvements over training from scratch on ImageNet. This suggests that distilling from a big model with stronger cross-lingual alignment could help break the curse for a model smaller than ViT-H/14. This is an estimate, and we are waiting for the final results to confirm the smallest model size that breaks the curse with distillation training.
Thank you, all my concerns have been addressed. I encourage the authors to share the final results of the distillation experiment if the rebuttal window allows it.
Thank you again for the review and suggestion. We will definitely share the results once we have reached a conclusion with the experiments. However, the distillation takes computational resources and time as well, and the timeline probably won't fit into the rebuttal schedule.
This paper proposes MetaCLIP-2, which extends the previous version of MetaCLIP from English to multiple languages. This work proposes a dynamic threshold t_lang adjustment method based on English data to balance the concept distribution. It achieves good results on non-English multilingual benchmarks while maintaining comparable performance to previous methods on English-language tasks. This paper conducts detailed ablation experiments to prove that English data can help improve non-English performance, and the model performs well on cultural diversity tasks.
Strengths and Weaknesses
Strengths
- MetaCLIP-2 is currently the best multilingual image-text alignment model, which will benefit many domains, such as image generation or cross-regional cultural integration research.
- It provides a complete solution spanning metadata construction, multilingual training, and dynamic, language-dependent filtering of training data.
- Detailed experimental evaluation and ablation experiments prove its effectiveness.
Weaknesses
- Although the MetaCLIP-2 shows good performance, it is more due to engineering optimization and massive resources, such as data and computing resources, which makes this paper more like a technical report than an academic paper.
- The 6% tail concept ratio is based on the experience of English data and may not be optimal for low-resource languages.
Questions
- The 6% tail concept ratio t_en is based on the experience of English data. Has the optimality of this ratio been verified for low-resource languages (such as African languages)? How to alleviate the filtering bias caused by the difference in data volume between languages?
Limitations
yes
Justification for Final Rating
After thoroughly reviewing the rebuttal and the manuscript, I believe all my concerns have been properly addressed, and I don't have any further questions.
Formatting Issues
none
We sincerely appreciate your comprehensive review and positive comments, including recognizing our work as 'the best multilingual image-text alignment model', 'a complete solution', and highlighting our 'detailed experimental evaluation'. Below, we address the reflective questions you raised and hope our responses clarify your concerns:
“Although the MetaCLIP-2 shows good performance, it is more due to engineering optimization and massive resources, such as data and computing resources, which makes this paper more like a technical report than an academic paper.”
This work studies the scaling of inter-language data (which poses brand-new challenges that do not exist in intra-language, e.g., English-only, scaling) and the curse of multilinguality that emerges in inter-language scaling. We present a proof-of-concept recipe, with only the necessary changes and the minimal viable model capacity, showing that ViT-H/14 is the first critical point that breaks the curse. All of this marks a 0-to-1 change in the spirit of an academic paper, enabling further scientific discovery to better understand the curse and improve solutions. We fully agree that training a worldwide CLIP model involves non-trivial engineering challenges. We provide a solid solution to these challenges, which have largely remained unaddressed in the community, as a foundation for future academic research and to lower the implementation barrier. Specifically, we deliberately minimize CLIP model architecture changes to reduce engineering complexity and concentrate solely on data research. We believe our recipe will pave the way for future studies to continue improving model performance.
Also, we deliberately share comprehensive implementation details to reduce the barrier to scientific exploration and encourage the application of open research principles. We believe our emphasis on detail and transparency will empower the community to accelerate innovation. Our work stands in contrast to a growing trend where large labs with substantial resources become opaque and private, particularly regarding their data. Our openness is not just a practice, but a scientific contribution in itself.
“The 6% tail concept ratio t_en is based on the experience of English data. Has the optimality of this ratio been verified for low-resource languages (such as African languages)? How to alleviate the filtering bias caused by the difference in data volume between languages?”
This is a very insightful question. Being agnostic to data volume is also our goal (lines 191-195 in the main paper), and that is why we did not use t_en for all languages but instead use p (6%, the ratio of tail to all concepts), since it is volume-independent. The underlying assumption is that there is a "sweet spot" balancing the new and common concepts humans perceive every day that makes most people feel comfortable: too many common concepts make humans bored, while too many new ones deplete cognitive capacity so that effective learning cannot be achieved. Although this assumption would require rigorous experiments in cognitive science to verify, it appears empirically beneficial in previous work on English data and CLIP models (where the "sweet spot" is used to decide whether a concept is from the long tail and interesting enough to upsample). Compared to setting a hard limit t_lang for each language, this assumption offers a language-agnostic way to guide the curation of distributions across languages. A sketch of how a per-language threshold can be derived from p is given below.
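The sketch below shows one way such a volume-independent ratio could be turned into a per-language threshold; it assumes the tail fraction is measured over matched pairs, which is an illustrative simplification rather than the paper's exact criterion.

```python
import numpy as np

def find_t_lang(entry_counts: np.ndarray, p: float = 0.06) -> int:
    """
    Hypothetical sketch: choose the smallest cap t such that "tail" entries
    (those matched at most t times) account for at least a fraction p of all
    matched pairs in this language. The paper's exact criterion may differ.
    """
    counts = np.sort(entry_counts)
    cumulative = np.cumsum(counts) / counts.sum()  # pair mass covered by the smallest entries
    idx = int(np.searchsorted(cumulative, p))      # first index reaching the target ratio
    return int(counts[min(idx, len(counts) - 1)])

# Toy example: a language whose per-entry match counts follow a heavy-tailed distribution.
rng = np.random.default_rng(0)
toy_counts = rng.zipf(1.5, size=10_000)
print("t_lang ~=", find_t_lang(toy_counts))
```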
During the pre-training stage, the target is to enable models to learn general capabilities and comprehensive knowledge, and the goal of this work is to present a proof-of-concept recipe for breaking the curse of multilinguality. Thus, we design our algorithm to handle languages one by one, so that the language distribution in the pre-training data preserves and truly reflects the phenomena observed in the real world (e.g., the ratio of English vs. non-English speakers or example occurrences on the web, and the general human interest in balancing new and common concepts). With this design, the pre-trained model learns general capabilities such as semantic understanding and vision-language alignment.
It is very likely that such a design is not optimal for low-resource languages and may inherit biases from the web (e.g., certain languages or cultures are under-represented due to social or economic reasons). Such bias is better handled during fine-tuning (see the example experiment described below), as tuning another upsampler for low-resource languages at the pre-training stage may over-complicate the data algorithm and reduce generalizability. Meanwhile, we also observe a broader challenge: current benchmarks may not fully capture the bias against low-resource languages (see more discussion in the section "Limitation on Benchmark" in the supplementary materials), and we believe it takes community effort to improve iteratively (i.e., better data algorithms such as this work encourage better benchmarks, which in turn motivate further algorithmic improvements).
To show the feasibility of handling the bias with fine-tuning, we design an experiment in which we take the pre-trained MetaCLIP worldwide model and further fine-tune it on a dataset emphasizing low-resource languages. We downsample each of the top-10 languages in the training set (comprising 94% of the training data) to the same portion as the 11th language. With just 5k steps of fine-tuning, we see the following boost on low-resource languages from CVQA:
| Language | Before | After |
|---|---|---|
| Bulgarian | 57.1 | 58.5 |
| Malay | 62.9 | 67.6 |
| Igbo | 34.0 | 42.0 |
| Oromo | 29.4 | 36.4 |
| Hindi | 66.2 | 70.6 |
| Sinhala | 43.1 | 46.7 |
| Mongolian | 33.3 | 34.3 |
We further ablate various ratios (3%, 6%, 10%) for building worldwide data with B/32 (due to time constraints during the rebuttal, we can only work at B/32 scale, and the full 29B training schedule will not fit into this period).
| p | Babel-IN | XM3600 | CVQA (EN/LOCAL) |
|---|---|---|---|
| 3% | 33.7 | 41.2 / 53.7 | 51.0 / 48.1 |
| 6% | 33.3 | 41.6 / 53.9 | 50.4 / 47.7 |
| 10% | 33.0 | 41.6 / 53.7 | 50.3 / 48.4 |
As discussed above, we can see there is no common trend in downstream (Babel-IN, XM3600, CVQA) performance when tuning the ratio in different directions, which suggests more sophisticated tuning is required for specific downstream use cases. Such tuning is better applied at the fine-tuning stage.
Thank you for the detailed response. I appreciate the effort put into addressing my concerns. After thoroughly reviewing the rebuttal and the manuscript, I believe all my concerns have been properly addressed, and I don't have any further questions.
Thank you for taking the time to review our rebuttal. We appreciate your thoughtful feedback and engagement.
This paper addresses an important challenge in CLIP models, where training on real-world multilingual data typically degrades English performance compared to English-only models, preventing effective utilization of worldwide datasets. The authors propose a successful recipe for training CLIP from scratch on native worldwide image-text pairs across 300+ languages. The main approach includes 3 key components: (1) scaling metadata from 500k English entries to 27M multilingual entries, (2) implementing language-specific curation algorithms with adaptive thresholds, and (3) designing a training framework that scales seen pairs and requires sufficient model capacity. On the experimental side, the authors also provide comprehensive validation on various tasks and achieved new state-of-the-art results on multilingual benchmarks.
Strengths and Weaknesses
Strengths
- This work addresses a fundamental limitation in current CLIP training that wastes a large portion of worldwide multilingual web data, e.g., English-only models perform better than multilingually trained ones.
- This work proposes a simple yet effective recipe, combining metadata scaling, a curation algorithm, and a training framework, to train CLIP models from scratch that outperform existing models on multiple English and multilingual benchmarks.
- The overall writing and description are straightforward and easy to follow. The authors have discussed some of the limitations well, such as the large memory footprint of multilingual tokenizers.
Weaknesses
- Despite the diverse languages included in the dataset, I wonder if some languages are truly helpful for learning better multilingual capabilities. For example, if language A covers concept X and language B also covers concept X, I understand how the model positively learns from both languages A and B for concept X through this shared knowledge. But would scenarios where one language is missing some knowledge also be beneficial? I am curious to hear the authors' insights.
- It would be useful for many downstream language-specific tasks to see which languages perform best/worst and how this correlates with training data volume across multilingual benchmarks.
- I wonder how the authors handle cases with text in images, such as images with text in language A but where the alt-text is in a different language B, and in some cases they are not well aligned.
Questions
Please try to address my question in the above Weaknesses section.
Limitations
Yes
Justification for Final Rating
I have read the authors' rebuttal, and most of my concerns, specifically related to multilingual benchmarks and language-specific performance are properly addressed, and I don't have further questions.
Formatting Issues
No major concerns.
We sincerely appreciate your dedicated review and recognition of our work as addressing 'a fundamental limitation in current CLIP training' and for describing our approach as 'a simple yet effective recipe'. We address your profound questions as follows:
“Despite the diverse languages included in the dataset, I wonder if some languages are truly helpful for learning better multilingual capabilities. For example, if language A covers concept X and language B also covers concept X, I understand how the model positively learns from both languages A and B for concept X through this shared knowledge. But would scenarios where one language is missing some knowledge also be beneficial? I am curious to hear the authors' insights.”:
Yes, we do see that missing concepts in one language can be recovered from adjacent knowledge in other languages. We believe this is one of the major reasons the curse of multilinguality can be broken (cf. Table 1 in the main paper, where the Worldwide MetaCLIP H/14 outperforms English-only MetaCLIP on English benchmarks and outperforms its non-English-only counterpart on multilingual benchmarks).
We design an experiment to further illustrate how concepts in one language benefit understanding in other languages. We take images from CVQA and prompt the worldwide and English-only H/14 models in English to examine their image understanding capability. We found that the model trained with worldwide data performs better when the images are about culture- or region-specific concepts, such as local food, transportation methods, or cultural activities. For example:
- There is one image in CVQA that shows “Tankwa” (boat-like transportation from Ethiopia). The worldwide model correctly recognizes it as a method of water transportation, while the English-only model predicts it as wheeled vehicles like bikes. (Here we prompted the model in random order with: water transportation, wheeled vehicles, aircrafts, or rails)
- One image shows the full body of “shachihoko” (a mythical Japanese creature with a tiger head and carp body). The worldwide model correctly predicts it as a creature in Japanese folklore; the English-only model predicts a carp. (prompt: creature in Japanese folklore, dragon, carp, tiger)
- One image shows “Dragon's beard candy” (a Chinese confection made from stretched fine yellow strands of maltose syrup, with strands drawing various patterns). The worldwide model correctly predicts it was made of sugar; the English-only model predicts glass. (prompt: sugar, silk, glass, silicon)
Note that the experiment above is only meant to extract some insight and is not a rigorous evaluation. We believe concepts are hierarchical and continuous, so in almost all cases there is some overlap between concepts (e.g., both water transportation and wheeled vehicles are transportation mechanisms, so both the worldwide and English models get the concept right one level up). Rigorous evaluation requires careful design of the examples/benchmarks with a deep understanding of the concept hierarchy.
“It would be useful for many downstream language-specific tasks to see which languages perform best/worst and how this correlates with training data volume across multilingual benchmarks.”:
This is a very insightful suggestion. Below, we summarize downstream task performance on XM3600 for the top-10 languages occurring in the training data versus the rest:
Top-10 languages in training data:
| Language | Text-to-Image | Image-to-Text |
|---|---|---|
| en | 51.6 | 62.2 |
| es | 57.2 | 72.5 |
| fr | 67.1 | 78.5 |
| zh | 61.1 | 72.6 |
| ru | 67.8 | 79.9 |
| ja | 65.1 | 79.9 |
| id | 65.8 | 78.3 |
| pt | 60.4 | 72.6 |
| de | 69.2 | 83.6 |
| vi | 61.1 | 76.2 |
Languages not in the top-10 training languages:
| Language | Text-to-Image | Image-to-Text |
|---|---|---|
| ar | 47.4 | 60.8 |
| bn | 39.4 | 47.1 |
| cs | 51.0 | 66.1 |
| da | 61.0 | 75.1 |
| el | 52.1 | 68.4 |
| fa | 56.9 | 70.3 |
| fi | 59.3 | 73.7 |
| fil | 24.8 | 36.7 |
| hi | 26.1 | 41.8 |
| hr | 57.3 | 72.9 |
| hu | 63.9 | 76.5 |
| it | 64.0 | 78.2 |
| he | 60.8 | 76.2 |
| ko | 54.8 | 70.1 |
| mi | 0.5 | 1.2 |
| nl | 53.2 | 66.9 |
| no | 57.7 | 73.2 |
| pl | 61.4 | 75.9 |
| quz | 2.5 | 6.5 |
| ro | 64.8 | 77.8 |
| sv | 57.6 | 73.8 |
| sw | 10.0 | 16.6 |
| te | 26.1 | 37.1 |
| th | 57.7 | 71.4 |
| tr | 55.7 | 68.4 |
| uk | 60.0 | 74.7 |
We do see correlation to some extent, e.g., in general the performance of the top-10 languages is higher than 50%, whereas the performance of many languages not in the top 10 (e.g., bn, hi) is lower than 50%. However, performance is not decided strictly by volume. For example, English, despite having the largest volume, is not the best-performing language (which is de) and is worse than many others; many non-top-10 languages also achieve 50%+ performance.
Beyond the volume of available data, we believe there are several other factors affecting performance: (1) the linguistic and cultural proximity to other languages, and (2) the structural characteristics or expressiveness of the language itself.
“I wonder how the authors handle cases with text in images, such as images with text in language A but where the alt-text is in a different language B, and in some cases they are not well aligned.”:
This is a great question. We believe such data is present in noteworthy amounts in our training set due to the Internet scale of our data volume. Our data curation process deliberately retains such data, as 1) this type of data enables cross-modal translation capability in our model as a by-product, and 2) designing special processing (e.g., OCR and translation to extract the text and check its alignment with the alt-text) may over-complicate the data algorithm, which hinders generalizability and reusability and introduces unbounded biases to the distribution of curated data. To validate the first argument, we conducted the following preliminary experiment. We take an image that visually shows the Chinese character "狗", which means dog. We then compute cosine similarity with different concepts to classify the image, and show the results (classes and similarities) below (a sketch of this zero-shot probe follows the table). As expected, the class for "狗" (the "dog" character) gets the highest cosine similarity. Interestingly, "dog" tops the candidates in English, and "いぬ" (dog in Japanese) tops the Japanese candidates and gets a much higher cosine similarity than "dog" in English, probably due to the closer correlation between Chinese and Japanese. These facts suggest our models do pick up cross-modal translation capabilities from the data, even with potential noise and misalignment between OCR text and alt-text. We believe that at Internet scale such noise in the data is inevitable, but being faithful to the natural distribution mitigates most of it.
| Word | Description | Score |
|---|---|---|
| 狗 | "dog" in Chinese (exactly visualized on image) | 0.5432458 |
| 犬 | "dog" in Chinese, literary/ancient usage | 0.04636497 |
| 猫 | "cat" in Chinese | 0.0002509505 |
| 豺 | "jackal" or "wild dog" in Chinese | 0.03426899 |
| 狼 | "wolf" in Chinese | 0.014051365 |
| dog | English "dog" | 0.08239085 |
| diagram | Unrelated word | 0.0014264605 |
| cat | English "cat" | 0.00005071296 |
| puppy | English "puppy" | 0.028257972 |
| hound | English "hound" | 0.055857323 |
| いぬ | "dog" in Japanese | 0.19319624 |
| ねこ | "cat" in Japanese | 0.0006383567 |
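Below is a minimal sketch of this kind of multilingual zero-shot probe; it assumes an OpenCLIP-style API, a hypothetical checkpoint tag, and a placeholder image path, so it illustrates the procedure rather than reproducing the exact numbers above.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

# Hypothetical checkpoint tag for illustration; substitute the released MetaCLIP 2 weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-H-14", pretrained="metaclip_worldwide")
tokenizer = open_clip.get_tokenizer("ViT-H-14")

candidates = ["狗", "犬", "猫", "dog", "cat", "puppy", "hound", "いぬ", "ねこ"]
image = preprocess(Image.open("gou_character.png")).unsqueeze(0)  # image showing the character 狗
text = tokenizer(candidates)

with torch.no_grad():
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    txt_emb = F.normalize(model.encode_text(text), dim=-1)
    sims = (img_emb @ txt_emb.t()).squeeze(0)

for word, score in sorted(zip(candidates, sims.tolist()), key=lambda x: -x[1]):
    print(f"{word}\t{score:.4f}")
```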
Thanks the authors for their detailed response! Most of my concerns are addressed and I don't have further questions.
Thank you again for looking into our responses and providing feedback on our submission.
Dear Authors and Reviewers,
I would like to thank the authors for providing detailed rebuttal messages. I would also like to thank reviewers pL9G and o5PA for already engaging in further discussion!
For the other reviewers, I would like to encourage you to carefully read all other reviews and the author responses and engage in an open exchange with the authors. Please post your first response as soon as possible within the discussion time window, so there is time for back and forth discussion with the authors. All reviewers should respond to the authors, so that the authors know their rebuttal has been read.
Best regards, AC
The paper presents a recipe for training a CLIP model on worldwide, multilingual, web-scale data. By scaling the model capacity and training data the "curse of multilinguality" can be overcome. The resulting model surpasses the English-only variant and achieves several new state-of-the-art results.
Strong and thorough experimental results are highlighted by the reviewers. Further strengths include addressing the important problem of how to effectively leverage the vast amount of non-English data on the web. Clarity and presentation were praised across reviews.
Transparent methods and code will be of interest and value to the research community.
One weakness raised by several reviewers is that the benefits are mostly demonstrated at large model scale. In addition, the method adds engineering complexity.
Reviews, rebuttals, and the discussion period were used well.
I strongly recommend to accept the paper to NeurIPS, the paper is solid, the results are highly relevant, and there is agreement to accept among all reviewers.