Aya Vision: Advancing the Frontier of Multilingual Multimodality
We release the recipe to build state-of-the-art multilingual multimodal models
Abstract
Reviews and Discussion
In order to address the challenges of data scarcity and catastrophic forgetting in building large multilingual multimodal models, this paper proposes a synthetic annotation framework to generate high-quality and diverse multilingual textual data, as well as a simple and effective cross-modal model merging technique. Extensive experimental analyses confirm that this is a solid piece of work.
Strengths and Weaknesses
Strengths:
- Effective synthetic annotation framework. The authors provide a detailed description and demonstration of the pipeline, including data collection, recaptioning, filtering, and translation. They also present multiple metrics to show improvements in text quality and diversity, which strengthens the validity of their approach.
- Effective cross-modal model merging. The proposed training-free method is simple and lightweight. Subsequent ablation studies clearly demonstrate its effectiveness.
- Comprehensive experimental evaluation. The authors conduct extensive experiments on a wide range of multilingual multimodal benchmarks. The ablation studies are well-designed and sufficient to support the effectiveness of the proposed methods.
- Solid contributions. The dataset, method, and models proposed in this work have the potential to significantly advance research in the field of multilingual and multimodal learning.
Weaknesses:
- Subpar performance. Despite having a larger parameter size, Aya-Vision-8B underperforms compared to Qwen-2.5-VL-7B on most of the benchmark tasks, as shown in Table 1.
- Inadequate Filtering. In the second stage of the filtering process in Section 2 (line 131), a unimodal LLM is employed to detect semantic inconsistencies between the original and recaptioned captions to mitigate hallucinations and nuanced errors. However, due to the lack of visual information, it may fail to identify hallucinations introduced by the MLLM, making the semantic filter appear inadequate.
- Ineffective Post-Editing. In the post-editing stage of the Hybrid Translation Pipeline, refining the translated text without referencing the original source may fail to effectively correct issues like linguistic artifacts, cultural biases, and semantic misalignments introduced by machine translation. For example, as shown in Figure 2, the Chinese translation of a rephrased prompt changes the meaning from "what kind of beverage is in the French press" to "what kind of beverage is usually brewed in the French press", shifting the focus from the actual image content to general knowledge. This illustrates the limitation of the current approach in preserving task-specific semantics during translation.
- Typographical error. Line 176: "cross-model" should be corrected to "cross-modal". In line 167, there is an unnecessary comma following the colon, which should be removed.
- Format Issues. It is recommended to standardize the use of the hyphen in “Aya-Vision” across Sections 5 and 6 for consistency. The description in Figure 4 does not accurately match the image content, as the left panel shows six compared models, whereas only five are mentioned in the caption.
Questions
- In the Hybrid Translation Pipeline described in Section 2, the authors note that GPT models perform well in high-resource languages but often struggle in low-resource settings. This is indeed a common issue among current large language models. However, the authors ultimately choose the NLLB-3.3B model as their translation model. What are the unique advantages of this model, and does it effectively address the aforementioned issues in low-resource languages? This requires further clarification from the authors.
- In Line 167, the authors mention generating a balanced fine-tuning mix from three data sources. The reviewers would like to know whether this "balanced" ratio was determined through experimentation or based on prior experience. Furthermore, unlike many previous works, this study does not incorporate text-only data. The authors should explain the rationale behind this design choice.
Limitations
yes
Final Justification
Thank you to the authors for the responses, which have effectively addressed my concerns.
Formatting Issues
n/a
We sincerely thank the reviewer for their exceptionally thorough and constructive feedback. We are particularly heartened by their overall assessment of our paper as "a solid piece of work" and by their recognition of the "effective synthetic annotation framework" and "effective cross-modal model merging", as well as the "comprehensive experimental evaluation".
In the Hybrid Translation Pipeline described in Section 2, the authors note that GPT models perform well in high-resource languages but often struggle in low-resource settings. This is indeed a common issue among current large language models. However, the authors ultimately choose the NLLB-3.3B model as their translation model. What are the unique advantages of this model, and does it effectively address the aforementioned issues in low-resource languages? This requires further clarification from the authors.
We chose NLLB-3.3B primarily because it offers a good balance between translation quality and computational efficiency. Unlike proprietary models, NLLB-3.3B is open-source, relatively lightweight, and cost-effective for inference at scale, which is particularly important when translating large volumes of text across many languages, as we can deploy it locally. Beyond cost, another core motivation for choosing NLLB-3.3B is its language coverage. It was explicitly designed to translate many low-resource languages, enabling better initial translations in those cases.
In Line 167, the authors mention generating a balanced fine-tuning mix from three data sources. The reviewers would like to know whether this "balanced" ratio was determined through experimentation or based on prior experience. Furthermore, unlike many previous works, this study does not incorporate text-only data. The authors should explain the rationale behind this design choice.
We thank the reviewer for this insightful question about our data mixture and rationale.
- On Determining the Balanced Ratio:
The final data mixture was determined through careful, iterative experimentation, rather than an exhaustive brute-force grid search. Our primary goal was to create a model that excels as a conversational multilingual multimodal assistant, and our experiments were guided by balancing three competing objectives:
- Maximizing Multilingual Performance: Our core objective was strong performance across 23 languages. We found that increasing the proportion of translated data improved multilingual win-rates, but only up to a point. Beyond a certain threshold, as shown in Figure 7, we observed a drop in performance, likely due to reduced cross-lingual transfer.
- Maintaining High-Quality English Performance: While our focus was multilingual, strong English performance was also crucial. We observed that adding more English-only synthetic data improved English win-rates but began to degrade performance in other languages.
- Adhering to Academic Benchmark Formatting: As shown in our ablations (Figure 9), a portion of the original academic datasets was necessary to ensure the model’s outputs aligned with the specific formats required for evaluation. The final ratio presented in the paper represents the optimal trade-off we identified through this experimental process to satisfy all three objectives effectively.
- On the Rationale for Excluding Text-Only Data:
As we show in Figure 6 and Section 7, addition of text-only data improves text-only win-rates but does not transfer over to multimodal tasks. Moreover, it introduces another variable to be tuned to find the optimal data mixture. Our cross-modal merging approach, on the other hand, improves both text-only and multimodal win-rates significantly without training – saving thousands of H100 GPU-hours in experimentation.
We appreciate the opportunity to better explain our reasoning behind these two design decisions and to strengthen the paper. We shall add the above discussion to the final version of the manuscript.
Subpar performance Despite having a larger parameter size, Aya-Vision-8B underperforms compared to Qwen-2.5-VL-7B on most of the benchmark tasks, as shown in Table 1.
We would like to reiterate that Aya Vision models are optimized for fluent open-ended generations, which are not reflected in the academic benchmarks shown in Table 1. These tasks are typically evaluated in a 0-shot manner with the answer and format closely resembling the training set. Nonetheless, we find that Aya Vision models are still competitive in their model class. This was one of the key motivations for introducing Aya Vision Bench as an open-ended generative evaluation that holistically evaluates a model's generation.
Additionally, since the results for widely used academic evals are reported on the validation set with the answers readily available on the internet, there is a possibility of inadvertent test set contamination for some of the models used for comparison.
Inadequate Filtering In the second stage of this filtering process in Section 2 (line 131), a unimodal LLM is employed to detect semantic inconsistencies between the original and recaptioned captions to mitigate the hallucinations and nuanced errors. However, due to the lack of visual information, it may fail to identify hallucinations introduced by the MLLM, making the semantic filter appear inadequate.
We thank the reviewer for the valuable comment. The primary goal of the second-stage filtering process was to detect incorrect annotations, specifically cases where the recaptioned data diverges from the ground truth answers. Such discrepancies often indicate that the recaptioned response contains more serious errors, in which case the sample is treated as a hard negative and is entirely removed from the training data. While we employ a text-only LLM to identify these inconsistencies, we acknowledge that this approach may be limited in detecting hallucinations that appear in the reasoning or explanation, or subtle errors that rely on visual context. Our choice of an LLM was motivated by considerations of scalability and efficiency. We agree that using a more visually grounded model could enhance hallucination detection, and we view this as a promising direction for future work.
Ineffective Post-Editing In the post-editing stage of the Hybrid Translation Pipeline, refining the translated text without referencing the original source may fail to effectively correct issues like linguistic artifacts, cultural biases, and semantic misalignments introduced by machine translation. For example, as shown in Figure 2, the Chinese translation of a rephrased prompt changes the meaning from "what kind of beverage is in the French press" to "what kind of beverage is usually brewed in the French press", shifting the focus from the actual image content to general knowledge. This illustrates the limitation of the current approach in preserving task-specific semantics during translation.
We agree with the reviewer on the importance of preserving task-specific semantics during post-editing. We would like to clarify that the original source text was provided during the rephrasing process. As detailed in Appendix Section N ("Translation Rephrasing Template"), the rephrasing model had access to both the machine-translated text and the original English prompt, which were used to guide the edits and ensure semantic consistency.
Typographical error & format Issues
We would like to thank the reviewer for bringing this to our notice. We are very happy to address these in the camera ready version.
Thank you to the authors for the responses, which have effectively addressed my concerns.
This paper introduces Aya Vision, a family of open-weight, multilingual, multimodal models in 8B and 32B sizes that operate across 23 languages. The work addresses the key challenges of multilingual data scarcity and "catastrophic forgetting," where models lose text-only capabilities after vision training. To solve this, the authors developed:
- A synthetic annotation framework that enhances and translates data using a process of recaptioning and context-aware rephrasing to create high-quality, fluent multilingual datasets.
- A training-free cross-modal model merging technique that blends the weights of the final multimodal model with its original text-only counterpart, which successfully preserves text performance while boosting multimodal capabilities.
As a result, Aya Vision models achieve state-of-the-art performance, outperforming significantly larger models on both multimodal and text-only tasks. The paper also contributes AyaVisionBench, a new benchmark for evaluating generative multilingual models.
Strengths and Weaknesses
Strengths
- This paper is well written and easy to follow.
- Instead of relying on simple machine translation, the authors developed a multi-stage synthetic annotation framework which improves the translation quality.
- Cross-modal model merging is a training-free and highly effective solution to catastrophic forgetting.
- The resulting Aya Vision models achieve strong performance across a range of evaluations.
Weaknesses
- The quality of the entire data pipeline and evaluation process is heavily dependent on the performance of other large-scale, often proprietary, models, for example command-r-plus and claude-3.7-sonnet. This reliance means that the results are contingent on the capabilities and potential biases of these external systems.
- While the paper's "cross-modal merging" is presented as a novel and effective empirical method, its underlying theoretical mechanism is not deeply explored.
Questions
See Weaknesses.
Limitations
Yes
Final Justification
My concern has been addressed and I'd like to raise my score to 5.
Formatting Issues
No
We thank reviewer 1h2E for their positive review of our work, emphasising that “cross-modal merging is a highly effective solution to catastrophic forgetting” and noting the “strong performance across a range of evaluations”.
The quality of the entire data pipeline and evaluation process is heavily dependent on the performance of other large-scale, often proprietary, models, for example command-r-plus and claude-3.7-sonnet. This reliance means that the results are contingent on the capabilities and potential biases of these external systems.
Thank you for raising a fair point regarding our model choices in the data pipeline. We agree that the choice of model can influence the pipeline's overall behavior. However, we would like to clarify that our approach is not dependent on proprietary models. For instance, in the translation component, where we currently use CommandR+ (a choice motivated by its strong multilingual performance across both high- and low-resource languages), one could substitute a recently released open-weight multilingual model without significantly impacting the pipeline's performance. Our goal is not to promote a specific model, but rather to introduce a method of leveraging available models, whether proprietary or open-weight, for tasks like recaptioning and correctness verification. In fact, the core contribution of our work lies in demonstrating how to systematically use these tools to generate high-quality synthetic data, not in the specific models themselves.
Moreover, we use different models for translation, rephrasing, and evaluation to reduce the risk of systematic bias or model-specific artifacts influencing both data generation and evaluation. By decoupling these roles, we ensure that the pipeline's results are not inadvertently optimized for the biases of any single model. As open-weight multilingual models continue to improve, we expect the pipeline to become increasingly independent of proprietary systems. Importantly, the core contribution of our work lies in the method, demonstrating that starting from a strong English-centric model, combining it with a robust translation and recaptioning pipeline, and systematically curating synthetic data can lead to substantial multilingual performance gains, regardless of the specific models used.

Regarding the evaluation, we would like to clarify that our choice of Claude-3.7-Sonnet as the evaluator in AyaVision Bench was based on preliminary experiments showing it had the highest agreement with human judgments among the evaluators we tested. As detailed in Appendix E.1.1, this decision was informed by a detailed comparative study using the translated Multimodal RewardBench across 8 languages, where Claude-3.7-Sonnet outperformed GPT-4o and Gemini-2.0-Flash by 6.4% and 25.8%, respectively, in preference ranking accuracy. While AyaVision Bench is an automatic win-rate evaluation, it is built on this careful evaluator selection to ensure reliability. Additionally, we complemented it with established academic multimodal benchmarks to provide a comprehensive and balanced evaluation of our model.
While the paper's "cross-modal merging" is presented as a novel and effective empirical method, its underlying theoretical mechanism is not deeply explored.
While we show the benefits of cross-modal merging through an empirical lens, a deep dive into the theoretical mechanism enabling this behaviour is out of the scope of this work. The theory behind the surprising effectiveness of model merging within a single modality is itself still an active area of research. Recent works such as Model Soups [1] and Linear Mode Connectivity and the Lottery Ticket Hypothesis [2] have shown that one can often find a linear path between two such solutions (i.e., the text-only model's weights and the fine-tuned multimodal model's weights) along which the loss remains consistently low. Our linear interpolation explicitly traverses this low-loss path. Finding an optimal α is equivalent to finding the sweet spot on this path that offers the best trade-off of specialized capabilities. The existence of these paths is a key enabler for successful model merging.
[1] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
[2] Linear Mode Connectivity and the Lottery Ticket Hypothesis
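For concreteness, below is a minimal sketch of the interpolation described above, assuming both checkpoints expose plain PyTorch state dicts with shared parameter names for the language-model weights; the file names, helper function, and alpha value are illustrative assumptions rather than our exact training code (the merge fraction is swept as in Figure 7).

```python
import torch

def cross_modal_merge(text_only_state, multimodal_state, alpha=0.6):
    """Linearly interpolate shared weights: merged = (1 - alpha) * text-only + alpha * multimodal."""
    merged = dict(multimodal_state)  # start from the multimodal checkpoint (keeps vision tower and connector)
    for name, w_text in text_only_state.items():
        if name in multimodal_state:  # interpolate only parameters present in both models
            merged[name] = (1.0 - alpha) * w_text + alpha * multimodal_state[name]
    return merged

# Usage sketch: load both checkpoints on CPU, merge, and pick alpha via a validation win-rate sweep.
text_only_state = torch.load("text_only_llm.pt", map_location="cpu")
multimodal_state = torch.load("multimodal_sft.pt", map_location="cpu")
merged_state = cross_modal_merge(text_only_state, multimodal_state, alpha=0.6)
torch.save(merged_state, "merged_model.pt")
```

Sweeping alpha along this path and selecting the value with the best validation win-rate is exactly the trade-off search described above.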
Note to Area Chair: The paper’s title [Aya] reveals clues about the authors’ identities, which may have impacted my review.
This paper presents Aya Vision, a multilingual multimodal model that uses synthetic data and cross-modal merging to overcome data scarcity and catastrophic forgetting, achieving strong performance on vision and text tasks while outperforming larger models.
Strengths and Weaknesses
Strengths:
- Novel cross-modal model merging that efficiently preserves text performance while adding multimodal capabilities.
- Strong results across 23 languages, beating larger models with fewer parameters.
- Comprehensive ablation studies showing the impact of design choices clearly.
- Balanced multilingual data strategy that improves performance.
- New benchmark (AyaVisionBench) for evaluating multilingual multimodal systems.
- Commitment to open release, supporting reproducibility.
- Two sizes are considered: 8B and 32B
Weaknesses (not major):
- Missing details on training hardware/cost and scalability.
- Limited details on dataset quality.
- Heavy use of synthetic data may limit real-world generalization.
Questions
I give a complete score, no questions. If you want, see the weaknesses mentioned.
Limitations
yes
Final Justification
I will keep my positive score the same as before.
Formatting Issues
I did not notice any major formatting issues in this paper.
We would like to thank reviewer 5dHo for the support they have shown for our paper, highlighting its strong results, comprehensive ablations, and the novelty of the cross-modal merging technique.
Missing details on training hardware/cost and scalability.
We would like to thank the reviewer for drawing attention to this important aspect of model development. We discuss the computational footprint of training (in H100 GPU-hours) for Aya Vision 8B and 32B in Section I of the Appendix. To ensure accessibility for the reviewers and AC, we present the computational footprint for both models in the table below.
| Model | Alignment | SFT |
|---|---|---|
| Aya Vision-8B | 384 | 2176 |
| Aya Vision-32B | 3072 | 5120 |
Training compute requirements in H100 GPU-hours
Limited details on dataset quality.
We understand that data quality is crucial for effective model training. To ensure high-quality data, we implemented several measures and evaluated the dataset from multiple angles, including both downstream performance and quantitative metrics. Specifically, we compared model performance when trained on the recaptioned data versus the original data (Figure 9), observing a 17% gain in win rate from using our SFT data compared to training on the original data. To assess multilingual data quality, we used COMET scores (see Table 7 in the appendix), which improved from 0.75 following NLLB translation to 0.83 after post-editing, on average across 23 languages. We also analyzed average token length (which increased from 27.2 to 140.8) and lexical diversity (measured by MTLD, which improved from 11.0 to 61.2) in the recaptioning step. Additionally, we applied a two-step filtering process to enhance the correctness of the recaptioned data and reported the overall error rate (3.2%).
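As an illustration of how the lexical-diversity numbers above can be computed, here is a small sketch using the open-source lexicalrichness package; the package choice and the example captions are assumptions for illustration, not the exact tooling or data used in the paper.

```python
from lexicalrichness import LexicalRichness

original_caption = "A French press on a wooden table."
recaption = ("A glass French press with a metal plunger sits on a rustic wooden table, "
             "filled with a dark, freshly brewed beverage, next to an empty ceramic mug.")

for name, text in [("original", original_caption), ("recaptioned", recaption)]:
    lex = LexicalRichness(text)
    # lex.words is the token count; mtld() is the Measure of Textual Lexical Diversity
    print(f"{name}: tokens={lex.words}, MTLD={lex.mtld(threshold=0.72):.1f}")
```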
Heavy use of synthetic data may limit real-world generalization.
We thank the reviewer for this insightful question regarding the potential limitations of synthetic data. Our approach was directly motivated by the significant deficiencies in existing vision-language instruction tuning (IFT) datasets, which themselves limit real-world generalization. These datasets often lack naturalness, are overly task-specific with short-form answers, and are frequently compiled by scraping the web, resulting in noisy and poorly cleaned data. To address these issues, our synthetic reannotation framework builds on real images and their original captions. We use the original captions as ground truth and generate improved versions that are more natural, detailed, and fluent, thereby enhancing the quality of the training data. In this sense, our “synthetic” data is not artificially detached from reality but rather a refinement of real-world annotations. As shown in Figure 9, this strategy yields a 17% improvement in win rate directly attributable to our novel data framework. This result underscores that our synthetic approach, rather than being a limitation, is a crucial step to overcome the shortcomings of existing datasets, improve real-world performance, and strengthen generalization.
This paper proposes Aya Vision, a family of multilingual multimodal LLMs (8B and 32B) trained to follow instructions in 23 languages. The authors introduce an annotation pipeline that combines machine translation with LLM-based rephrasing to improve the quality and diversity of multilingual multimodal data. They also propose a training-free cross-modal model merging technique that preserves text-only performance while enhancing multimodal capabilities. Additionally, the paper provides a new multilingual benchmark for evaluation.
Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow.
- The idea of training-free modal merging is simple but effective.
- Figure 9 clearly shows that each component brings notable performance gains, supporting the effectiveness of the overall design.
- The proposed model outperforms existing open-weight baselines on multilingual multimodal benchmarks.
Weaknesses:
- Although the authors introduce a pipeline that combines machine translation with LLM-based rephrasing to curate training data, there is no comparison with directly using an LLM for translation. It remains unclear whether the proposed approach leads to better performance than using an LLM for translation alone. An ablation study would help clarify the benefit of combining machine translation with rephrasing.
- The merging approach blends weights between the text-only and multimodal models. It’s unclear whether simply using an intermediate checkpoint from early-stage multimodal fine-tuning could achieve a similar balance between text-only and multimodal performance. A comparison would help clarify whether merging offers a real advantage over this simpler alternative.
Questions
- How does the translation task improve performance on low-resource languages when using machine translation plus LLM-based rephrasing, compared to directly using an LLM for translation?
- Figure 7 shows that as alpha increases, the multimodal performance decreases in the left figure. When alpha is 1, the weights are entirely the multimodal weights, so it should have the best multimodal performance. I wonder why it shows a decreasing trend as alpha increases.
Limitations
Yes
Final Justification
The authors provided detailed replies to my questions and addressed my concerns. They clarified the necessity of using machine translation followed by an LLM for rephrasing, as opposed to directly using an LLM for translation. They also addressed my question about why multimodal performance improves when using model merging.
Formatting Issues
No
We would like to thank reviewer B7Pr for their detailed review. We are particularly grateful that they found our “training-free cross-modal model merging simple yet effective” and highlighted that “each component brought about notable performance gains” and “the model outperformed existing open-weight baselines on multilingual multimodal benchmarks”. We address their concerns and questions below:
Although the authors introduce a pipeline that combines machine translation with LLM-based rephrasing to curate training data, there is no comparison with directly using an LLM for translation. It remains unclear whether the proposed approach leads to better performance than using an LLM for translation alone. An ablation study would help clarify the benefit of combining machine translation with rephrasing.
We thank the reviewer for this insightful question and would like to clarify the design choice behind our multilingual data pipeline. While we did not conduct a direct ablation study, our approach is motivated and supported by several prior works. Although using large language models directly for machine translation is a straightforward option, prior studies have shown that LLMs only perform competitively on high-resource languages but still face challenges with low-resource languages [3, 4]. For instance, as reported in a related multilingual multimodal study [1], issues such as incorrect punctuation spacing, omission or misapplication of Arabic grammatical features (e.g., nunation), and untranslated English words are common when using LLMs alone. We observed similar issues during our own experiments.
On the other hand, machine translation models like NLLB demonstrate more consistent performance across languages but are not without limitations. Notably, translation artifacts such as translationese frequently occur. Therefore, in this paper, we propose a hybrid approach: using NLLB to generate initial translations as in-context references, followed by rephrasing with a multilingual LLM. We believe this yields more natural and accurate translations. This design is supported by recent studies [4, 5], which suggest that combining LLMs with traditional translation systems and leveraging examples for in-context learning can enhance translation quality, especially for low-resource languages.
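To illustrate the two stages, here is a minimal sketch of the hybrid flow, using the publicly available NLLB-3.3B checkpoint from Hugging Face transformers for the draft translation; the rephrase_with_llm call and the prompt wording are placeholders for whichever multilingual LLM handles the second stage, not our exact template (see Appendix Section N for the real one).

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Stage 1: draft translation with NLLB-3.3B (English -> Simplified Chinese).
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-3.3B", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

source = "What kind of beverage is in the French press?"
inputs = tokenizer(source, return_tensors="pt")
draft_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
    max_new_tokens=256,
)
draft = tokenizer.batch_decode(draft_ids, skip_special_tokens=True)[0]

# Stage 2: rephrase the draft with a multilingual LLM, showing it both the
# English source and the draft so task-specific semantics are preserved.
prompt = (
    "Rewrite the following machine translation so it reads naturally in the target language "
    "while preserving the meaning of the English source.\n"
    f"English source: {source}\n"
    f"Draft translation: {draft}\n"
    "Revised translation:"
)
# rephrase_with_llm is a placeholder for whatever multilingual chat model is available.
# final_translation = rephrase_with_llm(prompt)
```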
Moreover, given the known tendency for models to favor their own generations when evaluated on data they produced [6,7], we deliberately avoided using an LLM for translation in our evaluation data to prevent bias.
Finally, we evaluated multilingual data quality using COMET scores (Table 7, Appendix Section O), comparing outputs from NLLB alone versus NLLB followed by LLM-based rephrasing (improved from 0.75 to 0.83 on average). The results show consistent improvements across all 23 languages; for example, Japanese increased from 0.7271 to 0.8596, and Arabic improved from 0.7411 to 0.8213.
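For completeness, translation quality can be scored with the open-source unbabel-comet package; the sketch below uses a reference-free quality-estimation checkpoint as an illustrative assumption, and the exact COMET variant and data used for Table 7 are described in the appendix.

```python
from comet import download_model, load_from_checkpoint

# Reference-free COMET quality estimation: scores (source, translation) pairs directly.
model_path = download_model("Unbabel/wmt20-comet-qe-da")
comet_model = load_from_checkpoint(model_path)

data = [
    {"src": "What kind of beverage is in the French press?",
     "mt": "法压壶里是什么饮料？"},  # candidate translation to score (NLLB draft or post-edited)
]
prediction = comet_model.predict(data, batch_size=8, gpus=0)
print(prediction.system_score)  # corpus-level score; prediction.scores holds per-segment scores
```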
The merging approach blends weights between the text-only and multimodal models. It’s unclear whether simply using an intermediate checkpoint from early-stage multimodal fine-tuning could achieve a similar balance between text-only and multimodal performance. A comparison would help clarify whether merging offers a real advantage over this simpler alternative.
Based on the reviewer’s suggestion, we pick a checkpoint 70% into the training and do not perform any merging. We compare this model against a model trained to completion followed by cross-modal merging by calculating win-rates on a subset of Aya Vision Bench averaged across 7 languages. We present our results in the table below. The results demonstrate the efficacy of merging in improving the quality of generations over using a prior checkpoint.
| Training | Multimodal Win-Rates |
|---|---|
| 70% + no merging | 53.5% |
| 100% + merging | 68.9% |
How does the translation task improve performance on low-resource languages when using machine translation plus LLM-based rephrasing, compared to directly using an LLM for translation?
As discussed above, we evaluated the quality of multilingual data using COMET scores, widely regarded as a gold-standard metric for machine translation evaluation (Table 7, Appendix Section O). We compared outputs generated by NLLB alone with those refined through LLM-based rephrasing. The results demonstrate that rephrasing consistently improves translation quality across all 23 languages, with COMET score improvements ranging from 1.8% (Hindi) to 80.3% (Chinese).
Figure 7 shows that as alpha increases, the multimodal performance decreases in the left figure. When alpha is 1, the weights are entirely the multimodal weights, so it should have the best multimodal performance. I wonder why it shows a decreasing trend as alpha increases.
The trend observed in Figure 7 is a key finding of our work and highlights one of our core contributions. As the merge-fraction alpha increases beyond an optimal point (around 0.6 in this case), the multimodal win rate begins to decrease. This might seem counterintuitive, as an alpha of 1.0 represents the fully fine-tuned multimodal model. However, this phenomenon reveals a crucial benefit of our cross-modal merging technique. Merging with the original text-only LLM does not just preserve text-only capabilities; it also enhances the generation quality on multimodal inputs, an attribute that is captured by generative benchmarks like Aya Vision Bench and missed by most academic evals. Additionally, we observed that merging especially improves win-rates on low-resource languages. We show the increases in win-rates against Pangea-7B on Aya Vision Bench through merging, compared to the fine-tuned multimodal model prior to merging, for a few languages in the table below.
| Language | Pre-Merge | Post-Merge | Delta |
|---|---|---|---|
| deu_Latn | 47.41% | 69.17% | 21.76% |
| tur_Latn | 48.89% | 69.92% | 21.03% |
| kor_Hang | 55.56% | 76.12% | 20.56% |
| fra_Latn | 51.11% | 70.90% | 19.79% |
| por_Latn | 53.33% | 73.13% | 19.80% |
[1] PALO: A Polyglot Large Multimodal Model for 5B People
[2] Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
[3] How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
[4] Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
[5] Leveraging GPT-4 for Automatic Translation Post-Editing
[6] LLM Evaluators Recognize and Favor Their Own Generations
[7] LLM see, LLM do: Leveraging active inheritance to target non-differentiable objectives
Thank you for the clarifications and for addressing my concerns. I believe the idea of merging is an interesting finding. I will increase my rating to 5.
Dear reviewer B7Pr,
Thank you for your appreciation and recognition of our work. Your input has been extremely valuable in helping us strengthen this manuscript and we shall incorporate the above discussion in the camera-ready version.
This paper introduces Aya Vision, a family of multilingual multimodal models (8B and 32B) designed to address data scarcity and catastrophic forgetting. The contributions include 1) a synthetic annotation pipeline combining translation and rephrasing to improve multilingual data quality, 2) a training-free cross-modal model merging technique that preserves text performance while adding multimodal capabilities, and 3) the release of AyaVisionBench for multilingual multimodal evaluation.
Strengths:
- Novel and effective cross-modal merging strategy that avoids catastrophic forgetting.
- Demonstrated strong performance across 23 languages, outperforming larger baselines.
- Comprehensive ablations and detailed data quality analysis.
- Commitment to open release, improving reproducibility and community impact.
Weaknesses:
- Reliance on synthetic data raises questions about real-world generalization.
- Some concerns about reliance on proprietary models in the pipeline and limited theoretical explanation of merging.
- The proposed weight merging strategy should be compared with established alternatives, or at least discussed, such as blending text-only SFT data during multimodal training (i.e., NVLM, Dai et al.), to more clearly demonstrate its advantages.
- Aya-Vision-8B underperforms certain baselines on standard benchmarks.
- Missing details on compute costs and scalability.
Discussion & Rebuttal: The authors provided clarifications and additional analyses addressing most concerns. They detailed training compute requirements, explained dataset quality controls, and demonstrated that merging outperforms alternatives such as early checkpoints. They also clarified the rationale behind translation pipeline design and filtering choices. However, one key limitation remains: the lack of direct comparison between cross-modal weight merging and established approaches like mixing text-only data during multimodal training (as illustrated in NVLM, Dai et al.). This gap makes it harder to fully quantify the relative benefits of the proposed method. Reviewers acknowledged the strong empirical results overall and either maintained or raised their scores.
In summary: Reviewers are overall positive, with ratings ranging from accept (5) to strong accept (6). The work is technically solid, clearly written, and presents contributions of high potential impact for multilingual and multimodal learning. While some limitations remain (e.g., lack of comparison with established training strategies), they do not undermine the significance of the results.
Nevertheless, the Program Chairs have investigated and determined that this paper cannot be accepted under this year’s Responsible Reviewing Initiative, as at least one of the authors was found to be grossly negligent as a reviewer during the review process.