PaperHub
6.6 / 10
Poster · 4 reviewers
Scores: 3, 3, 4, 4 (min 3, max 4, std. dev. 0.5)
ICML 2025

Learning from others' mistakes: Finetuning machine translation models with span-level error annotations

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose Training with Annotations (TWA) to improve the quality of machine translation models with imperfect examples that include detailed error annotations.

Abstract

Keywords
machine translation, fine-grained annotations, multidimensional quality metrics

Reviews and Discussion

Review
3

This paper proposes training with annotations (TWA), a method for finetuning machine translation models using existing span-level error annotations. Unlike traditional approaches that rely on sequence-level feedback, TWA directly leverages fine-grained annotations for model training. The authors evaluate TWA on two language pairs from the WMT’23 test sets.

Questions for Authors

See numbers (1)-(5) above.

Claims and Evidence

The claims are generally sound, and the authors provide some empirical evidence to support them. However, I have concerns about their design choices and evaluation setup (detailed below). Overall, while the claims are reasonable, the experimental evidence is not entirely convincing and needs to be improved to show that their method really works.

Methods and Evaluation Criteria

The authors report results on WMT’23 (En-De and Zh-En), a recent test set commonly used in the MT literature. They could have used WMT’24, but I don’t think this is problematic. My main concern is the lack of human evaluation. The paper relies entirely on automatic evaluation metrics (Metric-X, COMET, and BLEURT). In particular, the authors apply TWA (their method) using MQM data from WMT’20-21. This data is used to train MetricX-23, which they also use for evaluation (this is acknowledged in Section 5.4).
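For readers who want to reproduce this style of reference-based automatic evaluation, here is a minimal sketch using the open-source COMET package; the checkpoint choice and the example data are illustrative placeholders, not necessarily the configuration used in the paper.

```python
# Minimal sketch of reference-based automatic MT evaluation with COMET.
# Assumes `pip install unbabel-comet`; the checkpoint and example data are
# illustrative, not the exact setup used in the paper under review.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed checkpoint choice
model = load_from_checkpoint(model_path)

data = [
    {"src": "如果我们这样做，枪击事件就会减少。",
     "mt": "If we did, we'd see these mass shootings go down.",
     "ref": "If we did that, these mass shootings would decrease."},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```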

Theoretical Claims

NA.

Experimental Design and Analysis

I checked the experiments carefully and appreciated the statistical significance tests. I’ll summarize my main concerns below:

(1) Evaluation. See my comment above about Metric-X. In addition, the gains in BLEURT are small, making it unclear whether the improvements are meaningful. Table 3 shows TWA and other baselines within the same quality cluster in some cases.

(2) The baselines appear too weak. Tables 1 and 2 show their performance against the systems submitted to WMT’20-21. The baselines are worse than the average, and the authors do not even consider models submitted to the shared tasks that followed (WMT’22-24). Is there any reason to start from these baseline models?

(3) The authors use an encoder-decoder architecture despite the evidence that decoder-only models perform better. Can you please clarify this decision? Please check Kocmi et al., 2024 (Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet).

(4) The authors focus only on two (high-resource) language pairs, even though MQM annotations exist for others. Was there a reason to test only two?

Supplementary Material

I checked Appendices A-D and found them helpful.

(5) No code was submitted, making reproducibility more difficult. Are you planning to release the source code?

Relation to Prior Work

The paper does a good job covering existing MT literature but lacks a discussion of related work that also uses feedback to improve MT (this idea is not new at all). For instance, the paper does not properly discuss a large body of work on improving MT through reranking at inference time (e.g., using quality estimation metrics) or finetuning. The main novelty here is applying these ideas specifically to MT span annotations (and, most importantly, using these annotations directly without the need for training new “quality estimation” models). Considering the broader impact of this work, it’s unclear how to apply these ideas outside of MT, where human annotations are scarce (but I don’t think this is a problem this paper needs to solve).

Missing Important References

  • Welleck et al. 2019 (Neural Text Generation With Unlikelihood Training), which introduces the unlikelihood loss mentioned and used in this paper (transcribed in the sketch after this list).
  • Ramos et al., 2024 (Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings). In the introduction, the authors mention that “While MQM data has previously been used to develop auxiliary reward or metrics models (Juraska et al., 2023; Rei et al., 2022), it has not been directly employed for training machine translation (MT) models.” There’s work using MQM annotations for training MT models, though. For instance, Ramos et al. (2024) use xCOMET (trained on MQM data) to provide fine-grained feedback during training. I believe this should be mentioned in the Related Work section.
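For context, the token-level unlikelihood loss of Welleck et al. (2019) referenced in the first item can be written as follows; this is our transcription in standard notation, where the negative candidate set at step t would presumably correspond, in TWA's setting, to tokens inside annotated error spans.

```latex
% Token-level unlikelihood loss of Welleck et al. (2019);
% \mathcal{C}^{t} is the set of negative candidate tokens at decoding step t.
\mathcal{L}_{\mathrm{UL}}^{t}
  = -\sum_{c \in \mathcal{C}^{t}} \log\bigl(1 - p_{\theta}(c \mid x_{<t})\bigr)
```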

Other Strengths and Weaknesses

Strengths: The paper provides a simple approach for leveraging span-level annotations to improve machine translation. This is well-motivated in the introduction.

Weaknesses: See my comments above about weak baselines and evaluation.

Other Comments or Suggestions

Minor comments:

  • I think Appendix D is not mentioned in the main paper.
  • It would be good to add numbers to the equations.

Update after the rebuttal: I updated my score to 3.

Author Response

Thanks for your review. Responding to your concerns below:

[Lack of human evaluation]:

While we agree that human evaluation would strengthen the results, we do show improvements across multiple automated quality metrics, including a held-out metric that was not used during any part of model selection, and we also provide sample translations in the appendix.

[Evaluation. The MQM data used to train the MT models is also used to train Metric-X. In addition, the gains in BLEURT are small, making it unclear whether the improvements are meaningful. Table 3 shows TWA and other baselines within the same quality cluster in some cases.]:

We believe it is important to include Metric-X as a quality metric specifically because it is trained to be sensitive to the information found in MQM data, which is not necessarily the case for other quality metrics. We note that the MQM data has not been used in either COMET or BLEURT. BLEURT gains are likely smaller because it is a completely held-out quality metric (i.e., we do not consider it at all during model checkpoint selection), and the fact that TWA consistently remains in the top cluster, oftentimes alone, is especially promising in this context.

[Baseline models appear weak. Is there any reason to start from these baseline models?]:

We chose to use the baseline models we did to have full control over the entire pipeline, reason about all the data seen by the model, and allow for full transparency. We additionally compared TWA with baseline methods on a production model and found on internal evaluation sets that it indeed outperformed baselines.

[The authors use an encoder-decoder architecture despite the evidence that decoder-only models perform better. ]:

We use an encoder-decoder architecture commonly used for MT given its superiority with respect to prediction latency. Moreover, moderate-sized decoder-only models often underperform relative to encoder-decoder MT models [1, 2].

[The authors focus only on two (high-resource) language pairs, even though MQM annotations exist for others.]:

We specifically chose these language pairs because MQM annotations already exist for them. In WMT 20-21, only three language pairs contain both MQM annotations and test sets. Our current experiments use two of these three pairs, and we have added the results on the third (en->zh) below. Here, TWA once again significantly outperforms the other baselines.

                 MetricX   COMET   BLEURT
TWA                2.342   0.517    0.696
DPO                2.349   0.503    0.693
SFT                2.470   0.504    0.692
-----------------------------------------
TWA_with_refs      2.324   0.521    0.697
DPO_with_refs      2.325   0.510    0.694
SFT_with_refs      2.409   0.509    0.694
SFT_filter         2.373   0.513    0.693

[Are you planning to release the source code?]:

We are in the process of obtaining approval to release the code for the camera-ready version of the paper.

[Missing related work]:

Thanks for the feedback. We have added a section in the related work about work on improving MT through reranking at inference time (e.g., using quality estimation metrics) or finetuning, highlighting the reviewer’s point that the main novelty is using annotations directly without training new quality estimation models. We have also added the additional references.

Thank you for your review. In light of our response and additional results, we hope you'll consider raising your score.

[1] https://arxiv.org/pdf/2401.08417

[2] https://arxiv.org/abs/2202.00528

Review
3

This paper focuses on improving machine translation models by leveraging span-level error annotations. It proposes a new algorithm called Training with Annotations (TWA). The core idea of TWA is to apply a weighted span-level unlikelihood loss to error spans, encouraging the model to learn which tokens to penalize. In experiments, the authors pre-trained the base model on WMT’23 data and fine-tuned it with MQM data from WMT’20 and WMT’21. Experimental results on English-German and Chinese-English machine translation show that TWA outperforms baseline methods such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Ablation studies verify the effectiveness of each component of TWA, and the unlikelihood loss used in TWA is shown to be more effective than a negative likelihood loss.
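To make the summary above concrete, here is a rough sketch of what a weighted span-level unlikelihood objective could look like; the per-token labeling scheme, the weighting, and the tensor handling are our assumptions for illustration, not the paper's exact implementation.

```python
# Rough sketch of a weighted span-level unlikelihood loss. Assumes per-token
# labels: +1 for tokens in non-error spans (positive signal), -1 for tokens in
# annotated error spans (penalized), 0 for ignored tokens (e.g., off-trajectory
# tokens after an error). This illustrates the idea, not the paper's exact code.
import torch
import torch.nn.functional as F

def twa_loss(logits, targets, token_labels, ul_weight=1.0):
    # logits: (batch, seq_len, vocab); targets, token_labels: (batch, seq_len)
    log_probs = F.log_softmax(logits, dim=-1)
    tgt_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(y_t | y_<t, x)

    like_mask = (token_labels == 1).float()     # tokens kept as positive signal
    unlike_mask = (token_labels == -1).float()  # error-span tokens to push down

    nll = -(tgt_logp * like_mask).sum()
    # Unlikelihood term: -log(1 - p) for tokens inside error spans.
    p = tgt_logp.exp().clamp(max=1 - 1e-6)
    ul = -(torch.log1p(-p) * unlike_mask).sum()

    denom = like_mask.sum() + unlike_mask.sum() + 1e-8
    return (nll + ul_weight * ul) / denom
```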

Questions for Authors

A. Experiment

  1. “... with around 28,000 and 31,000 submission outputs for En→De and Zh→En, respectively”: what are submissions?

  2. “We pretrain En→De and Zh→En models using the parallel WMT’23 training data (Kocmi et al., 2023), which consists of 296 million sentence-level examples.” I am not sure whether the two directions use one multilingual dataset (296 million samples) to train one model or two bilingual datasets to train two models.

  3. “SFT on the MQM annotated data is analogous to distilling the outputs of other MT systems, without taking into account the annotations.” What are the other MT systems? A fair setting would be to perform data distillation with the base model (self-training).

  4. “we obtain MQM annotations of the base model’s translations and run TWA with this annotated data.” How are the MQM annotations obtained?

  5. Why not use a large language model-based translation system as a baseline?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Some experimental details need to be further clarified:

  1. “... with around 28,000 and 31,000 submission outputs for En→De and Zh→En, respectively”: what are submissions?

  2. “We pretrain En→De and Zh→En models using the parallel WMT’23 training data (Kocmi et al., 2023), which consists of 296 million sentence-level examples.” I am not sure whether the two directions use one multilingual dataset (296 million samples) to train one model or two bilingual datasets to train two models.

  3. “SFT on the MQM annotated data is analogous to distilling the outputs of other MT systems, without taking into account the annotations.” What are the other MT systems? A fair setting would be to perform data distillation with the base model (self-training).

  4. “we obtain MQM annotations of the base model’s translations and run TWA with this annotated data.” How are the MQM annotations obtained?

Supplementary Material

N/A

Relation to Prior Work

As described in Section 7, the proposed TWA can be applied to large language models in general.

Missing Important References

Utilizing MQM data in Related Work:

  1. ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback, Findings of the Association for Computational Linguistics: EMNLP 2023

  2. Teaching Large Language Models to Translate with Comparison, AAAI 2024

Other Strengths and Weaknesses

The paper is well-written and easy to follow. The fine-grained annotation method proposed in the paper is worth further discussion. The authors' design of TWA is very novel (e.g., non-error tokens following an error span are ignored as they are off-trajectory), which is very similar to process supervision methods. These methods may also have some positive effects on the logical reasoning of large language models.
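As an illustration of the off-trajectory rule mentioned above, the following hypothetical helper derives per-token labels from span-level annotations; the exact rules (penalize error-span tokens, keep preceding tokens, ignore everything after the first error) are our reading of the reviewer's summary, not the paper's precise procedure.

```python
# Hypothetical derivation of per-token labels from span-level annotations:
# -1 for tokens inside error spans, +1 for tokens before the first error,
# 0 for non-error tokens after the first error span (treated as off-trajectory).
def token_labels_from_spans(num_tokens, error_spans):
    """error_spans: list of (start, end) token indices, end exclusive."""
    labels = [1] * num_tokens
    if not error_spans:
        return labels
    spans = sorted(error_spans)
    first_error_start = spans[0][0]
    for start, end in spans:
        for i in range(start, min(end, num_tokens)):
            labels[i] = -1
    # Ignore non-error tokens that come after the first error span.
    for i in range(first_error_start, num_tokens):
        if labels[i] != -1:
            labels[i] = 0
    return labels

# Example: 10 tokens, one error span covering tokens 4-6.
print(token_labels_from_spans(10, [(4, 7)]))
# -> [1, 1, 1, 1, -1, -1, -1, 0, 0, 0]
```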

Other Comments or Suggestions

N/A

Author Response

Clarifying experimental details and questions:

  1. Submissions are the machine translations from models entered into the WMT competition.
  2. We pre-trained two base models using a different bilingual dataset for each.
  3. The other MT systems are the specific models entered into the WMT competition (listed in [1]). The difference between SFT and TWA on these models’ outputs is the difference between pure distillation of offline data and smarter distillation that takes advantage of fine-grained annotations. We also consider fine-tuning on annotations of online data in Section 6.5, showing that we can improve model quality this way as well.
  4. We ran a human evaluation, following [2].
  5. We chose to use the baseline models we did to have full control over the entire pipeline, reason about all the data seen by the model, and allow for full transparency (i.e. using standard publicly available WMT data). We additionally compared TWA with baseline methods on a production model and found on internal evaluation sets that it indeed outperformed baselines.

Also, thanks for the additional references! We've added these into the paper under other works that have considered the use of MQM data.

Thank you for your review. Please let us know if you have any additional questions; otherwise, would you be willing to consider raising your score?

[1] https://aclanthology.org/2021.wmt-1.1.pdf

[2] https://arxiv.org/abs/2104.14478

Review
4

This paper develops a simple finetuning algorithm, called Training with Annotations (TWA), to directly train machine translation models on span-level annotated data. TWA utilizes targeted span-level error information while also flexibly learning what to penalize within a span. Moreover, TWA considers the overall trajectory of a sequence when deciding which non-error spans to utilize as positive signals. Experiments on English-German and Chinese-English machine translation show that TWA outperforms baselines such as supervised finetuning on sequences filtered for quality and Direct Preference Optimization on pairs constructed from the same data.
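For comparison with the DPO baseline mentioned in this summary, one plausible way to turn the same MQM-annotated submissions into preference pairs is sketched below; the severity weights and pairing rule are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical construction of DPO preference pairs from MQM-annotated
# submissions: for each source sentence, pair the translation with the lowest
# MQM penalty against the one with the highest. Severity weights and the
# pairing rule are illustrative assumptions.
from collections import defaultdict

SEVERITY_PENALTY = {"minor": 1.0, "major": 5.0}  # assumed MQM-style weights

def mqm_penalty(annotations):
    return sum(SEVERITY_PENALTY.get(a["severity"], 0.0) for a in annotations)

def build_dpo_pairs(examples):
    """examples: list of dicts with keys 'src', 'mt', 'annotations'."""
    by_src = defaultdict(list)
    for ex in examples:
        by_src[ex["src"]].append((mqm_penalty(ex["annotations"]), ex["mt"]))
    pairs = []
    for src, scored in by_src.items():
        scored.sort()  # ascending penalty: best translation first
        if len(scored) >= 2 and scored[0][0] < scored[-1][0]:
            pairs.append({"prompt": src,
                          "chosen": scored[0][1],
                          "rejected": scored[-1][1]})
    return pairs
```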

Questions for Authors

No

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Yes

Supplementary Material

Yes, all.

Relation to Prior Work

Language models have advanced to the point where it is often difficult to improve them substantially via supervised finetuning on high-quality human-written examples alone; instead, recent efforts to improve language model or sequence-to-sequence model performance have largely relied on annotations of model generations, from preferences to per-sequence scores. Such data, coupled with techniques to learn from it, have yielded impressive results for many top language models. Most efforts, however, consider only sequence-level labels, usually in the form of a scalar score assigned to the entire output. In contrast, this work investigates the potential of using fine-grained span-level annotations from offline datasets to enhance language model training. Unlike sequence-level annotations, span-level annotations provide information about specific segments within a sequence, offering more detailed information for model learning. Moreover, in many situations, collecting fine-grained information requires effort similar to collecting sequence-level labels, making the former a practical form of data for improving model performance given a method that can take advantage of the information.

Missing Important References

No

Other Strengths and Weaknesses

This paper is well-written and the idea is novel. Specifically, the development of the Training with Annotations (TWA) algorithm is a key contribution. TWA's ability to utilize targeted span-level error information and flexibly learn what to penalize within a span is innovative. This fine-grained control over the training process has the potential to lead to more accurate models. Additionally, the consideration of the overall sequence trajectory when using non-error spans as positive signals shows a comprehensive understanding of how sequences should be modeled.

Other Comments or Suggestions

No

Author Response

Thank you for your positive review! We appreciate your compliment about the novelty of the idea and the feedback that the paper is well-written. We agree with how you related our work to the broader literature as well.

Review
4

This paper explores a new approach to fine-tuning machine translation models by utilizing fine-grained span-level annotations for further quality improvement. Previous work mostly focuses on sequence-level annotations, while this work takes advantage of more fine-grained span-level annotations. The authors carefully designed the experimental settings and conducted preliminary experiments to explore the best setting for a weighted span-level unlikelihood loss.

The idea is straightforward. Experimental results on two machine translation tasks are technically sound, highlighting the effectiveness of the proposed approach utilizing span-level annotations.

Questions for Authors

Please see the comments/questions in the sections above.

Claims and Evidence

Figure 2 shows how their proposed approach works for the span-level annotations. Results also highlight the effectiveness of the proposed span-level approach against the conventional sequence-level approach (DPO).

The authors conducted experiments on two machine translation tasks, the English-to-German and Chinese-to-English directions. For both directions, their approach shows consistent improvement. With results for a few more directions, the paper would be more experimentally convincing. It would also be interesting to apply this approach in a domain adaptation scenario requiring specific terminology selection. This would be another area where the proposed span-level approach might work well. To be clear, I am not asking for additional experiments in this direction.

Methods and Evaluation Criteria

The carefully designed experiments and their results are technically sound. The systems are evaluated with different metrics, and the proposed approach shows the best performance in most cases.

Theoretical Claims

N/A

Experimental Design and Analysis

Have you checked n-gram match metrics such as sacreBLEU [1]? Since COMET scores reflect semantics rather than n-gram match, the sacreBLEU score is another suitable metric to assess the effectiveness of your span-level approach. It would capture the span-level corrections if the approach works as expected.

Can you give a few examples of ignoring off-trajectory tokens? When you look at those examples, are there any patterns that lead to the improvements in the semantic evaluation metrics (e.g., fluency)?

[1] Post. "A Call for Clarity in Reporting BLEU Scores". In Proc. of WMT 2018.

Supplementary Material

Regarding Appendix D (sample translations), can you please add more description or analysis of how to interpret the different translation outputs? We would like to see the baseline's output before applying TWA, SFT, and DPO, and to understand which parts are changed via the span-level corrections of the proposed approach.

Relation to Prior Work

The proposed approach of span-level error annotations could be applicable to other text generation tasks, with good potential impact, considering that most research so far has focused on the sequence level. The analyses could be improved by reporting any other trends (detected via n-gram match) or by highlighting the span-level corrected parts in comparison to the baseline's outputs.

Missing Important References

In addition to DPO, you could also cite some more relevant work:

  • Meng et al., "SimPO: Simple Preference Optimization with a Reference-Free Reward". In Proc of NeurIPS 2024.
  • Xu et al., "Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation" In Proc of ICML 2024.
  • Hong et al., "ORPO: Monolithic Preference Optimization without Reference Model" In Proc of EMNLP 2024.
  • Xu et al., "X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale". In Proc of ICLR 2025.

Other Strengths and Weaknesses

Please see the comments in the sections above.

Other Comments or Suggestions

Please see the comments in the sections above.

Ethics Review Concerns

N/A

Author Response

Thank you for your review! Please see our response below:

[sacreBLEU ...]:

Good question. To compare the effectiveness of TWA in using targeted negative information, we compare the BLEU score of the model after TWA on submissions and references versus the model after DPO on submissions and references. Please see the table below. TWA has a much higher BLEU score, indicating that its use of negative information is more precise.

                      En→De BLEU   Zh→En BLEU
TWA_subs_and_refs          34.24        25.30
DPO_subs_and_refs          30.71        15.66
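BLEU numbers like those above can be computed with the sacrebleu package; a minimal sketch with placeholder sentences (not the paper's test data) is shown below.

```python
# Minimal sketch of corpus-level BLEU with sacreBLEU (pip install sacrebleu).
# The hypotheses and references are placeholders, not the paper's test sets.
import sacrebleu

hypotheses = ["Wenn wir das täten, würden diese Massenschießereien abflauen."]
references = [["Wenn wir das täten, würden diese Massenschießereien zurückgehen."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU
print(bleu)        # formatted summary string
```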

[Ignoring off-trajectory tokens ...]:

We have pasted below a few examples of decoded test translations from the model trained to ignore off-trajectory tokens versus the model trained including off-trajectory tokens. We have also added these to the paper. We additionally asked Gemini 2.5 Pro to compare the two; when we pass it the csv of all En->De translations from the two settings (ignoring vs. including off-trajectory tokens), it states that the translations from ignoring off-trajectory tokens are “slightly more fluent, natural-sounding, and sometimes grammatically refined German translations,” compared to translations that are sometimes “literal or slightly more awkward in its phrasing or word choice.”

A)

Source: If we did, we'd see these mass gun shootings go down.
TWA: Wenn wir das täten, würden wir sehen, wie diese Massenschießereien untergehen.
TWA(ignore): Wenn wir das täten, würden wir sehen, wie diese Massenschießereien abflauen.
LLM preference: The second translation (TWA(ignore)) is better because "abflauen" more accurately conveys the idea of a decline or reduction in frequency, which aligns with the intended meaning of "go down" in this context. The first translation (TWA) uses "untergehen," which typically means "to perish" or "to go under," making it sound like the shootings themselves are disappearing in a more absolute or even unnatural way rather than decreasing in occurrence.

B)

Source: We are still looking for ways to improve our services and this will be noted as a feedback to one of our valued customer.
TWA: Wir suchen immer noch nach Möglichkeiten, unsere Dienstleistungen zu verbessern, und dies wird als Feedback an einen unserer geschätzten Kunden gemerkt.
TWA(ignore): Wir suchen immer noch nach Möglichkeiten, unsere Dienstleistungen zu verbessern, und dies wird als Feedback an einen unserer geschätzten Kunden gewertet.
LLM preference: The second translation (TWA(ignore)) is better because "gewertet" (evaluated/considered) more naturally conveys the idea that the feedback is being taken into account or acknowledged. In contrast, "gemerkt" (noted/remembered) sounds more like someone is simply taking a mental note, which is less formal and may not fully capture the intended meaning of acting on the feedback.

C)

Source: I'll share a couple of steps to perform into your eReader, okay?
TWA: Ich werde ein paar Schritte in Ihrem eReader ausführen, okay?
TWA(ignore): Ich gebe Ihnen ein paar Schritte mit, die Sie in Ihrem eReader ausführen können, okay?
LLM preference: The second translation (TWA(ignore)) is better because "Ich gebe Ihnen ein paar Schritte mit, die Sie in Ihrem eReader ausführen können" correctly conveys that the speaker is providing steps for the listener to follow. The first translation (TWA) "Ich werde ein paar Schritte in Ihrem eReader ausführen" incorrectly implies that the speaker themselves will perform the steps directly on the listener’s eReader, which changes the meaning of the sentence.
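For readers curious about the pairwise comparison setup described above, here is a minimal sketch of an LLM-judge call using the google-generativeai client; the model identifier, prompt wording, and overall setup are assumptions rather than the authors' exact procedure.

```python
# Hypothetical pairwise LLM-judge call, loosely mirroring the comparison
# described above. Model name, prompt, and setup are assumptions, not the
# authors' exact configuration. Requires `pip install google-generativeai`
# and an API key in the GOOGLE_API_KEY environment variable.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.5-pro")  # assumed model identifier

def compare_translations(source, translation_a, translation_b):
    prompt = (
        "You are a professional English-German translation evaluator.\n"
        f"Source: {source}\n"
        f"Translation A: {translation_a}\n"
        f"Translation B: {translation_b}\n"
        "Which translation is better, A or B? "
        "Answer with the letter and a one-sentence reason."
    )
    return judge.generate_content(prompt).text

print(compare_translations(
    "If we did, we'd see these mass gun shootings go down.",
    "Wenn wir das täten, würden wir sehen, wie diese Massenschießereien untergehen.",
    "Wenn wir das täten, würden wir sehen, wie diese Massenschießereien abflauen.",
))
```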

[Regarding D Sample translations ...]:

We have added the baseline model’s translations to each of the outputs in Appendix D. We additionally asked Gemini 2.5 Pro to summarize the differences between all the outputs, rather than just the few we pasted into the appendix; here is its summary:

  • Literalness vs. fluency: There's a clear spectrum. TWA, SFT, and Base tend towards more literal translations, preserving source structure. DPO aims for higher fluency, sometimes sacrificing strict fidelity or introducing errors.
  • Reliability: TWA and SFT are the most reliable, rarely producing complete nonsense. Base and DPO are significantly less reliable, with instances of hallucination or total failure.
  • Technical data handling: Translating highly specific technical standards is challenging for all models, but TWA and SFT maintain better accuracy and detail compared to Base and DPO. TWA demonstrates an edge in consistently rendering the most intricate technical specifications with higher fidelity compared to SFT, which occasionally displays deviations in complex cases.

[Additional related work]:

Thanks for these additional works. We have added them to the related work, presenting them as alternatives to DPO which also contrast pairs of responses.

Thank you again for your review. If you find our response satisfactory, would you be willing to consider raising your score?

Final Decision

This paper explores a new approach for fine-tuning language models by utilizing fine-grained span-level annotations, as opposed to focusing on sequence-level annotations. While the approach is relatively straightforward, it is quite effective and shows consistent improvements on two machine translation tasks. While the reviewers are generally satisfied with the paper, they note several weaknesses, such as evaluating on only two language pairs, concerns with the baselines, and limitations in the analysis (e.g., reviewers would like to see the baseline's output before applying TWA, SFT, and DPO). Finally, while the motivation of the work applies to language models in general, as highlighted in the introduction, the work is only evaluated on MT.

Given the reviewers’ overall positive sentiment and that no major concerns remain after discussions, I recommend accepting the paper, and I encourage the authors to add a third language pair in the final version of the paper (en->zh as included in the rebuttal).