ICLR 2024 · Spotlight · 3 reviewers
Average rating: 7.3/10 (ratings: 6, 8, 8; min 6, max 8, std dev 0.9)
Average confidence: 4.3

A Benchmark for Learning to Translate a New Language from One Grammar Book

Submitted: 2023-09-24 · Updated: 2024-03-15

Abstract

Keywords
low-resource languages, indigenous languages, endangered languages, long context, field linguistics, unseen tasks, large language models, machine translation, benchmark

Reviews and Discussion

Official Review (Rating: 6)

This paper investigates how effectively LLMs adapt to a task which is guaranteed to have no overlap with LLM training data. The task is translation between English and Kalamang, an endangered language with little to no online presence. The authors introduce a new dataset, MTOB (Machine Translation from One Book), which contains (1) a linguistic analysis of the Kalamang language, (2) a bilingual dictionary, and (3) a small English-Kalamang parallel corpus. They benchmark several LLMs on the translation task, experimenting with different in-context learning settings. They find that the bilingual dictionary and parallel corpus enable some translation capability, and that providing the grammar book in a long context window leads to notable performance gains. However, no LLM outperforms the human baseline.
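
As a minimal illustration of the in-context settings described above (the prompt wording and the `complete` callable are assumptions for illustration, not the paper's exact implementation):

```python
# Sketch of the "wordlist + sentences + book passages" in-context condition.
# Prompt wording and the `complete` callable are illustrative assumptions.

def build_prompt(source, wordlist_entries, parallel_examples, book_passages):
    """Assemble a translation prompt from retrieved reference material."""
    parts = ["Translate the following sentence from English to Kalamang."]
    parts.append("Relevant dictionary entries:\n" + "\n".join(wordlist_entries))
    parts.append("Example translations:\n" + "\n".join(
        f"English: {en}\nKalamang: {kgv}" for en, kgv in parallel_examples))
    parts.append("Relevant grammar book passages:\n" + "\n".join(book_passages))
    parts.append(f"English: {source}\nKalamang:")
    return "\n\n".join(parts)

# `complete` stands in for any LLM completion API:
# prediction = complete(build_prompt(src, entries, examples, passages))
```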

优点

This work is a solid scientific investigation into an unexplored topic. The paper is excellently written - clear, well organised, and engaging.

-- 1. Novel benchmark --

MTOB is a unique benchmark that enables interesting experiments. It would undoubtedly be a useful resource for future work, as it offers an alternative to the current paradigm of raw text training and highly structured fine-tuning. The idea of using it to mimic second language learning is interesting.

-- 2. New experimental ideas --

The paper introduces a few new ideas in its experimental framework with the aim of testing generalization capabilities beyond the training set.

  1. Using content guaranteed to not be on the web.
  2. Testing model knowledge on the task before training to check for potential train-test overlap.
  3. The motivation of testing crystallised intelligence, as opposed to fluid intelligence.

These are all ideas that could find use in other contexts/domains.

-- 3. Interesting findings --

The experiments reported reveal some insightful findings. For example, the failure of traditional finetuning in this setting is interesting, and so is the difficulty of incorporating grammar knowledge into smaller context LLMs.

-- 4. Handling of ethical considerations --

The authors actively engage with ethical considerations around working with an endangered language. Their approach in working with the Kalamang language community is a model for how other NLP researchers should proceed.

缺点

While the topic of interest and proposed dataset are novel contributions, my main concern with the paper is the lack of innovation in terms of modelling and evaluation. Furthermore, I believe more should be done to prove that this task is truly distinct from standard extremely low-resource translation. This could be shown empirically through experiments comparing Kalamang translation to translation involving other (somewhat online) extremely low-resource languages.

-- 1. Narrow modelling comparisons --

The experiments would be improved by comparing other types of models besides recent LLMs. Sequence-to-sequence PLMs like mT5 have been shown to perform well on low-resource MT (https://aclanthology.org/2022.naacl-main.223.pdf).

Furthermore, the nature of MTOB could be leveraged by more specialised neural architectures, such as neural MT models that incorporate bilingual dictionaries (https://aclanthology.org/2021.acl-long.382/, https://aclanthology.org/2020.acl-main.143.pdf). An analysis spanning different models would reveal more about the true difficulty of the task. There is a growing literature evaluating LLMs for low-resource MT (e.g. https://arxiv.org/pdf/2309.07423v1.pdf) and so far it seems that they fall short of massively multilingual NMT models, which could just be because they are trained/tuned for very different types of tasks.

-- 2. Insufficient interpretable evaluation --

While some qualitative examples are provided in the appendix, the paper would be strengthened by more such analysis. Could error types across models / in-context settings be quantified to some extent? More generally, since MTOB is framed as a unique benchmark, I would expect some interesting findings from a more nuanced evaluation framework (along the lines of page 7, paragraph 3 “In contrast, the grammar book…”). Such analysis could help motivate why MTOB is unique (e.g. if it is found that LLMs make certain types of errors on MTOB that they do not make on other tasks).

-- 3. Lack of comparison to extremely low-resource MT --

While any knowledge of the Kalamang language is new to the LLMs, the task itself (translation) is well known to LLMs given the wide availability of parallel corpora online and the popularity of machine translation as a task. The authors discuss this distinction themselves (fluid vs crystallised intelligence). However, there is some doubt as to how different English to Kalamang translation is in terms of task difficulty, compared to translation involving other extremely low-resource languages that have a limited online presence. This would call into question the value of MTOB as an NLP resource, as currently claimed in the paper.

For example, it seems that for some extremely low-resource languages LLMs have basically no translation capabilities (see some of the zero-shot experiments here https://arxiv.org/pdf/2309.07423v1.pdf), even though these languages are included in publicly available test sets. The paper would be improved through some comparison of the MTOB task with extremely low-resource MT, e.g. qualitative differences in model performance or proof of data contamination even for extremely low-resource languages.

问题

  1. Was any part of the translation train/test set previously released along with the Kalamang grammar book?
  2. Did you test any other type of baselines (e.g. sequence-to-sequence models) on the translation task?
  3. Have you considered using chrF++ as your primary metric since it (arguably) enhances automatic evaluation?
Comment

Thank you for reading our paper and for your constructive feedback! We have improved our paper based on your feedback.

With regard to the strengths:

  • Re Interesting findings: For us, the most interesting result was that Claude 2 with long context did so well when given 100k tokens of the book. This is not something that we expected given recent work that is somewhat negative about long context LLMs (e.g. Lost in the Middle; Liu et al. 2023 -- https://arxiv.org/abs/2307.03172).

With regard to the weaknesses:

  • We agree that we should have included a traditional machine translation method as a baseline. We have run baselines finetuned on the ~1100 available parallel sentences and included full details along with a table of results in the general response to all reviewers. We also agree that future work should explore other kinds of architectures/LLMs finetuned specifically for machine translation; we think an interesting research question is how to most effectively combine traditional massively multilingual machine translation data with instruction tuning/in-context learning capabilities. As you point out with the Robinson et al. paper, even though LLMs are trained on both kinds of data, the MT capabilities don’t always make it through.
  • We agree that more interpretable evaluation is always better. Unfortunately analyzing the results in this way, especially across different kinds of retrieved context, is very time-consuming and error-prone since there is a lot of context retrieved for each example (as in Figures 7 and 8, let alone the 50K and 100K chunks). It also seems safe to say that we can’t automate this analysis using current LLMs, since they are the ones generating the outputs with errors. We think that the qualitative examples and analysis in the paper give a good sampling of the kinds of errors that the models and retrieval methods make, and the chrF scores match our subjective impressions of output quality well. The fact that the errors are qualitatively similar to hallucinations in LLMs more generally and that there are fewer errors with better LLMs—despite the exclusion from pretraining data—is evidence that in-context learning in LLMs is really improving and underscores the value of MTOB.
  • We have conducted a comparison to low-resource MT and included the results in the response to all reviewers. But we will also clarify: we do not think that we claim in the paper that English <--> Kalamang translation is more difficult than translation with other low-resource languages. (Certainly, given equal amounts of data, we have no reason to think this would be the case.) Rather, we were trying to say the following: a) We can be confident there isn’t a mass of in-domain training data hidden in the pretraining set, unlike languages with more speakers, so we can test data efficiency in a principled way. The fact that there are only ~1600 parallel sentences means that yes, it is probably harder from a traditional NMT perspective, but b) We have the model learn from explanations in a grammar book to make the task easier, and c) We show that the task is actually possible to a great extent using the human baseline. So in some sense we are saying our task framing is actually easier than traditional low-resource MT from parallel sentences, because we have reason to believe that the task is possible.

With regard to the questions you asked:

  1. Yes, we mention in the expanded Limitations section in the Appendix (page 22 under “Potential Contamination”) that the train/test parallel sentences are in a CSV file on GitHub as of February 2021, as part of the source code for Dictionaria. (And we mentioned in the introduction the steps we are taking to prevent additional contamination due to the release of this paper.) It is possible that LLMs were trained on it, or maybe it was filtered out because the CSV file is very messy. Regardless, we saw no signs that the model baseline outputs were memorized; they changed very reasonably with different kinds of provided context.
  2. See the new baseline mentioned above that we ran in response to your feedback.
  3. Thank you for the suggestion. We have re-evaluated all results using chrF++ and we will add the results to the paper; see the included numbers in the table in the general response to all reviewers. We find that the relative chrF++ results are very similar to the chrF results, with a slightly bigger gap between the model-based methods and the human baseline.
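
For reference, the two metrics differ only in whether word n-grams are counted in addition to character n-grams; a minimal sacrebleu sketch follows (the example strings are made up):

```python
# chrF counts character n-grams only; chrF++ additionally counts word 1- and
# 2-grams (word_order=2). The example strings below are made up.
from sacrebleu.metrics import CHRF

hypotheses = ["the man went to the garden"]
references = [["the man walked to the garden"]]  # one reference stream

chrf = CHRF()                  # chrF (char_order=6, word_order=0)
chrf_pp = CHRF(word_order=2)   # chrF++

print(chrf.corpus_score(hypotheses, references).score)
print(chrf_pp.corpus_score(hypotheses, references).score)
```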
Comment

Thank you for responding to my queries and for reporting your new results, which will definitely add value to the paper. I have increased my overall rating and contribution score.

I still feel that, given the unique experimental opportunity presented by the new dataset, the work would have been strengthened by exploring innovation in modelling and/or evaluation.

Nevertheless, many researchers in computational linguistics would be interested in the resources and findings of this paper. Furthermore, the benchmarking study is carried out excellently and the paper reads easily thanks to the great writing.

Official Review (Rating: 8)

The work targets the OOD problem and the difficulty of assessing it during training, especially when pre-trained models have seen very large swathes of the open internet. Picking a domain with example scenarios that are rare on the web helps, and the authors selected a low-resource language with limited web presence. They create a benchmark that validates a model trained on very few samples - in this case, very high-quality grammar content - to demonstrate its value.

优点

The paper is very well presented and the core hypothesis is both clear and verified. Additionally, the paper does a good job of calling out the limitations of the work and future enhancements. The work itself feels motivated by a sincere desire to help communities who are disadvantaged due to their language being marginalized.

缺点

As the authors themselves list in limitations, though a specific dataset was chosen to enhance the model, at least part of the information might have leaked into the pre-training dataset, especially since there are other languages close to the source language which may have a larger presence.

The work would benefit from going beyond one sample to another language, especially since, as the authors state, there is a very high number of low-resource languages. This would make the hypothesis and solution more convincing. The work otherwise comes across as a focused effort to provide more access to a disadvantaged group, followed by an attempt at generalization.

问题

How confident are you that the solution would generalize to other low-resource scenarios?

There has been recent work (e.g. from Microsoft Research - the phi models) on using a high-quality but smaller sample set to train a high-quality model. How does your work compare to that?

Comment

Thank you for your positive review and helpful feedback on our paper!

With regard to the weaknesses you mentioned:

  • Two clarifications on leakage:
    • First, it is only the test set where avoiding contamination is critical. If the training set is included in pretraining data, we agree this is not ideal, but we can at least say that it was possible to learn to perform the task from this data. The main point is that by picking a language with so few speakers, we can be confident that there is not a mass of undetected in-domain training data scraped into pretraining datasets.
    • Second, in terms of related languages, we note in the Background section that Kalamang is not known to be closely related to other languages. In addition, the other languages in its family (which is just a geographical grouping, not genealogical) have similarly low speaker counts (100s-1000s).
  • We agree that it will be important to explore more languages in future work. That was out of scope for this paper due to the lengthy process of obtaining consent from the respective communities and performing the human baseline.

With regard to your questions:

  • Other than the fact that Kalamang can be tokenized conveniently, we see no reason why it should be easier or harder than other low-resource languages. We expect there to be more variation across the quality/comprehensiveness/presentation/length of different grammar books, which is an interesting topic for future work to explore. (What is the best way to organize grammar books for this application?)
  • We see Phi mostly as a distillation method; Phi 1.5 is mostly trained on synthetic data from a larger LLM (for 150B tokens), and ablations show that the synthetic data is key to the results. We expect that you could use an approach like Phi to produce a small Kalamang translation model: one could generate synthetic translation data using a large model with the reference materials provided in its long context, and distill that synthetic data into a much smaller task-specific model. But it is not a solution to the extreme data-efficiency requirements of this task (which we expect will need in-context learning).
Comment

Thanks for the detailed response. I retain my positive rating.

Official Review (Rating: 8)

The authors propose an interesting approach to translation: having an LLM directly learn a language (Kalamang) from raw linguistics reference materials. They feed these documents to various large language models, in a variety of different ways (via additional pretraining, by providing extra context in the prompt, etc.) and evaluate the model’s ability to translate to/from Kalamang. An additional baseline is also provided in the form of a human who learnt the language from the same reference materials.

The approach is interesting in that it constitutes a modern take on rule-based MT, where instead of providing a structured grammar to a heavily feature-engineered model, the hope is that the language model will learn itself to interpret the human-readable descriptions. It is very valuable as a way of evaluating LLM performance on a clearly measurable, task-oriented benchmark which has direct real-world applications (translation of under-resourced languages.)

优点

  • The authors propose a novel, clearly measurable and well-defined evaluation benchmark of LLM translation capabilities.
  • The experimental setup is solid, and there is an extensive selection of baselines. The addition of a human baseline is particularly appreciated as it helps put numbers in context.
  • The authors recognise the risk of the reference materials leaking into LLM training datasets, and take active steps to prevent it.
  • The work is conducted with the involvement and consent of the language community it concerns.

缺点

  • A minor wish would have been to see some additional baselines involving more standard MT approaches. If we're saying that this work has the potential of helping with the translation of under-resourced languages, it might be worth trying, amongst others: standard neural MT, trained on the little parallel data that's available + parallel data from related languages; traditional rule-based MT; neural MT trained on synthetic data generated from templates. I recognise that these approaches would likely be more involved than simply feeding data to an LLM, but they are all reasonable approaches that a researcher interested in building MT for Kalamang might try.

问题

  • How do you expect this approach to stack up against traditional MT (e.g. neural sequence-to-sequence MT with e.g. a transformer encoder/decoder architecture), when using techniques such as cross-lingual transfer from related languages, backtranslation, data mining, synthetic data augmentation?
Comment

Thank you for your close reading of our paper and for your helpful feedback!

With regard to the weakness you mentioned, we agree that we should have included a traditional machine translation method as a baseline. We have run this baseline and included full details along with a table of results in the general response to all reviewers.

We also agree that it would be interesting to evaluate more classical methods such as rule-based machine translation. We did not have time to design such a system in this short response period, but we will add a discussion of these methods to the paper. We think it would be especially interesting in future work to try to design a system combining rule-based translation and in-context learning from grammar books; we hope that our dataset can provide a robust means of evaluation in such cases.

With regard to your question, we expect that the benefit of data augmentation techniques scales with the quality of the model that generates the synthetic data. So if the number of available parallel sentences is small, the quality of synthetic data generated with a traditional finetuned NMT model will not be great, even with multilingual transfer etc. By contrast, if a long context LLM can learn to translate in context using a grammar book (and we know that humans can perform this task pretty well from the same data), then this could be used to generate much higher quality synthetic data to either self-improve or distill into a smaller model.
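
To make that pipeline concrete, here is a purely illustrative sketch; the `teacher_translate` callable stands in for a long-context LLM prompted with the grammar book, and nothing here is from the paper itself:

```python
# Illustrative sketch: LLM-based synthetic data generation for distillation.
# `teacher_translate` is an assumed callable wrapping a long-context LLM
# prompted with the grammar book; it is not the authors' implementation.

def synthesize_corpus(teacher_translate, monolingual_english):
    """Turn monolingual English sentences into synthetic parallel pairs."""
    return [(src, teacher_translate(src)) for src in monolingual_english]

# The resulting pairs could then finetune a small task-specific model, or be
# fed back as in-context examples for self-improvement.
```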

Comment

Thank you for following up on my remarks and for conducting the additional experiments with several NMT baselines. Having read the other reviews, I do not see any of the listed weaknesses as deal-breakers. I am happy to confirm my positive rating for this paper.

Comment

Dear All Reviewers,

Thank you very much for your positive reviews and insightful comments. We have responded to each of your comments individually and have posted a general response below.

The one additional baseline that both Reviewer M62r and Reviewer tEGh asked us to run was a comparison to standard neural machine translation using seq2seq models. We agree that this is an important baseline and we provide it here.

We finetuned a set of traditional machine translation models on our collection of 1134 parallel Kalamang-English sentences and compared their performance to our existing (in-context learning) baselines. Since Reviewer tEGh mentioned the mT5 model specifically, we evaluated both the mT5 and FLAN-T5 series of models. In addition, for the sake of comparison, we evaluated the FLAN-T5 model using our in-context approach (note that this is only possible with FLAN-T5 because only FLAN-T5 is instruction-tuned).
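
For concreteness, a minimal sketch of this kind of finetuning run with Hugging Face transformers follows; the hyperparameters and the toy example are illustrative assumptions, not our exact configuration:

```python
# Sketch of finetuning a seq2seq model on a small parallel corpus.
# Hyperparameters and the toy data below are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# In practice this would be the ~1134 Kalamang-English training pairs.
pairs = [{"src": "Translate English to Kalamang: the dog is barking",
          "tgt": "<kalamang translation>"}]
dataset = Dataset.from_list(pairs)

def preprocess(batch):
    # text_target tokenizes the labels with the same tokenizer.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(output_dir="kgv-mt", learning_rate=3e-4,
                                per_device_train_batch_size=8,
                                num_train_epochs=20,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tokenized,
                         data_collator=DataCollatorForSeq2Seq(tokenizer,
                                                              model=model))
trainer.train()
```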

We show the results of finetuning compared to in-context learning in the table here: (this response had to be split into 3 due to the OpenReview character limit)

Comment
| Model | Conditioning | chrF (K to E) | chrF++ (K to E) | chrF (E to K) | chrF++ (E to K) |
|---|---|---|---|---|---|
| mT5 Small (traditional) | Traditional MT Baseline | 15.8 | 13.4 | 3.7 | 2.8 |
| mT5 Base (traditional) | Traditional MT Baseline | 20.4 | 17.7 | 3.9 | 2.9 |
| mT5 Large (traditional) | Traditional MT Baseline | 23.9 | 20.7 | 19.8 | 17.3 |
| mT5 XL (traditional) | Traditional MT Baseline | 34.2 | 31.5 | 27.1 | 24.8 |
| FLAN-T5 Small (traditional) | Traditional MT Baseline | 23.4 | 20.5 | 18.3 | 15.5 |
| FLAN-T5 Base (traditional) | Traditional MT Baseline | 26.1 | 23.2 | 20.1 | 17.6 |
| FLAN-T5 Large (traditional) | Traditional MT Baseline | 30.8 | 28.2 | 25.4 | 22.4 |
| FLAN-T5 XL (traditional) | Traditional MT Baseline | 35.2 | 32.9 | 30.8 | 28.1 |
| FLAN-T5 Base | No Retrieval Baseline | 14.3 | 11.9 | 14.6 | 11.4 |
| FLAN-T5 Small | No Retrieval Baseline | 12.0 | 10.0 | 14.4 | 10.9 |
| FLAN-T5 XL | No Retrieval Baseline | 13.6 | 10.6 | 15.4 | 12.1 |
| FLAN-T5 XXL | No Retrieval Baseline | 14.3 | 11.1 | 15.7 | 12.3 |
| LLaMA-2 7B | No Retrieval Baseline | 13.5 | 11.2 | 11.3 | 9.0 |
| LLaMA-2 13B | No Retrieval Baseline | 14.7 | 12.2 | 12.9 | 10.5 |
| LLaMA-2 70B | No Retrieval Baseline | 16.3 | 13.5 | 15.4 | 12.3 |
| Claude 2.0 | No Retrieval Baseline | 15.6 | 12.6 | 13.0 | 10.6 |
| GPT 3 (davinci) | No Retrieval Baseline | 14.8 | 12.1 | 6.9 | 5.7 |
| GPT 4 | No Retrieval Baseline | 10.8 | 9.0 | 18.8 | 15.0 |
| FLAN-T5 Base | Wordlist, Sentences, Book Passages | 16.1 | 13.2 | 13.2 | 10.5 |
| FLAN-T5 Small | Wordlist, Sentences, Book Passages | 12.4 | 10.3 | 16.7 | 13.2 |
| FLAN-T5 XL | Wordlist, Sentences, Book Passages | 16.5 | 13.5 | 17.3 | 14.4 |
| FLAN-T5 XXL | Wordlist, Sentences, Book Passages | 25.7 | 22.8 | 19.9 | 16.8 |
| LLaMA-2 7B | Wordlist, Sentences, Book Passages | 19.6 | 17.5 | 14.2 | 11.8 |
| LLaMA-2 13B | Wordlist, Sentences, Book Passages | 30.4 | 27.0 | 31.8 | 27.9 |
| LLaMA-2 70B | Wordlist, Sentences, Book Passages | 37.8 | 34.9 | 38.0 | 34.3 |
| Claude 2.0 | Wordlist, Sentences, Book Passages | 42.8 | 39.7 | 40.8 | 36.2 |
| GPT 3 (davinci) | Wordlist, Sentences, Book Passages | 38.4 | 35.4 | 27.9 | 23.9 |
| GPT 4 | Wordlist, Sentences, Book Passages | 38.4 | 35.7 | 39.9 | 35.7 |
| Claude 2.0 | Book (100k tokens), Sentences | 44.7 | 41.8 | 45.8 | 39.8 |
| Human | Human | 52.6 | 49.5 | 57.4 | 53.2 |
Comment

We see that finetuning on the small number of available parallel sentences yields results which are reasonable, but not nearly as strong as those obtained from in-context learning.

This is consistent with results observed in previous work, such as Adelani et al. 2022 [1], which Reviewer tEGh cited in their response. In Adelani et al. 2022, the least-spoken language has 13K parallel sentences (more than 8x Kalamang) and 1M speakers (more than 5000x Kalamang), and yet the method reaches only 28.5 chrF on Ghomálá → French translation. Of course, chrF scores are not exactly comparable across languages, but this low score still demonstrates the need for large amounts of data in traditional machine translation.

In our case, with 1134 parallel sentences, finetuning reaches 35.2 chrF kgv-to-eng and 30.8 chrF eng-to-kgv. By contrast, we reach 44.7 and 45.8 chrF by performing in-context learning on 100K tokens from the grammar book, and the human baseline shows that at least 52.6 and 57.4 chrF are achievable.

We will integrate these baseline results into the paper. Once again, thank you for your helpful suggestion.

[1] A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation (Adelani et al., NAACL 2022). https://aclanthology.org/2022.naacl-main.223.

AC Meta-Review

This paper investigates the ability of large language models to adapt to new tasks that are unseen in the training sets, and proposes to study machine translation to a low-resource language given only one book of grammar explanations. The paper examines a set of baselines for this purpose, and the authors add to this set following the suggestions from the reviewers. The work demonstrates that while the performance of LLMs is promising for this task, their performance falls short in comparison to a human who learned this specific language using the same resources. While it is concerning that no new method is proposed in the paper, the work is interesting and is expected to initiate good discussions and inspire future work.

Why not a higher score

The paper is very interesting, but no novel methods are proposed.

Why not a lower score

All reviewers and I agree that this is a good paper for presentation at ICLR. Furthermore, authors have included an extensive set of baselines in their work.

Final Decision

Accept (spotlight)