PaperHub
Average rating: 7.3 / 10
Decision: Poster · 4 reviewers
Ratings: lowest 6, highest 8, standard deviation 0.8
Individual ratings: 8, 6, 7, 8
Confidence: 2.5
COLM 2025

UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

The study presents a linguistic-features-based annotation of Linguistics Olympiad puzzles to find LLM weaknesses; LLMs perform poorly on puzzles with higher morphological complexity, on languages dissimilar to English, and when the puzzle is data-constrained.

Abstract

Keywords
linguistic reasoning, metalinguistics, LLM evaluation, morphology, linguistics olympiad, interpretability, low resource languages, annotation

Reviews and Discussion

Official Review
Rating: 8

Edit: Score increased after rebuttal


Summary:

The authors analyze the failings of LLMs on linguistic reasoning problems based on the linguistic content of the puzzles. They find that puzzles sharing features with English are easier, and puzzles with more morphological complexity are more difficult. This indicates the potential for better tokenizers as a step towards better performance.


Originality:

This is a new method for understanding the performance of LLMs on linguistic reasoning problems. Applying linguistic concepts to the problems and identifying patterns offers a new insight on the poor performance of most models.

Quality:

The paper seems to be methodologically sound. The core of the work is the re-annotation of the LO problems, which is accomplished by two expert annotators with good inter-rater agreement.

Clarity:

The writing is passable but not excellent. At many points, it is difficult to understand what the authors are referring to (e.g. L97 - “LLM performance explanations in literature” could apply to both a wide and narrow scope. WALS is also never defined/explained.) The writing is also imprecise in many places (e.g. L56 “and more”, L212 “morph”) and has grammatical errors which could confuse the reader (e.g. L270 is not a sentence).

The Figures should also be clarified. I don’t understand why the authors separated Figures 8 and 20. It would seem natural to add a third bar for @-separated morphemes in Figure 8. Readers should not need to refer to the appendix for results which are discussed in the main body (L298). Figures 5, 6, 7, and 9 would probably make more sense as a single larger figure which shows the process of annotation on a single puzzle.

The rendering of the Figures is also blurry on my screen. I appreciate this may not be the authors’ fault, but it is worth checking.

Significance:

The results in this work have moderate significance, but the paper is somewhat lacking in extensions beyond implications for solving LO puzzles. The result about English bias is unsurprising, and would be more useful if it were extended to consider models with a higher emphasis on multi-linguality to show whether the bias is ameliorated (e.g. Aya/Command A or Gemma). The tokenization result is exciting and has the potential to be impactful, but it would be useful to connect to LLM reasoning/performance more generally, as linguistic reasoning itself is not a particularly common application. This connection seems quite possible with examples such as "counting the 'r's in 'strawberry'" coming from potentially similar issues.

Reasons to Accept

  • This is a methodologically-sound and principled paper, which connects the abilities of LLMs on linguistic reasoning tasks to established categories of linguistic phenomena.
  • The implications, particularly the relevance of tokenization for task performance, are useful for improving technical methods.

Reasons to Reject

  • It is unclear whether the authors have controlled for known causes of task difficulty (number of reasoning steps) when identifying new areas (number of morphological features). The newly identified features may be highly correlated, or not a separate phenomenon at all.
  • The significance is limited to a relatively small area of research with limited discussion of wider implications.
  • The experiments do not include state of the art models when trying to explain patterns in model performance.
  • The writing of the paper itself is casual and borders on sloppiness.

Questions to Authors

Critical Issue:

  • Do the authors control for problems simply having more features of any kind? We know that LLMs struggle with multi-hop reasoning, are morphemes specifically a separate phenomenon? (A solid answer to this question would suffice for me to weakly accept the paper.)

Additional Experiments:

  • Why do the authors not consider Claude models, which are the highest scoring on both LingOly and Linguini by a considerable margin?
  • Would inference-time compute models show similar patterns in performance?
  • For the tokenization experiment, introducing whitespace means that the resulting tokens will mostly be beginning of word tokens (e.g. “_lucky” vs “_un”+”lucky”). Would it be better to manually tokenize the words with the boundaries in the correct places? (At least for the open models).
  • Does the English bias reproduce when using models with a deliberate focus on multilingual training data and tokenizers?

Interpretation:

  • For the tokenization experiment, a similar approach was attempted in Khouja et al. (LingOly-TOO) which reported little benefit for the models. What explains the higher success in this case?
  • What would the authors consider to be the primary benefit of this work? If we assume that solving LO puzzles is not intrinsically useful, and that the puzzles are probably not the most efficient way for LLMs to learn LRLs, what larger area of research will benefit from these results?

Artefacts:

  • Is/will the annotation data be available for future use?
  • The ethical statement should state that the authors complied with all relevant licenses and intellectual rights. “Publicly-available” does not provide a presumption of ethical access.
Comment

We thank Reviewer TCNB for their thorough and constructive feedback, and appreciate the detailed evaluation and the opportunity to respond. Below we address each of the concerns raised, clarifying our motivation, methodology, and revisions for the camera-ready version:

  • Critical Issue (Reasoning and Complexity Controls): We appreciate this excellent observation regarding the decomposition of puzzle complexity and characterization of LO puzzle difficulty. Our study also focuses on this challenge by using linguistic features as proxies for complexity, given that LO puzzles are inherently difficult to decompose into reasoning steps or atomic units. As established by Bozhanov et al. (2013), these puzzles "are designed to be solved by deducing linguistic patterns, with other sources of technical complexity increasing difficulty without making them linguistically interesting". Our findings demonstrate that linguistic features exhibit at least a linear correlation with LLM performance. Future work could explore potential super-linear relationships between features and performance, as linguistic patterns across different domains (morphology, syntax, etc.) become increasingly intertwined in more challenging puzzles, such as those in the International Linguistics Olympiad. We will include this motivation for breaking down LO puzzles’ reasoning into features in the camera-ready version for clarity as well.
  • Significance of the work: This work represents a foundational step toward probing and enhancing linguistic reasoning capabilities in LLMs. These are crucial because human proficiency in linguistic reasoning and metalinguistics contributes significantly to efficient learning from limited input. Following the frameworks and goals established by Mahowald et al. (2023), McClelland et al. (2020), Beguš et al. (2023), Şahin et al. (2020), and Marcus et al. (2020), improvements in LLM metalinguistic processing capabilities can enhance general performance across diverse tasks. Our work proposes an uncontaminated method for probing these abilities, with findings that can inform more efficient language learning in LLMs and improve their general capabilities.
  • State-of-the-art benchmarking: Our model selection prioritized diversity across model families, parameter counts, and reported performance on generic reasoning benchmarks within strict budgetary constraints, rather than maximizing benchmark scores. Consequently, we could not include the Claude family in our initial analysis.
  • ITC models: We recently gained access to DeepSeek-V3 and DeepSeek-R1 (an inference-time compute model), with the latter demonstrating superior performance while maintaining our observed correlations between model performance and morphological features (Pearson correlation = -0.74***, p < 0.001; see the sketch after this list for how such a correlation is computed). These results will be included in the camera-ready version:

    | Prompt              | Prev. Best |  R1  |
    |----------------------|:----------:|:----:|
    | Null Prompt          | 47.4       | 60.7 |
    | ModeLing Minimal     | 53.1       | 61.4 |
    | ModeLing Hand-Tuned  | 49.2       | 64.7 |
    | ModeLing Basic-CoT   | 48.7       | 64.9 |
    | ModeLing Full-CoT    | 49         | 60.5 |
    | LingOly Std          | 48.3       | 61.9 |
  • Tokenization Experiment: You are correct in assuming that tokenization does not necessarily match morphology-based splitting of words, which is why we added whitespaces manually (Line 280-281: "annotate"). We will clarify this explicitly in the camera-ready version.
  • Comparing Tokenization Results: The divergence from LingOly-TOO findings stems from methodological differences. While LingOly-TOO employed automatic tokenizers for character-level processing of unseen languages, our approach utilized morphologically-informed word segmentation based on language-specific rules, resulting in substantial improvements.
  • English bias: We acknowledge your important concern about the English bias possibly originating from monolingual models. However, our experiments involve models like GPT-4o, Llama 3.1, etc., which are multilingual. Since we test multiple such models, we observe that the bias is consistently present and statistically significant.
  • Multilingual models: You have also pointed out Aya and Gemma as models which are explicitly trained on more balanced multilingual data. We acknowledge that these would be good future additions to our line of study, as they may present interesting results. We will consider them in our future work.
  • Data release: We plan to release all the annotations (mappable to data from the publicly available datasets), code, and fixes to the data upon paper acceptance.
  • Ethics Statement: We will clearly state in the camera-ready version that we comply with all available licenses and/or have sought authors' explicit permission to use their datasets. We have made minimal changes and will publish only mappable annotations and a list of fixes for full reproducibility.
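A minimal sketch of the correlation computation referenced in the ITC models bullet above, assuming per-puzzle morphological feature counts and model scores as plain Python lists; the numbers below are illustrative placeholders, not the paper's data:

```python
# Minimal sketch: Pearson correlation between per-puzzle morphological
# complexity and model performance. Values are made up for illustration.
from scipy.stats import pearsonr

morph_feature_counts = [1, 2, 3, 4, 5, 6]             # hypothetical feature counts
model_scores = [0.90, 0.80, 0.70, 0.50, 0.40, 0.20]   # hypothetical accuracies

r, p_value = pearsonr(morph_feature_counts, model_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")      # strongly negative for this toy data
```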

(contd. below)

Comment

Thank you for your thoughtful responses. I will definitely increase my score. I have a few follow-up questions, but they are largely motivated by curiosity because I like the paper. I do not want to bury you in additional work, so please feel free to decide what you think is worthwhile to consider.

Critical Issue (Reasoning and Complexity Controls): I'm still a bit hesitant that we can't really distinguish here between puzzles which are hard because they require more steps and puzzles which are hard because they require more steps involving morphology. (That is, is morphology special, or is this just a case of finding the general pattern again?) Your argument that these puzzles are fundamentally constrained by linguistic features is well-taken, and it may be that in this case we can't make the distinction I'm asking about because there aren't significant non-linguistic reasoning elements involved in the puzzles.

Significance of the work: I think this is a good answer; I would hope to see something like this reflected in the camera-ready version.

State-of-the-art benchmarking: I think this is unfortunate, as it would be useful to know if stronger models have different properties, but I entirely understand the need to balance cost and compute constraints and do not begrudge the authors making the decisions they did.

ITC models: This is a great result, and I'm particularly interested in seeing that the improvement in scores came with an even stronger correlation to the morphological features!

Tokenization Experiment: Sorry, let me clarify here. I believe I understood the experiment, but want to point out a small technical detail. To use an example, the GPT-4 tokeniser (https://platform.openai.com/tokenizer) tokenises " establish" (with a whitespace) into token 5813, but the token corresponding to "establish" in " antidisestablishmentarianism" is token 34500. In general, there are often different tokens for word pieces with and without a prefixed space. By inserting spaces, the authors will have created a tokenisation using only the type of tokens with prefixed spaces. I wonder if the same results would be obtained (or if the models would do even better) if the tokens did not include initial spaces. (I think the @ breaks may have done a better job of attaining this result, though with the contrivance of inserting many @ signs.)
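A minimal sketch of the distinction described above, assuming the tiktoken package and the cl100k_base encoding used by GPT-4-era models; the specific ids (5813, 34500) are the reviewer's report and are not asserted by this snippet:

```python
# Minimal sketch: the same word piece with and without a leading space
# generally maps to different token ids in a BPE vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode(" establish"))     # space-prefixed form
print(enc.encode("establish"))      # same string without the leading space
print(enc.encode(" antidisestablishmentarianism"))  # "establish" appears as an in-word piece
```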

Comparing Tokenization Results: Thanks, this makes sense. You've added more (and high quality) information, so the results are much better.

English bias/Multilingual models: Thanks.

Data release: Great!

Ethics Statement: Excellent, thank you.

Others: Please do edit and tighten the text. Linguistic expertise is not necessarily widespread in the ML/LLM community, and higher quality writing will increase the accessibility and impact of your valuable research efforts.

Comment

Thank you so much for increasing the score and for your continued engagement with our work! We truly appreciate your thoughtful follow-up questions and are excited (and not burdened) by the opportunity to discuss these important details with someone invested in the research:

  • Critical Issue (Reasoning and Complexity Controls): We appreciate the clarification and your concern about distinguishing morphological difficulty from general reasoning complexity. Our manual word-splitting experiment addresses exactly this question - by "solving" morphology for the LLMs (for some puzzles focused only on morphology) through explicit segmentation, we can isolate the difficulty due to non-linguistic reasoning elements. The substantial score improvements after morphological splitting demonstrate that morphology was indeed the primary bottleneck, while the remaining performance gaps in some puzzles reveal the contribution of non-linguistic reasoning components. We will clarify this analysis further in the camera-ready version and propose expanded experiments for future work to more thoroughly establish this distinction.

  • ITC models: We share your interest in this pattern! In fact, we observe this trend consistently across our models - larger and newer models show stronger correlations with morphological features. This could relate to "inverse-scaling" effects as shown by Hofmann et al. (2025), though we remain uncertain about the underlying mechanisms.

  • Tokenization Experiment: Thank you for the clarification and a concrete example! In fact, the effect of added spaces creating different token representations can be even more pronounced in morphologically rich, low-resource languages. We analysed this further by using your example and found that in Mixe, "mtunp" receives completely different tokenizations: without whitespaces [(12232, ' mt'), (359, 'un'), (79, 'p')] versus with whitespaces [(296, ' m'), (11716, ' tun'), (281, ' p')]. This highlights the importance of exploring what makes whitespace addition effective and whether superior segmentation methods exist - a valuable direction for future work.
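A minimal sketch of how (token id, decoded piece) pairs like those above can be obtained, assuming the tiktoken package and the cl100k_base encoding; the exact ids and splits depend on the tokenizer actually used by each model:

```python
# Minimal sketch: compare tokenizations with and without whitespace inserted
# at annotated morpheme boundaries, printing (id, piece) pairs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_pieces(text):
    """Return (token id, decoded piece) pairs for a string."""
    return [(t, enc.decode([t])) for t in enc.encode(text)]

print(token_pieces(" mtunp"))     # Mixe word, no morpheme splitting
print(token_pieces(" m tun p"))   # whitespace inserted at morpheme boundaries
```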

For all other points, we acknowledge the importance of tightening our text for broader accessibility within the ML/LLM community and appreciate your thoughtful suggestions throughout this review process. Your insights have strengthened this work a lot.

Reference:

  • Hofmann, Valentin, et al. "Derivational morphology reveals analogical generalization in large language models." Proceedings of the National Academy of Sciences 122.19 (2025): e2423232122.
Comment

(contd. from above)

  • Figures 5, 6, 7, 9: These figures are supposed to help readers understand the respective analyses and also give them an idea of what the puzzles look like in context (Figure 2 presents all parts of the annotation process). For academic structure and clarity, in the camera-ready version, we will either put them together in a single figure or use the LaTeX example format to present them inline.
  • Figures 8 and 20: We found that "@" separated morphemes yielded results comparable to whitespace separation, hence their exclusion. To address crowding concerns while maintaining clarity, we will present these as stacked bar charts in the camera-ready version.
  • Typos: We will fix the following in the camera-ready version: specify LO puzzles & linguistic competency (L97), add a footnote linking WALS to Dryer & Haspelmath et al., and change "correlation. Implying" to "correlation, implying".
  • Casual writing: Finally, we will pass the paper through another round of proof-reading to make it more formal and academically worded for the camera-ready version.

References:

  • Bozhanov, Bozhidar, and Ivan Derzhanski. "Rosetta stone linguistic problems." Proceedings of the Fourth Workshop on Teaching NLP and CL. 2013.
  • Mahowald, Kyle, et al. "Dissociating language and thought in large language models." Trends in cognitive sciences (2024).
  • McClelland, James L., et al. "Extending machine language models toward human-level language understanding." arXiv preprint arXiv:1912.05877 (2019).
  • Beguš, Gašper, Maksymilian Dąbkowski, and Ryan Rhodes. "Large linguistic models: Analyzing theoretical linguistic abilities of LLMs." arXiv preprint arXiv:2305.00948 (2023).
  • Şahin, Gözde Gül, et al. "PuzzLing machines: A challenge on learning from small data." arXiv preprint arXiv:2004.13161 (2020).
  • Marcus, Gary. "The next decade in AI: four steps towards robust artificial intelligence." arXiv preprint arXiv:2002.06177 (2020).
Official Review
Rating: 6

This paper looks at the performance of Large Language Models (LLMs) on language puzzles from Linguistics Olympiads. The basic goal is to score puzzles for various linguistic properties and see which of those properties correlate with performance.

Reasons to Accept

The basic methods and results seem solid. If a reader is interested in these puzzles or in prompt engineering more generally, this paper will be of interest.

Reasons to Reject

There are way too many acronyms; these don't help the reader and made things hard to understand.

It is not clear why this is an interesting/relevant problem. That is, what actual NLP task is this connected to?

The annotation is a bit mysterious. It looks like there was high inter-rater reliability, but the parameters used were not fully explained.

There were some unexplained things. For example, the authors say they only consider puzzles from "MODELING (M) and LINGOLY (L)", but what are those?

Comment

We thank Reviewer Qiv2 for their constructive feedback and recognition of our solid methodology and results. Below we address each of the concerns raised, clarifying our motivations, methodology, and planned revisions for the camera-ready version:

  • Excessive Acronyms: We will comprehensively revise the manuscript to reduce reliance on acronyms and provide clear definitions upon first usage to enhance readability in the camera-ready version.
  • Relevance: This work represents a foundational step toward uncontaminated probing and enhancing linguistic reasoning capabilities in LLMs. These are crucial because human proficiency in linguistic reasoning and metalinguistics contributes significantly to efficient learning from limited input. Following the frameworks and goals established by Mahowald et al. (2023), McClelland et al. (2020), Beguš et al. (2023), Şahin et al. (2020), and Marcus et al. (2020), improvements in LLM metalinguistic processing capabilities can enhance general performance across diverse tasks. Our immediate findings are relevant for Low Resource Language learning efficiency in LLMs, tokenization optimization, and more.
  • Annotations: The high inter-annotator reliability results from our constrained puzzle design, annotation framework with expert annotators, and well-defined linguistic features. Due to space constraints, we will detail the annotation guidelines in the appendix of the camera-ready version for transparency.
  • modeLing and LingOly: These datasets are introduced in Lines 63-69 and summarized in Table 1 before their reference in Line 72.

References:

  • Mahowald, Kyle, et al. "Dissociating language and thought in large language models." Trends in cognitive sciences (2024).
  • McClelland, James L., et al. "Extending machine language models toward human-level language understanding." arXiv preprint arXiv:1912.05877 (2019).
  • Beguš, Gašper, Maksymilian Dąbkowski, and Ryan Rhodes. "Large linguistic models: Analyzing theoretical linguistic abilities of LLMs." arXiv preprint arXiv:2305.00948 (2023).
  • Şahin, Gözde Gül, et al. "PuzzLing machines: A challenge on learning from small data." arXiv preprint arXiv:2004.13161 (2020).
  • Marcus, Gary. "The next decade in AI: four steps towards robust artificial intelligence." arXiv preprint arXiv:2002.06177 (2020).
Comment

Good response.

Official Review
Rating: 7

This paper investigates how several LLMs cope with linguistic puzzles (typically on very low-resource languages) used in Linguistics Olympiad competitions.

On a carefully constructed dataset of 629 problems over 41 low-resource languages, the models mostly struggle with complex morphology (unless explicit morphological segmentation is given). On the other hand, LLMs perform better with puzzles involving linguistic features that are also found in English.

Reasons to Accept

Overall nice work with insightful analysis.

Provides another carefully curated dataset for future use.

Reasons to Reject

Not much

Questions to Authors

I wonder if you made a distinction between inflectional vs. derivational morphology (or does NA Compound/comp cover this? I think not, but just wanted to check):

Can we assume that the tokenization dictionary before and after morpheme splitting was the same? If so, how much of an overlap was there between the morphemes and the word-pieces (I am guessing not much)?

Minor comments

  • Pgh 234: it seems the language names have been swapped in Figure 6. No?
  • L194: font problem?

Comment

We thank Reviewer iDKA for their positive recommendation and recognition of our work as nice with insightful analysis. We address the specific questions below:

  • Inflectional vs. Derivational Morphology: We agree that distinguishing inflectional from derivational morphology would be insightful. However, our LO dataset contains too few puzzles that isolate one type of morphological process, preventing reliable statistical analysis (NA/Compound do cover some of these, but it is a small fraction of the data). We will note this limitation and suggest it as future work.
  • Tokenization: We did not modify the underlying tokenizer's vocabulary. Instead, we inserted whitespaces manually to force human-informed morpheme boundaries (Line 280-281). Because the tokenizer still uses its original word-piece dictionary, overlap between true morphemes and word-pieces is incidental and typically low (with GPT-4, we observed only a 0.62 (out of 1) Jaccard similarity between the morphemes and the word-pieces; see the sketch after this list for how this overlap is measured). We will clarify this process in the camera-ready version.
  • Language Swap: We will correct the swapped language names in Figure 6 in the camera-ready version.
  • Font Problem: Following the established naming pattern and font style (textsc) in this domain (modeLing, LingOly, Linguini, PuzzLing), we adopted "unveiLing" for our work. The section header at Line 194 removes the "-ing" suffix for grammatical agreement purposes.
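A minimal sketch of the Jaccard overlap mentioned in the Tokenization bullet above, treating morphemes and word-pieces as sets of strings; the example sets below are hypothetical (loosely based on the Mixe word discussed with Reviewer TCNB), not the data behind the reported 0.62:

```python
# Minimal sketch: Jaccard similarity between gold morphemes and tokenizer
# word-pieces, |A intersect B| / |A union B|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

morphemes = ["m", "tun", "p"]     # hypothetical gold segmentation
word_pieces = ["mt", "un", "p"]   # hypothetical tokenizer output (stripped of spaces)

print(jaccard(morphemes, word_pieces))   # 0.2 for this toy example
```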
Comment

Thank you for your response

Official Review
Rating: 8

This paper presents a fine-grained analysis of LLM limitations on Linguistics Olympiad (LO) puzzles. The paper looks into why models struggle. The authors curate and annotate a dataset of 629 problems across 41 low-resource languages with 50 WALS-derived linguistic features and other attributes like similarity to English and data-constrainedness. Their correlation analysis reveals that LLMs perform worse on puzzles with higher morphological complexity and more data-constrained features, while performing better on puzzles exhibiting features also found in English. They also demonstrate that pre-splitting words into morphemes improves solvability, suggesting that tokenization of low-resource, morphologically rich languages is a key issue.

Reasons to Accept

  • Presents thorough experiments and extracts important insights into the linguistic reasoning limitations of LLMs.
  • Contributes to improvements for low-resource languages.
  • The created dataset is a valuable contribution (pending its release by the authors).
  • The paper provides convincing support for the choice of studying linguistic Olympiad puzzles, citing their utility in analyzing LLM performance on low-resource languages.
  • The paper consistently reports significance levels and provides sufficient discussion on the results.

Reasons to Reject

  • There are numerous (minor) presentation issues. Including the following:
    • Line 46 references the research question, which is never explicitly stated.
    • Line 47 references WALS before it is defined.
    • The caption of Figure 1 is non-descriptive and confusing for readers unfamiliar with the problem.
    • Line 131 is redundant.
    • "Feature" vs. "Attribute" are used inconsistently.
Comment

We thank Reviewer D2sE for their thorough evaluation, strong acceptance recommendation, and recognition of the thorough experiments and important insights into linguistic reasoning limitations. We address the presentation issues below (and will fix them for the camera-ready version):

  • Research Question: We acknowledge this oversight and will explicitly state our research questions to provide clear direction for readers.
  • WALS reference: We will add a footnote linking WALS to Dryer & Haspelmath et al. and provide the full form at first usage to ensure proper introduction (World Atlas of Language Structures).
  • Figure 1 caption: We will enhance the caption to be more descriptive and accessible for readers unfamiliar with Linguistics Olympiad problems, providing sufficient context for understanding.
  • Attributes vs. Features: We use "Attribute" as an encompassing term that includes linguistic features alongside English bias and data constrainedness measures. We will clarify this distinction when "Attribute" is first introduced and conduct comprehensive proofreading to ensure consistent usage throughout the paper.
Comment

Thank you for your response. I have no further questions at this time.

Final Decision

This paper identifies properties of Linguistics Olympiad puzzles that cause particular difficulty for LLMs.

Pros:

  1. The paper provides an extensive set of experiments
  2. The experiments are carefully run and well-presented
  3. The paper involves a dataset that will be a useful contribution
  4. The problem is an important one with applications to improving technology for low-resource languages

Cons:

  1. Reviewers noted some presentational issues, such as fairly many acronyms and lack of clarity in stating the research question. I encourage the authors to revise the paper in the ways they have proposed to address these points.
  2. Reviewers note that more discussion of broader implications would be helpful.