PaperHub
Score: 6.8 / 10
Poster · 4 reviewers (min 6, max 7, std 0.4)
Individual ratings: 7, 7, 7, 6
Confidence: 3.0
COLM 2025

Probing Syntax in Large Language Models: Successes and Remaining Challenges

OpenReview · PDF
Submitted: 2025-03-22 · Updated: 2025-08-26
TL;DR

This work evaluates syntactic representations in LLMs using structural probes. We assess these probes across three benchmarks, revealing that their accuracy is compromised by linear distance and syntactic depth, yet remains invariant to surprisal.

Abstract

Keywords
Syntax · LLMs · Probing · Evaluation

Reviews and Discussion

Review (Rating: 7)

The paper proposes an investigation of how the syntactic information obtained from structural probes on embeddings from LLM layers is biased or affected by specific syntactic representations in the analysed sentences. The work is important to better understand how LLMs "understand" syntax and syntactic phenomena in the sentence and where this knowledge is embedded.

Reasons to Accept

The results highlighted by the paper provide more detailed hints both on the information that structural probes are capable of obtaining from LLMs and on how some syntactic constructions are difficult for LLMs to understand.

Reasons to Reject

Some parts of the paper are not very detailed, and this could make the paper difficult to read and understand for non-domain experts.

Questions to the Authors

The sentence at line 37, "However, the performance of such Syntactic Probes has only been assessed with aggregated scores over large naturalistic corpora.", does not account for some studies in which detailed tests were run on sentences containing specific syntactic phenomena peculiar to certain languages. For example, two studies that tested subject omissibility and subject-verb agreement, building tailored sentence sets to verify whether and where the LLMs' layers encode the knowledge about these kinds of sentences, are presented in:

Guarasci, Raffaele, et al. "BERT syntactic transfer: A computational experiment on Italian, French and English languages." Computer Speech & Language 71 (2022): 101261.

Guarasci, Raffaele, et al. "Assessing BERT’s ability to learn Italian syntax: A study on null-subject and agreement phenomena." Journal of Ambient Intelligence and Humanized Computing 14.1 (2023): 289-303.

You should mention these studies and improve the sentence at line 37, taking into account such kind of studies.

Period is missing at the end of line 47.

Line 90: you should highlight that long and complex dependencies are also very difficult for humans to understand. Although in constructions with relative clauses the distance between the pronoun that introduces the subordinate clause and its antecedent could be increased virtually to infinity while still producing a grammatically correct sentence, the resulting sentence becomes very complicated and slightly unusual (such as the sentence at line 90: "The cat believes that the fox expects that the mouse moves").

Line 91: what is the generative grammar used? Add the detail and a reference.

Line 130: Why are the details of the polar probe included in an appendix, while the structural probe's details are included in the main text? Considering the more recent publication of the polar probe (2024), it could be more interesting than the well-known structural probe (2019). Perhaps both descriptions should be included in the main text of the paper.

Line 139: some brief details of the Mistral-7B-v0.1 LLM should be included (total number of layers, architecture, language, pretraining corpus, etc.).

Line 143: how did you set the hyperparameter values for the training of the probes?

Line 157: You should introduce and present the surprisal concept, also specifying how it is exploited in your evaluation.

Line 161: Why do you adopt classifiers to predict the output for ground-truth positive instances? This point is not very clear to me, and more details and a deeper explanation are required in the paper.

Results: some example sentences from your datasets should be used to better explain and support the results obtained (as a sort of qualitative explanation), allowing readers to better understand the results of your study. Why don't you include them?

Comment

We thank Reviewer tL7W for their insightful comments.


Related work

Clarification at L37

“However, the performance of such Syntactic Probes has mostly been assessed with aggregated scores over large naturalistic corpora.”

Added citations at L64

“Importantly, previous work has studied generalization of structural probes using both controlled and cross-linguistic stimuli (Guarasci, 2022; Guarasci, 2024; Kulmizev, 2020; Müller-Eberstein, 2022).”

L47: Typo corrected.


Dataset design

Addition at L93

“…introducing varying levels of syntactic complexity. Three nestings are already unusual in natural language and difficult for humans to understand.”

L91
Details of the generative grammar will be placed in the appendix of the camera-ready version of the paper.


Methods

L130
The section was temporarily moved to the appendix due to space constraints; it will return to the main text in the camera-ready version.

Model description added at L139

“…of Mistral-7B-v0.1 and Llama-2-7b-hf, transformer-based LLMs with 32 layers and 7 B parameters (Jiang et al., 2023; Touvron et al., 2023).”
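For readers wanting to reproduce this kind of probing setup, below is a minimal sketch of extracting layer-wise hidden states with the Hugging Face transformers library. This is not the authors' code; the example sentence is taken from the PP stimuli listed later in this thread, and the chosen layer index is purely illustrative.

```python
# Minimal sketch (not the authors' code): extract layer-wise hidden states
# from Mistral-7B-v0.1, the representations on which syntactic probes are trained.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

sentence = "The grandmas beside the neighbour laugh."  # illustrative PP stimulus
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 33 tensors (embedding layer + 32 blocks),
# each of shape (batch, seq_len, hidden_dim); a probe reads one of these layers.
layer_16 = outputs.hidden_states[16]
```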

Hyperparameter note at L143

“For the Polar Probe, the hyper-parameter λ was set to 10.0, as specified in (Diego-Simón, 2024).”


Surprisal & classifiers

Definition added at L157

“Surprisal, in causal language models, quantifies the information content of a word wᵢ via its conditional probability given its previous context (Shannon, 1948).”
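For concreteness, the standard information-theoretic definition (a general fact, not a formula quoted from the paper) is:

```latex
\mathrm{surprisal}(w_i) = -\log_2 \, p\bigl(w_i \mid w_1, \dots, w_{i-1}\bigr)
```

so less predictable words carry higher surprisal.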

Clarification at L163

We aim to study and characterize the failure cases of probes. To that end, we use existing dependencies in naturalistic and controlled sentences. Then we model which factors (linear distance, depth, surprisal) best explain the success or failure of the probes in predicting those dependencies.

We clarify this:

“These classifier models reveal which properties of a dependency (linear distance, syntactic depth, surprisal ...) are most relevant for the probe to predict it correctly.”
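A minimal sketch of this kind of analysis (not the authors' code; the feature matrix and labels below are random placeholders, not data from the paper) would fit a classifier on per-dependency features and then inspect its feature importances:

```python
# Sketch with placeholder data: predict probe success per dependency from its
# properties, then read off which features drive that prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
# One row per gold dependency:
# [linear distance, syntactic depth of the head, surprisal of the dependent word]
X = np.column_stack([
    rng.integers(1, 15, n),      # linear distance (placeholder)
    rng.integers(1, 8, n),       # head depth (placeholder)
    rng.normal(8.0, 3.0, n),     # surprisal in bits (placeholder)
])
# Placeholder labels: 1 if the probe recovered the dependency, else 0.
y = (rng.random(n) < 1.0 / (1.0 + 0.2 * X[:, 0])).astype(int)

clf = GradientBoostingClassifier().fit(X, y)
print(dict(zip(["distance", "depth", "surprisal"], clf.feature_importances_)))
```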


Examples

Excellent idea. We will add didactic examples in the Appendix of the camera-ready version of the paper.

Prepositional-phrase (PP) sentences with one, two, and three levels of nesting (no fillers):

  • 1 nesting: The grandmas beside the neighbour laugh.

  • 2 nestings: A tailor by the bakers past some waiters jumps.

  • 3 nestings: The soldiers around the baker beside the guest behind a neighbour smile.


Comment

The authors clarified and further improved the paper, confirming the initial evaluation ("Good paper, accept"). I am satisfied by these replies. Moreover, I have also checked the other replies and comments from the other reviews, which confirm the improvements to the paper that can be expected in the final version.

Review (Rating: 7)

This paper compares three (two + a baseline probe) syntactic structural probes on Mistral representations of naturalistic and curated constructed sentences.

I'm not confident that I properly understood the paper, which is reflected in my review.

Reasons to Accept

The research question is important. Researchers typically take the outputs of these structural probes at face value, but this paper instead looks at what biases the probes themselves may introduce.

Reasons to Reject

A major point of confusion for me, which is also a criticism, is the use of Mistral as the gold standard. If the authors are working to understand the biases introduced by the two probes, then they must assume that Mistral has the "correct" representations, and any effects that they observe are deviations from this correct representation. It's the reverse of the problem that the authors are addressing (at least in my understanding), where work assumes that the probe is accurate and any deviations are the result of the underlying model. It wasn't clear to me how the authors disentangle these two. For example, is the sensitivity to linear distance (Fig 2 BL) primarily a property of Mistral or of the probes?

Related to that, I think it would be particularly important to repeat this study with other LLMs, which might help differentiate between these possibilities.

I found the interpretation of Fig 3 kind of odd, or at least the description of it. The argument is that since head depth and linear distance dominate in feature importance, "these results reinforce our earlier findings, suggesting that structural properties of syntax primarily drive probe performance." However, "structural"/hierarchical distance is usually contrasted with linear/string distance, so this description of the results feels misleading.

Comment

We thank Reviewer qy93 for their constructive review.

1  Model choice

We agree that extending the findings to other LLMs is an important point.
Accordingly, we have run experiments on Llama-2 7B and BERT-large.

  • We replicate the negative impact of linear distance and syntactic depth on probe performance for both models.
  • Feature importances on Llama-2 7B mirror those previously reported for Mistral-7B: surprisal does not predict probe failure.
  • Figure 4 is reproduced with both new models. Interestingly, BERT-large shows better generalization with respect to syntactic depth and attraction effects, likely owing to its acausal nature.

All results are available at the anonymized URL:
https://drive.google.com/file/d/12Mi8SOlB6FqJD1eL6el3UR5SliNYoVTd/view?usp=sharing

We will integrate these results into the Appendix of the camera-ready version of the paper.


2  Interpretation of Fig. 3

We thank Reviewer qy93 for highlighting the need for a clearer interpretation.

We replace the sentence with:

“These results reinforce our earlier findings, suggesting that probe performance is most impacted by both syntactic depth and linear distance.”

Comment

I am satisfied by these replies and have updated my score accordingly.

Comment

We thank Reviewer qy93 for their time and effort evaluating the additional results.

Comment

Yes. Thank you for the detailed response. I have updated my score.

Review (Rating: 7)

This paper examines the performance of structural probes on three benchmarks that the authors have composed, with the goal of investigating how structural and statistical factors systematically affect the probes’ performance. The benchmarks are constructed around three syntactic structures: PP, CE, and RB, each of which contains three nesting levels {1,2,3}, reflecting the difficulty of correctly probing the syntactic relation between subject and verb. The main findings reveal that probe performance is sensitive to structural features such as the distance between words and syntactic depth, rather than lexical-level statistics such as surprisal. There are similarities between probe errors and human parsing errors.

Reasons to Accept

  • The research question is well-motivated and clearly defined.
  • Exploring what kind of syntactic knowledge is learned by LLMs is meaningful to linguistic theories.
  • The benchmarks collected will be useful for future work.

Reasons to Reject

  • The writing of the later part can be further polished. Some of the arguments in the discussion are not solid, for example:
    • What does “geometric structure syntax” mean at line 303?
    • At line 304, “probes are trained without any initial context” is obvious, right? If it is a sophisticated model that takes more context as input, it won’t count as a “probe”.
    • I do not see why the paper ends with a reference to prompting papers (e.g., Kojima et al., 2022). Maybe I missed something here. I know that structural priming is about how syntactic structures that have occurred in previous context tend to recur more often later; how is this classical problem investigated in LLMs? Maybe the authors can provide a brief review in a related work section.

Questions to the Authors

Line 12: "and of" => "of"
Line 177: "perplexity" => "surprisal"

Comment

We thank Reviewer dAJ4 for their thoughtful comments.

The reviewer’s concerns focus on our Discussion section, which we have now thoroughly revised. All issues raised have been addressed, and we hope these improvements will be reflected in the updated assessment.


Discussion

Clarification added (L303): “geometric structure syntax”

The original phrase “geometric structure syntax” was ambiguous.
We have replaced it with:

“Alternatively, LLMs might encode syntactic trees on a non-linear manifold in hidden space, so linear probes would inevitably distort that structure.”

Prompting / Priming effects

In our experiments, we input sentences one by one to the LLM without any prior instruction to focus on the syntax or linguistics of the input sentence. Prior work shows that context can substantially alter LLM behaviour in generative tasks (e.g., improved subject-verb agreement). We aim to highlight the potential impact of prompting on latent representations, which could improve probing results.

Clarification added (L304): “Prompting”

“As recently reported (Lampinen, 2024), prompting LLMs affects their downstream behaviour. In this study, we input the sentences one-by-one without any prior task context. Changing this setup might yield linguistically richer latent representations, potentially improving syntactic-probe performance.”


Typos

Corrected

Comment

Dear Reviewer dAJ4,

Have our clarifications addressed your concerns regarding the Discussion section?

Thank you for your time and thoughtful feedback.

Review (Rating: 6)

This paper presents a thorough investigation of the structural probe and its extension, the polar probe. It begins by evaluating both probes on naturalistic sentences and finds that linear distance and syntactic depth significantly affect the probes' ability to capture sentence structure, while word predictability does not appear to influence performance. Based on these observations, the paper proposes three controlled stimuli: prepositional phrase, center embedding, and right branching, and uses them to further validate the observations. The results suggest that the structural probes are biased by the linear distance between two syntactically related words and the syntactic depth of the head. Additional factors such as congruent nouns and ungrammatical verb forms also influence probe performance. Moreover, the paper provides an analysis of structural probes and human parsing, offering insights into their similarities and differences. The authors propose to use controlled sentences as a future benchmark to evaluate the performance of structural probes.

In summary, this work presents several interesting observations and highlights the potential limitations of current structural probe design. However, I believe the paper would benefit from more comprehensive experiments and deeper discussion. In particular, I have concerns about whether the performance drop observed on the controlled stimuli is expected, which raises questions about the validity of the methodology and the strength of the conclusions.

Reasons to Accept

  1. The paper studies an interesting and important problem, understanding the success and limitations of structural probes of large language models. The experiments offer several valuable observations and insights for the future design of structural probes.
  2. The paper is well-organized, and the writing is generally clear and easy to follow.

Reasons to Reject

  1. Since this paper aims to examine whether structural and polar probes rely on superficial properties of sentences and to evaluate the robustness of these probes in capturing linguistic features, it is important to ensure the generalizability of the findings and the comprehensiveness of the experiments. I noticed that the experiments are conducted only on Mistral-7B-v0.1. Can the observations be generalized to other models such as Llama? Moreover, regarding the choice of probe, the related work mentions other extensions of structural probes, such as spectral probes. Why was the polar probe chosen for this study? It would be helpful to add a justification for this choice. Moreover, while the structural probe and polar probe are both evaluated in this paper, there is no in-depth discussion comparing those two probes. For example, the structural probe exhibits a larger performance drop on controlled sentences compared to the polar probe (as shown in Figure 2 Bottom Left). Is there any explanation for this difference?

  2. Please note that I am not an expert in syntactic structure analysis. In your experiments, how do you ensure that the construction of the controlled stimuli does not affect the original dependency tree and syntactic structure of the sentence? Additionally, the probes are trained on naturalistic sentences from UD-EWT, while the controlled stimuli may be considered out-of-distribution. If so, wouldn’t a performance drop be expected? As a benchmark, how do you determine that robustness to these controlled sentences is a desirable behavior? While the empirical results suggest some similarity between structural probe behavior and human cognition (Section 4.2), it would be helpful to include human evaluation on the controlled stimuli to further support this argument.

Questions to the Authors

  1. Line 149, is B_p^* the polar probe? It is not mentioned earlier in the main content.
  2. I noticed that the scale of linear distance in Figure 5 ranges from 0 to 50, whereas in Figure 2, it ranges from 0 to 10. Why does this difference occur?
  3. It would be helpful to include qualitative examples where the structural probe fails, particularly at longer linear distances or increased syntactic depth.
Comment

We thank Reviewer D9Dc for their detailed and relevant feedback.


1  Model choice

We agree that extending the findings to other LLMs is an important point.
Accordingly, we have run experiments on Llama-2 7B and BERT-large.

  • We replicate the negative impact of linear distance and syntactic depth on probe performance for both models.
  • Feature importances on Llama-2 7B mirror those previously reported for Mistral-7B: surprisal does not predict probe failure.
  • Figure 4 is reproduced with both new models. Interestingly, BERT-large shows better generalization with respect to syntactic depth and attraction effects, likely owing to its acausal nature.

All results are available at the anonymized URL:
https://drive.google.com/file/d/12Mi8SOlB6FqJD1eL6el3UR5SliNYoVTd/view?usp=sharing

We will integrate these results into the Appendix of the camera-ready version of the paper.


2  Syntactic structures

The three constructions evaluated (PP, CE, and RB) each possess distinct dependency trees, which change with every nesting level. We always evaluate probe predictions against the sentence’s gold dependency tree, parsed by both linguists and automatic parsers.

  • PP, CE and RB are valuable because they contain subject-verb dependencies that disentangle linear distance from syntactic depth, confirming the results from the UD-EWT corpus (Fig. 2, top).
  • Their single-nesting variants occur frequently in natural language, so they are not out-of-distribution for the probes’ training data.
  • Replicating the findings in UD-EWT and observing strong attraction effects for single PP sentences (Fig. 4, top-right) shows that performance gaps reflect a real limitation, not a train-test mismatch.

3  Similarity with humans

Section 4.2 draws several parallels with human cognition such as the effects of ungrammaticality and primacy/recency, but two caveats apply:

  1. Psycholinguistic studies typically track behaviour (production or acceptability),
    whereas probes infer syntactic trees directly from an LLM’s internal states, a task conceptually closer to decoding syntactic parses from brain activity.
  2. This methodological gap may explain the sharper decline in probe accuracy as linear distance and depth increase.

Bridging this gap is an interesting direction for future research.


4  Choice of probes

Our aim is to probe distance-based syntactic codes, originally introduced via the Structural Probe.

  • The Polar Probe extends this by encoding dependency types with angular information while retaining the distance-based component.
  • We focus on distance-based, linear syntactic probes, deliberately excluding non-syntactic (Spectral) and nonlinear (Hyperbolic, Action, MLP) probes.

Clarification added (L136): Choice of probes

“We extend the study to the Polar Probe since it augments Hewitt & Manning’s framework with additional syntactic information while preserving a linearly readable, distance-based syntactic code. See Appendix A for implementation and training details.”
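For context, the distance-based code referred to here is the one introduced with the Structural Probe (Hewitt & Manning, 2019): a linear map B is learned so that squared distances between transformed hidden states approximate tree distances between the corresponding words. This is the standard formulation from that prior work, not a formula taken from the present submission:

```latex
d_B(h_i, h_j)^2 = \bigl(B(h_i - h_j)\bigr)^{\top}\bigl(B(h_i - h_j)\bigr) \approx d_{\mathrm{tree}}(w_i, w_j)
```

As noted in the bullets above, the Polar Probe keeps this distance component and adds angular information to encode dependency types.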

New observation added (L233): The Polar Probe generalizes better to linear distance

“When comparing the Structural and Polar probes, we find that the Polar Probe exhibits a slower decay in accuracy with increasing linear distance (Fig. 2, bottom-left). We hypothesize that its more constrained objective, including additional syntactic information, encourages learning more robust syntactic representations, relying less on surface heuristics.”

Figure 4 has now been replicated with the Structural Probe as well (see the anonymized URL above).


5  Responses to specific questions

  1. Added at L135: “The Polar Probe (B_P^*) ...”
  2. We retain bins 1 – 10 to align scales with the controlled dataset (figure just below) and ensure sufficient data points (note the log-scale y-axis in Fig. 5).
  3. We will add failure examples in the Appendix of the camera-ready paper.

Comment

I thank the authors for their detailed explanations. They have resolved most of my concerns, and I have updated my score accordingly.

Comment

We thank Reviewer D9Dc for their time and effort evaluating the additional clarifications and results.

Final Decision

The reviewers largely positively evaluate this work and its experiments testing the behavior of structural probes across various attributes of syntactic dependencies like parse depth and linear distance. The pros of this work are in running controlled studies of the behavior of a well-known language model analysis tool on modern language models. Reviewers generally approved of the paper’s clarity. For cons, see below.

I won’t override the unanimous opinions of the reviewers, so I’ll recommend acceptance. I want to provide some editorial thoughts for the authors, though, since I’ve spent a good amount of time thinking about structural probe research.

After reading the paper myself, I am left somewhat confused – does the decay of the structural probe’s parse accuracy show that the probe is failing, as claimed by the authors? Or is the probe accurately reflecting the underlying language model’s failure to parse? A probe’s parsing failure is not necessarily a failure to give insight into the model, if the model is just failing to build the tree correctly. Arguably, modern models almost never make syntactic errors, so maybe the premise of this work is that a good probe should reflect perfect parsing, and thus not decay in accuracy. This is discussed by Reviewer qy93, but unlike reviewer qy93, I do not think this concern is mitigated by evaluating more models. For, e.g., the 3-depth center embedding synthetic data, with no semantic cues, is really confusing for humans and machines both (also noted by Reviewer tL7W), so it’s plausible to me that models really do just fail to (implicitly) parse them correctly. At a higher level, I’m just not sure what the successes of the probes are (success shows up only in the title) and whether the failures of parsing accuracy are probing failures, or language model failures that are correctly captured by the probes. One way to study this would be to compare probe results to behavior, e.g., when the probe gives an incorrect parse, does the model’s behavior show corresponding inaccuracies? If so, the probe is correctly reflecting the model, otherwise it’s a failure.

This is a nit, but I’m confused that the authors stated “red ideas sleep furiously” as an “infamous” Chomsky quote – the cited work (Chomsky, 1957) has “colorless green ideas sleep furiously,” so the authors should use that quote, if only to be faithful to the cited work.