PaperHub
Overall rating: 6.0 / 10 · Poster · 4 reviewers
Ratings: 3, 4, 3, 5 (min 3, max 5, std 0.8)
Average confidence: 4.0
Novelty: 3.0 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Dependency Parsing is More Parameter-Efficient with Normalization

Links: OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29

Abstract

Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on semantic and syntactic dependency parsing in multiple languages, along with latent graph inference on non-linguistic data, using various settings of a $k$-hop parser. We train $N$-layer stacked BiLSTMs and evaluate the parser's performance with and without normalizing biaffine scores. Normalizing allows us to achieve state-of-the-art performance with fewer samples and trainable parameters. Code: https://github.com/paolo-gajo/EfficientSDP
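
To make the mechanism concrete, below is a minimal sketch of biaffine arc scoring with optional score normalization, assuming Transformer-style $1/\sqrt{d}$ scaling; the paper's exact normalization scheme may differ, and all tensor names here are illustrative rather than taken from the released code.

```python
import torch

def biaffine_scores(h_head, h_dep, U, b, normalize=True):
    """Score every (head, dependent) pair in a sentence.

    h_head, h_dep: (n, d) head/dependent representations (e.g. from a BiLSTM)
    U:             (d, d) biaffine weight matrix
    b:             (d,)   bias term for the head side
    Returns an (n, n) matrix of attachment probabilities (softmax over heads).
    """
    # Raw biaffine scores: s_ij = h_dep_i^T U h_head_j + h_head_j . b
    scores = h_dep @ U @ h_head.T + h_head @ b          # (n, n)
    if normalize:
        # Transformer-style scaling keeps the softmax inputs low-variance,
        # so the distribution over candidate heads stays comparatively soft.
        scores = scores / h_head.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1)

# Toy usage with random representations for a 5-token sentence.
n, d = 5, 16
h = torch.randn(n, d)
probs = biaffine_scores(h, h, torch.randn(d, d), torch.randn(d), normalize=True)
print(probs.sum(dim=-1))  # each row sums to 1
```
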
Keywords
semantic dependency parsing, normalization, biaffine attention

Reviews and Discussion

Review (Rating: 3)

This work studies the effect of normalization on the scores produced by biaffine functions in dependency parsing tasks. The authors demonstrate, both theoretically and empirically, that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high-variance inputs to the biaffine scoring function. Experimental results show that similar or better performance can be obtained while reducing the number of trained BiLSTM parameters by as much as 85%.

Strengths and Weaknesses

Strengths:

  1. This paper explores a problem that is often overlooked. The biaffine function is a standard component, and few people pay attention to it or to its relationship with parameter count.
  2. Comprehensive experiments.

Weaknesses:

  1. Although this research question is very detailed and easily overlooked, I don't think this contribution meets the acceptance bar of NeurIPS.
  2. Are there consistent experimental results on datasets in other languages? The experimental conclusions should be verified on more languages.
  3. What are the experimental results if a normalization layer is added?

Questions

See the weaknesses.

Limitations

yes

Final Justification

First, I do not think the contribution is well suited for NeurIPS. Second, even in the *ACL community, the contribution is not strong enough. It might be better suited for a workshop oriented toward syntactic parsing.

Formatting Issues

N/A

Author Response

Q.1: Although this research question is very detailed and easily overlooked, I don't think this contribution meets the acceptance bar of NeurIPS.

A.1: We agree with the reviewer that the topic of our contribution is indeed overlooked. Thus, our work is novel in terms of improving dependency parsing as an application. Furthermore, our contribution involves analyzing normalization phenomena in architectures used broadly in graph inference tasks (i.e., modeling edges as fully connected weighted graphs via attention mechanisms, such as biaffine transformations, see [1,2,3] for reference). To show this, we have repeated the experiment using Graph Attention Networks (GAT), using a stack of biaffine/GAT pairs to encode higher-order dependencies before the final biaffine layer. The results with this new architecture are consistent, as shown in the table below. Moreover, we have expanded the experiments to six multilingual settings as reported in our answer to the reviewer's next comment.

| norm | $L_\phi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.526 ± 0.046 | 0.517 ± 0.038 | 0.123 ± 0.050 | 0.610 ± 0.009 | 0.727 ± 0.006 | 0.536 ± 0.013 |
| 0 | 1 | 0.493 ± 0.037 | 0.509 ± 0.022 | 0.039 ± 0.020 | 0.583 ± 0.011 | 0.731 ± 0.009 | 0.599 ± 0.008 |
| 0 | 2 | 0.485 ± 0.033 | 0.509 ± 0.025 | 0.029 ± 0.041 | 0.540 ± 0.012 | 0.710 ± 0.013 | 0.611 ± 0.015 |
| 0 | 3 | 0.481 ± 0.037 | 0.487 ± 0.040 | 0.000 ± 0.000 | 0.511 ± 0.015 | 0.700 ± 0.010 | 0.585 ± 0.009 |
| 1 | 0 | 0.574 ± 0.028 | 0.556 ± 0.037 | 0.156 ± 0.031 | 0.671 ± 0.007 | 0.778 ± 0.006 | 0.607 ± 0.009 |
| 1 | 1 | 0.550 ± 0.036 | 0.587 ± 0.015 | 0.148 ± 0.028 | 0.696 ± 0.005 | 0.809 ± 0.008 | 0.639 ± 0.007 |
| 1 | 2 | 0.563 ± 0.029 | 0.534 ± 0.023 | 0.128 ± 0.029 | 0.657 ± 0.007 | 0.795 ± 0.008 | 0.637 ± 0.007 |
| 1 | 3 | 0.514 ± 0.026 | 0.543 ± 0.024 | 0.071 ± 0.021 | 0.596 ± 0.010 | 0.770 ± 0.007 | 0.616 ± 0.006 |

Score normalization also improves performance when using GATs, showing that our method generalizes to architectures other than Transformers and BiLSTMs. We have included these results in the paper.
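
The exact wiring of the biaffine/GAT stack is not spelled out in the rebuttal; purely as a hypothetical illustration, one biaffine/GAT pair could look like the following (plain PyTorch, single attention head; the class and variable names are assumptions, not the authors' implementation, and the $1/\sqrt{d}$ scaling is likewise assumed).

```python
import torch
import torch.nn as nn

class BiaffineGATBlock(nn.Module):
    """One hypothetical biaffine/GAT pair: score all token pairs with a
    biaffine map, then propagate token states over the resulting soft graph."""

    def __init__(self, d, normalize=True):
        super().__init__()
        self.U = nn.Parameter(torch.empty(d, d))
        nn.init.xavier_uniform_(self.U)
        self.proj = nn.Linear(d, d)
        self.normalize = normalize

    def forward(self, h):                        # h: (n, d) token states
        scores = h @ self.U @ h.T                # (n, n) biaffine edge scores
        if self.normalize:
            scores = scores / h.size(-1) ** 0.5  # assumed 1/sqrt(d) score normalization
        attn = torch.softmax(scores, dim=-1)     # soft adjacency over candidate heads
        return torch.relu(self.proj(attn @ h))   # GAT-style weighted aggregation

# Hypothetically, L_phi = 2 stacked pairs before a final biaffine scorer.
blocks = nn.Sequential(*[BiaffineGATBlock(d=16) for _ in range(2)])
h = torch.randn(7, 16)
print(blocks(h).shape)  # torch.Size([7, 16])
```
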

In addition to our technical contribution, we believe that our work contributes to the broad NeurIPS community by highlighting the importance of specific design choices such as normalization in the training of neural network models. Moreover, we would like to respectfully point out that the NeurIPS acceptance guidelines describe a grade-5 submission as having "high impact on at least one sub-area of AI", meaning that broad impact is not strictly required for acceptance. We appreciate the reviewer's consideration in this regard.

[1] Kazi et al. "Differentiable Graph Module (DGM) for Graph Convolutional Networks". 2023.

[2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks". 2017.

[3] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GAAN: Gated attention networks for learning on large and spatiotemporal graphs". 2018.

Q.2: Are there consistent experimental results on datasets in other languages? The experimental conclusions should be verified on more languages.

A.2: We thank the reviewer for pointing this out. We performed multilingual experiments using six non-English UD datasets (UD_Arabic-PADT, UD_Chinese-GSD, UD_Italian-ISDT, UD_Japanese-GSD, UD_Spanish-AnCora, UD_Wolof-WTB) and verified that our finding is consistent, as shown in the table below.

| score norm | $L_\psi$ | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| × | 0 | 0.538 ± 0.005 | 0.395 ± 0.007 | 0.563 ± 0.006 | 0.493 ± 0.010 | 0.554 ± 0.004 | 0.252 ± 0.007 |
| × | 1 | 0.723 ± 0.008 | 0.653 ± 0.007 | 0.792 ± 0.003 | 0.812 ± 0.003 | 0.775 ± 0.003 | 0.525 ± 0.006 |
| × | 2 | 0.745 ± 0.004 | 0.710 ± 0.005 | 0.826 ± 0.003 | 0.844 ± 0.005 | 0.807 ± 0.002 | 0.587 ± 0.007 |
| × | 3 | 0.748 ± 0.004 | 0.717 ± 0.007 | 0.832 ± 0.002 | 0.849 ± 0.003 | 0.808 ± 0.002 | 0.614 ± 0.005 |
| ✓ | 0 | 0.609 ± 0.003 | 0.479 ± 0.007 | 0.633 ± 0.002 | 0.585 ± 0.005 | 0.620 ± 0.002 | 0.305 ± 0.005 |
| ✓ | 1 | 0.737 ± 0.007 | 0.693 ± 0.003 | 0.820 ± 0.004 | 0.838 ± 0.004 | 0.801 ± 0.002 | 0.556 ± 0.003 |
| ✓ | 2 | 0.758 ± 0.003 | 0.736 ± 0.005 | 0.842 ± 0.004 | 0.859 ± 0.004 | 0.822 ± 0.001 | 0.613 ± 0.008 |
| ✓ | 3 | 0.759 ± 0.005 | 0.742 ± 0.006 | 0.845 ± 0.002 | 0.859 ± 0.003 | 0.823 ± 0.002 | 0.633 ± 0.005 |

Q.3: What are the experimental results if a normalization layer is added?

A.3: As regards the addition of LayerNorm, experimental results on the effect of adding normalization layers were already present in the original manuscript. We refer the reviewer to Appendix B.5 for our analysis. In Table 8, we show that adding LayerNorm layers between the BiLSTM layers is detrimental on average, over all datasets.
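
To make the distinction concrete, the sketch below contrasts the two interventions: LayerNorm interleaved between stacked BiLSTM layers (the Appendix B.5 variant referred to above) versus normalizing only the biaffine scores. The wiring and the $1/\sqrt{d}$ scaling are minimal assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

d = 64

# Variant A (Appendix B.5): LayerNorm applied to the hidden states
# between stacked BiLSTM layers -- reported as detrimental on average.
bilstm1 = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
bilstm2 = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
norm = nn.LayerNorm(d)

x = torch.randn(1, 10, d)                 # (batch, tokens, features)
h, _ = bilstm1(x)
h, _ = bilstm2(norm(h))                   # normalize hidden states, not scores

# Variant B (the paper's proposal): leave the encoder alone and rescale
# the biaffine scores themselves before the softmax.
U = torch.randn(d, d)
scores = (h[0] @ U @ h[0].T) / d ** 0.5   # score normalization (assumed 1/sqrt(d))
probs = torch.softmax(scores, dim=-1)
```
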

Comment

Thank you for your detailed rebuttal. While I appreciate the additional experimentation you have done with GATs and the multilingual datasets, I maintain my original rating.

My main concern remains the limited theoretical contribution. While this work demonstrates that normalization of biaffine score functions can reduce over-parameterization, this finding is limited. Its core finding, that normalization helps handle high-variance inputs to the softmax, aligns with well-established principles in deep learning.

The expanded experiments strengthen the empirical validation, but they do not fundamentally change the nature of the contribution. The 85% parameter reduction is impressive, yet it primarily benefits a specific parsing architecture rather than providing broader insights applicable across different domains or tasks.

While the work is technically solid, I believe the contribution falls short of the significance threshold typically expected for acceptance.

Review (Rating: 4)

This paper investigates the impact of score normalization in biaffine dependency parsing models. The work argues that current dependency parsers using biaffine scoring are overparameterized because they do not apply the score normalization typically used in Transformer attention mechanisms. The paper provides theoretical justification through implicit regularization theory, showing that deeper networks reduce weight-matrix rank and consequently score variance. It demonstrates that normalizing biaffine scores can achieve comparable or even better performance with up to 85% fewer parameters.

Strengths and Weaknesses

Strengths:

  1. The connection between Transformer attention normalization and biaffine scoring is well-motivated.
  2. The paper includes experiments on 6 diverse datasets, covering both semantic (ADE, CoNLL04, SciERC, ERFGC) and syntactic (enEWT, SciDTB) dependency parsing. It achieves similar performance with an 85% parameter reduction and includes a systematic exploration of different layer depths, hyperparameters, architectural components, etc.

Weaknesses:

  1. While Claim 1 provides intuition, the proof is somewhat informal. The connection between rank reduction and variance reduction could be more rigorously established.
  2. The normalization benefits vary significantly across datasets. In addition, the main focus of the paper on BiLSTM-based architectures may limit applicability to modern Transformer-based parsers.

Questions

No.

Limitations

The paper includes very limited discussion of limitations. I would suggest adding some, such as the potential negative impact on datasets where current deep architectures are already well tuned, etc.

Formatting Issues

No.

Author Response

Q.1: While Claim 1 provides intuition, the proof is somewhat informal. The connection between rank reduction and variance reduction could be more rigorously established.

A.1: We thank the reviewer for pointing out that our proof (and claim) could be made more rigorous. We have restated the claim and proof formally with a more standard analysis of covariance using the trace. Please see below:

Claim 1 (Monotonic Increase of Output Variance with Rank)

Let $X \in \mathbb{R}^n$ be a random vector with covariance matrix $\mathbf{K_{xx}} := \text{Cov}(X) \in \mathbb{R}^{n \times n}$, and let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a fixed matrix with singular value decomposition (SVD):

$$\mathbf{A} = \sum_{i=1}^{\min(m, n)} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\top}$$

where $\sigma_1 \geq \sigma_2 \geq \ldots \geq 0$. Define the best rank-$r$ approximation of $\mathbf{A}$ by truncated SVD as:

$$\mathbf{A_r} := \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^{\top}$$

Let $Y_r = \mathbf{A}_r X \in \mathbb{R}^m$ denote the image of $X$ under the rank-$r$ linear map. Then the total variance of $Y_r$, as measured by the trace of its covariance matrix, increases monotonically with $r$:

$$\text{tr}(\text{Cov}(Y_r)) \leq \text{tr}(\text{Cov}(Y_{r + 1}))$$

Proof. Let $\mathbf{K_{xx}} := \text{Cov}(X) \in \mathbb{R}^{n \times n}$ denote the covariance matrix of the random vector $X$. Since covariance matrices are symmetric and positive semi-definite (PSD), we have $\mathbf{K_{xx}} \succeq 0$. Let $\mathbf{A_r} := \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^{\top}$ be the rank-$r$ truncated SVD of $\mathbf{A}$. Then the covariance matrix of $Y_r := \mathbf{A_r} X$ is:

$$\text{Cov}(Y_r) = \mathbf{A_r} \mathbf{K_{xx}} \mathbf{A_r}^{\top}.$$

To evaluate the total variance we compute the trace:

$$\text{tr}\left(\text{Cov}(Y_r)\right) = \text{tr}\left(\mathbf{A_r}\mathbf{K_{xx}} \mathbf{A_r}^{\top}\right).$$

Using the cyclic property of the trace $\left(\text{tr}(ABC) = \text{tr}(BCA)\right)$ and the symmetry of $\mathbf{K_{xx}}$, we get:

$$\text{tr}\left(\mathbf{A_r}\mathbf{K_{xx}} \mathbf{A_r}^{\top}\right) = \text{tr}\left(\mathbf{K_{xx}} \mathbf{A_r}^{\top} \mathbf{A_r} \right).$$

Now define the matrix $\mathbf{M_r} := \mathbf{A_r}^{\top} \mathbf{A_r} \in \mathbb{R}^{n \times n}$, which is PSD. As $r$ increases, $\mathbf{A_r}$ includes more terms in its truncated SVD, so we have:

$$\mathbf{M_{r}} = \sum_{i=1}^r \sigma^2_i \mathbf{v_i} \mathbf{v_i}^{\top}, \quad \mathbf{M_{r+1}} = \mathbf{M_r} + \sigma^2_{r+1} \mathbf{v_{r+1}} \mathbf{v_{r+1}}^{\top}.$$

Thus:

$$\mathbf{M}_{r+1} \succeq \mathbf{M}_r.$$

Since $\mathbf{K_{xx}} \succeq 0$, and since the trace of a product of PSD matrices respects the Loewner order (i.e., if $A \preceq B$, then $\text{tr}(CA) \leq \text{tr}(CB)$ for all $C \succeq 0$), we conclude:

$$\text{tr}(\mathbf{K_{xx}} \mathbf{M_r}) \leq \text{tr}(\mathbf{K_{xx}}\mathbf{M_{r+1}}),$$

which implies:

$$\text{tr}\left(\text{Cov}(Y_r)\right) \leq \text{tr}\left(\text{Cov}(Y_{r+1})\right).$$

Therefore, as the rank $r$ increases, the total variance of $Y_r$ increases. $\square$
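
As a quick numerical sanity check of the restated claim (separate from the formal argument), the monotonicity of the trace can be verified with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 6

# Random PSD covariance for X and a random map A with SVD A = U S V^T.
B = rng.normal(size=(n, n))
K_xx = B @ B.T
A = rng.normal(size=(m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

prev = -np.inf
for r in range(1, min(m, n) + 1):
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]          # rank-r truncated SVD
    total_var = np.trace(A_r @ K_xx @ A_r.T)   # tr(Cov(Y_r))
    assert total_var >= prev - 1e-9            # monotone non-decreasing in r
    prev = total_var
    print(r, round(total_var, 3))
```
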

Q.2.1: The normalization benefits vary significantly across datasets.

A.2.1: Thank you for raising this point. In order to show the generalizability of our results, we have carried out six additional experiments on multilingual syntactic dependency parsing, showing consistent results for a wide variety of languages.

| score norm | $L_\psi$ | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| × | 0 | 0.538 ± 0.005 | 0.395 ± 0.007 | 0.563 ± 0.006 | 0.493 ± 0.010 | 0.554 ± 0.004 | 0.252 ± 0.007 |
| × | 1 | 0.723 ± 0.008 | 0.653 ± 0.007 | 0.792 ± 0.003 | 0.812 ± 0.003 | 0.775 ± 0.003 | 0.525 ± 0.006 |
| × | 2 | 0.745 ± 0.004 | 0.710 ± 0.005 | 0.826 ± 0.003 | 0.844 ± 0.005 | 0.807 ± 0.002 | 0.587 ± 0.007 |
| × | 3 | 0.748 ± 0.004 | 0.717 ± 0.007 | 0.832 ± 0.002 | 0.849 ± 0.003 | 0.808 ± 0.002 | 0.614 ± 0.005 |
| ✓ | 0 | 0.609 ± 0.003 | 0.479 ± 0.007 | 0.633 ± 0.002 | 0.585 ± 0.005 | 0.620 ± 0.002 | 0.305 ± 0.005 |
| ✓ | 1 | 0.737 ± 0.007 | 0.693 ± 0.003 | 0.820 ± 0.004 | 0.838 ± 0.004 | 0.801 ± 0.002 | 0.556 ± 0.003 |
| ✓ | 2 | 0.758 ± 0.003 | 0.736 ± 0.005 | 0.842 ± 0.004 | 0.859 ± 0.004 | 0.822 ± 0.001 | 0.613 ± 0.008 |
| ✓ | 3 | 0.759 ± 0.005 | 0.742 ± 0.006 | 0.845 ± 0.002 | 0.859 ± 0.003 | 0.823 ± 0.002 | 0.633 ± 0.005 |

Furthermore, we believe it is important to take into account that we found the difference in performance to be statistically significant on all datasets, as shown in Figure 3.

Q.2.2: In addition, the main focus of the paper on BiLSTM-based architectures may limit applicability to modern Transformer-based parsers.

A.2.2: We would like to point out that the parser with which we experiment is Transformer-based, with the Dozat and Manning biaffine layer still being the standard for dependency parsing. Even without using BiLSTM layers and fully fine-tuning the Transformer, we have found that normalization provides statistically significant benefits, as shown in Figure 4 for the SciERC dataset. In Appendix B.6, we also show this more generally on all four SemDP datasets with BERT-base, BERT-large, DeBERTa-base, and DeBERTa-large. The effect is particularly pronounced for DeBERTa-base.

Review (Rating: 3)

This work presents theoretical evidence and empirical results for the influence of scaling the scores produced by biaffine transformations in dependency parsing tasks. The authors find that the score variance produced by a lack of score scaling hurts model performance when predicting edges and relations. Experimental results demonstrate that dependency parsing models equipped with normalization can obtain better performance with substantially fewer trained parameters.

Strengths and Weaknesses

Strengths:

  1. The biaffine score function is a common and default component in dependency parsing. The authors are careful to notice a potential problem in it, and they analyze the relation between normalization and parameter count from theoretical and experimental perspectives.
  2. This paper conducts extensive experiments to verify its claim.

Weaknesses:

  1. I wonder whether this finding is still effective in other languages, especially low-resource languages.
  2. This research question is too minor to have a significant impact on the broad research community. I think the main research content of this paper is not suitable for publication in NeurIPS. I recommend that the author pay attention to more targeted venues, so that researchers in specific sub-community can pay attention to this work.
  3. Please change eh, ed, rh, and rd in Figure 2 to mathematical vector format (i.e., $e^h$, $e^d$, $r^h$, and $r^d$), so they are consistent with the text and equations.

Questions

See the weaknesses

Limitations

yes

Final Justification

The work lags behind the current state of the art in dependency parsing.

Formatting Issues

none

Author Response

Q.1: I wonder whether this finding is still effective in other languages, especially low-resource languages.

A.1: We thank the reviewer for suggesting multilingual settings. We have run the model on six non-English UD datasets (UD_Arabic-PADT, UD_Chinese-GSD, UD_Italian-ISDT, UD_Japanese-GSD, UD_Spanish-AnCora, UD_Wolof-WTB) and verified that our findings are consistent. We summarize the multilingual results below and have added them to the paper.

| score norm | $L_\psi$ | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| × | 0 | 0.538 ± 0.005 | 0.395 ± 0.007 | 0.563 ± 0.006 | 0.493 ± 0.010 | 0.554 ± 0.004 | 0.252 ± 0.007 |
| × | 1 | 0.723 ± 0.008 | 0.653 ± 0.007 | 0.792 ± 0.003 | 0.812 ± 0.003 | 0.775 ± 0.003 | 0.525 ± 0.006 |
| × | 2 | 0.745 ± 0.004 | 0.710 ± 0.005 | 0.826 ± 0.003 | 0.844 ± 0.005 | 0.807 ± 0.002 | 0.587 ± 0.007 |
| × | 3 | 0.748 ± 0.004 | 0.717 ± 0.007 | 0.832 ± 0.002 | 0.849 ± 0.003 | 0.808 ± 0.002 | 0.614 ± 0.005 |
| ✓ | 0 | 0.609 ± 0.003 | 0.479 ± 0.007 | 0.633 ± 0.002 | 0.585 ± 0.005 | 0.620 ± 0.002 | 0.305 ± 0.005 |
| ✓ | 1 | 0.737 ± 0.007 | 0.693 ± 0.003 | 0.820 ± 0.004 | 0.838 ± 0.004 | 0.801 ± 0.002 | 0.556 ± 0.003 |
| ✓ | 2 | 0.758 ± 0.003 | 0.736 ± 0.005 | 0.842 ± 0.004 | 0.859 ± 0.004 | 0.822 ± 0.001 | 0.613 ± 0.008 |
| ✓ | 3 | 0.759 ± 0.005 | 0.742 ± 0.006 | 0.845 ± 0.002 | 0.859 ± 0.003 | 0.823 ± 0.002 | 0.633 ± 0.005 |

Q.2: This research question is too minor to have a significant impact on the broad research community. I think the main research content of this paper is not suitable for publication in NeurIPS. I recommend that the author pay attention to more targeted venues, so that researchers in specific sub-community can pay attention to this work.

A.2: While it is true that semantic and syntactic dependency parsing are specific to communities in NLP, our paper contributes towards analyzing normalization phenomena in architectures used broadly in graph inference tasks (modeling edges as fully connected weighted graphs via attention mechanisms such as biaffine transformations, see [1,2,3] for reference). To further demonstrate this, we have conducted an experiment using Graph Attention Networks (GAT) by adding a stack of biaffine/GAT pairs. The purpose of this is to encode higher-order dependencies before the final biaffine layer. The results with this new architecture are consistent, as shown in the table below.

| norm | $L_\phi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| × | 0 | 0.526 ± 0.046 | 0.517 ± 0.038 | 0.123 ± 0.050 | 0.610 ± 0.009 | 0.727 ± 0.006 | 0.536 ± 0.013 |
| × | 1 | 0.493 ± 0.037 | 0.509 ± 0.022 | 0.039 ± 0.020 | 0.583 ± 0.011 | 0.731 ± 0.009 | 0.599 ± 0.008 |
| × | 2 | 0.485 ± 0.033 | 0.509 ± 0.025 | 0.029 ± 0.041 | 0.540 ± 0.012 | 0.710 ± 0.013 | 0.611 ± 0.015 |
| × | 3 | 0.481 ± 0.037 | 0.487 ± 0.040 | 0.000 ± 0.000 | 0.511 ± 0.015 | 0.700 ± 0.010 | 0.585 ± 0.009 |
| ✓ | 0 | 0.574 ± 0.028 | 0.556 ± 0.037 | 0.156 ± 0.031 | 0.671 ± 0.007 | 0.778 ± 0.006 | 0.607 ± 0.009 |
| ✓ | 1 | 0.550 ± 0.036 | 0.587 ± 0.015 | 0.148 ± 0.028 | 0.696 ± 0.005 | 0.809 ± 0.008 | 0.639 ± 0.007 |
| ✓ | 2 | 0.563 ± 0.029 | 0.534 ± 0.023 | 0.128 ± 0.029 | 0.657 ± 0.007 | 0.795 ± 0.008 | 0.637 ± 0.007 |
| ✓ | 3 | 0.514 ± 0.026 | 0.543 ± 0.024 | 0.071 ± 0.021 | 0.596 ± 0.010 | 0.770 ± 0.007 | 0.616 ± 0.006 |

As with BiLSTMs, normalization improves the scores in the case of GATs as well, showing that our method generalizes to other architectures besides Transformers and BiLSTMs. We have included these results in the paper.

In addition to our technical contribution, we believe that our work contributes to the broad NeurIPS community by highlighting the importance of specific design choices such as normalization in the training of neural network models. Moreover, we would like to respectfully point out that the NeurIPS acceptance guidelines describe a grade-5 submission as having "high impact on at least one sub-area of AI", meaning that broad impact is not strictly required for acceptance. We appreciate the reviewer's consideration in this regard.

  • [1] Kazi et al. "Differentiable Graph Module (DGM) for Graph Convolutional Networks". 2023.

  • [2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks". 2017.

  • [3] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GAAN: Gated attention networks for learning on large and spatiotemporal graphs". 2018.

Q.3: Please change eh, ed, rh, and rd in Figure 2 to mathematical vector format (i.e., $e^h$, $e^d$, $r^h$, and $r^d$), so they are consistent with the text and equations.

A.3: We have amended the formatting oversight as requested.

Review (Rating: 5)

It seems that normalizing biaffine scores for BiLSTMs gives a significant improvement in terms of performance and parameter efficiency. This claim is supported by running the dependency parsing task on multiple datasets, with several ablation studies. A possible reason for this is also discussed in Section 3.

Strengths and Weaknesses

Strengths

  • one possible cause of this result is discussed sufficiently. Claim: more layers may keep the variance in check, so normalization is less needed as small singular values decay.
  • many datasets are utilized, covering dependency parsing of different kinds (recipes, news, drug adverse effects, etc.)
  • normalizing is shown to help through thorough experiments, with the significance of outcomes measured satisfactorily
  • parameter efficiency in one particular case is shown to improve by up to 85% with normalization, where a single normalized layer is enough to match the performance of three layers without normalization

Weaknesses

  • for now, only shown for the dependency parsing task
  • while a possible reason for the raw version performing acceptably with a higher number of layers is discussed, it is not known why in some cases the performance of the normalized version drops and exhibits high variance when the number of layers is increased (e.g. Appendix, line 489) [See the related Question as well]
  • the gain for tagging tasks is not that significant, but this is adequately discussed

Questions

  • We see in some cases, e.g. in Figure 5, that the normalized version seems to drop in performance compared to the raw version as the number of layers is increased. Do you have any guesses as to why that might be?

Limitations

We don't see a separate section for limitations, but in my opinion limitations are adequately and honestly discussed: some alongside other text (e.g. the size and kind of dataset, when introduced), some in the conclusions (last paragraph), and some in the appendix.

Final Justification

I thank the authors for the additional information provided.

I have read other reviews and I see the main concerns, and here are my thoughts about them:

  • Results on other languages e.g. low resource ones

Although the authors provide some results on more languages, I believe that is a separate question. The paper is a good enough contribution for high-resource languages alone.

  • The contribution being simple

Yes perhaps, it is quite simple. I believe that is not grounds for rejection. And simple solutions should be published to benefit the wider ML / NLP community.

  • The contribution being too niche and perhaps better suited for more niche venues

Maybe. I am not sure whether it falls somewhat outside the scope of NeurIPS.

Formatting Issues

  • See line 38: you probably meant to write Std([something])?
  • Line 208 says "UD". The terms UD and enEWT are sometimes used interchangeably; the table referred to uses enEWT.
  • It might be better for some graphs, if possible, to use a false origin on the y-axis for better readability, e.g. in Figure 7. This is only a suggestion, if easily possible.
Author Response

We thank the reviewer for their insightful review.

Q.1: For now, only shown for the dependency parsing task.

A.1: We acknowledge that our work targets semantic and syntactic dependency parsing tasks only. This was purposeful, because this variant (Dozat and Manning) of biaffine scoring is widespread in these tasks in particular. While we plan to extend our normalization analysis to graph inference broadly (e.g. using QM9 for small molecules or citation networks), we believe that the current work presents a solid and focused contribution using one of the most common and SOTA architectures. Moreover, extending our analysis to non-linguistic tasks would involve many architectural variants (e.g. GNN types) that we feel are better covered in a separate work built on the contribution of the current one. As an initial step towards using Graph Attention Networks (GAT), we have repeated the experiment by adding a stack of biaffine/GAT pairs to encode higher-order dependencies before the final biaffine layer. We achieve consistent results, as shown in the table below. The performance increases for $L_\phi \in \{1, 2\}$, compared to $L_\phi = 0$, but drops again at $L_\phi = 3$:

| norm | $L_\phi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 0 | 0.526 ± 0.046 | 0.517 ± 0.038 | 0.123 ± 0.050 | 0.610 ± 0.009 | 0.727 ± 0.006 | 0.536 ± 0.013 |
| n | 1 | 0.493 ± 0.037 | 0.509 ± 0.022 | 0.039 ± 0.020 | 0.583 ± 0.011 | 0.731 ± 0.009 | 0.599 ± 0.008 |
| n | 2 | 0.485 ± 0.033 | 0.509 ± 0.025 | 0.029 ± 0.041 | 0.540 ± 0.012 | 0.710 ± 0.013 | 0.611 ± 0.015 |
| n | 3 | 0.481 ± 0.037 | 0.487 ± 0.040 | 0.000 ± 0.000 | 0.511 ± 0.015 | 0.700 ± 0.010 | 0.585 ± 0.009 |
| y | 0 | 0.574 ± 0.028 | 0.556 ± 0.037 | 0.156 ± 0.031 | 0.671 ± 0.007 | 0.778 ± 0.006 | 0.607 ± 0.009 |
| y | 1 | 0.550 ± 0.036 | 0.587 ± 0.015 | 0.148 ± 0.028 | 0.696 ± 0.005 | 0.809 ± 0.008 | 0.639 ± 0.007 |
| y | 2 | 0.563 ± 0.029 | 0.534 ± 0.023 | 0.128 ± 0.029 | 0.657 ± 0.007 | 0.795 ± 0.008 | 0.637 ± 0.007 |
| y | 3 | 0.514 ± 0.026 | 0.543 ± 0.024 | 0.071 ± 0.021 | 0.596 ± 0.010 | 0.770 ± 0.007 | 0.616 ± 0.006 |

Normalization improves the scores in the case of GATs as well, showing that our method generalizes to other architectures besides Transformers and BiLSTMs. We have included these results in the paper.

Q.2: while a possible reason for raw version performing okay with higher number of layers is discussed, it is not known why in some cases the performance of normalized version drops and exhibits high variance when number of layers is increased

A.2: We have repeated the experiment with $L_\psi > 3$ over three seeds and increased the number of training steps to 20,000. All of the models made it to at least 6k steps (but not all of them made it to 8k or further, due to early stopping). We show the results at 2k, 4k, and 6k steps below (rows without high variances were removed due to the character limit):

2000 steps:

| norm | $L_\psi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.606 ± 0.012 | 0.731 ± 0.004 | 0.260 ± 0.026 | 0.811 ± 0.007 | 0.898 ± 0.006 | 0.698 ± 0.006 |
| n | 8 | 0.594 ± 0.012 | 0.712 ± 0.012 | 0.248 ± 0.002 | 0.783 ± 0.003 | 0.882 ± 0.007 | 0.692 ± 0.010 |
| n | 10 | 0.565 ± 0.012 | 0.701 ± 0.018 | 0.233 ± 0.022 | 0.718 ± 0.049 | 0.839 ± 0.034 | 0.441 ± 0.312 |
| y | 9 | 0.215 ± 0.289 | 0.229 ± 0.324 | 0.260 ± 0.039 | 0.791 ± 0.006 | 0.883 ± 0.007 | 0.450 ± 0.311 |
| y | 10 | 0.386 ± 0.261 | 0.622 ± 0.018 | 0.168 ± 0.120 | 0.769 ± 0.016 | 0.879 ± 0.009 | 0.648 ± 0.006 |

4000 steps:

| norm | $L_\psi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.551 ± 0.150 | 0.719 ± 0.017 | 0.284 ± 0.032 | 0.833 ± 0.022 | 0.907 ± 0.010 | 0.700 ± 0.005 |
| n | 8 | 0.507 ± 0.227 | 0.711 ± 0.008 | 0.271 ± 0.023 | 0.813 ± 0.030 | 0.895 ± 0.014 | 0.694 ± 0.008 |
| n | 10 | 0.588 ± 0.036 | 0.701 ± 0.033 | 0.262 ± 0.035 | 0.755 ± 0.059 | 0.865 ± 0.037 | 0.442 ± 0.313 |
| y | 9 | 0.279 ± 0.270 | 0.244 ± 0.322 | 0.289 ± 0.041 | 0.813 ± 0.022 | 0.894 ± 0.012 | 0.457 ± 0.311 |
| y | 10 | 0.399 ± 0.278 | 0.651 ± 0.032 | 0.165 ± 0.123 | 0.797 ± 0.030 | 0.890 ± 0.014 | 0.656 ± 0.009 |

6000 steps:

| norm | $L_\psi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.581 ± 0.130 | 0.712 ± 0.021 | 0.307 ± 0.041 | 0.846 ± 0.026 | 0.910 ± 0.010 | 0.702 ± 0.006 |
| n | 8 | 0.466 ± 0.254 | 0.708 ± 0.009 | 0.292 ± 0.039 | 0.829 ± 0.033 | 0.901 ± 0.015 | 0.695 ± 0.011 |
| n | 10 | 0.592 ± 0.036 | 0.549 ± 0.295 | 0.288 ± 0.047 | 0.779 ± 0.062 | 0.877 ± 0.035 | 0.448 ± 0.317 |
| y | 9 | 0.329 ± 0.282 | 0.242 ± 0.328 | 0.310 ± 0.047 | 0.827 ± 0.027 | 0.899 ± 0.012 | 0.455 ± 0.311 |
| y | 10 | 0.410 ± 0.280 | 0.669 ± 0.037 | 0.187 ± 0.141 | 0.813 ± 0.034 | 0.896 ± 0.014 | 0.658 ± 0.010 |

The results show that once a model fails to find a good trajectory at the start, it likely remains on a poor trajectory. For example, the model with $L_\psi \in \{7, 8\}$ and norm = n does fine on CoNLL04 at 2,000 steps; however, one of the seeds then fails randomly at 4,000 steps and the performance remains poor past that point. For ADE, we can see that model performance with $L_\psi = 10$ and norm = n is fine at 2k and 4k steps, but suddenly drops with higher variance at 6k steps. Once again, this is because the model suddenly breaks and then remains broken for a specific seed. The table below does not show some of the poor performances because it reports the results of each seed at its best validation performance, up to 20k steps:

| norm | $L_\psi$ | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.668 ± 0.015 | 0.722 ± 0.016 | 0.380 ± 0.032 | 0.894 ± 0.002 | 0.929 ± 0.002 | 0.706 ± 0.003 |
| n | 8 | 0.625 ± 0.028 | 0.707 ± 0.025 | 0.392 ± 0.016 | 0.893 ± 0.002 | 0.926 ± 0.002 | 0.699 ± 0.007 |
| n | 10 | 0.635 ± 0.038 | 0.720 ± 0.011 | 0.373 ± 0.016 | 0.879 ± 0.007 | 0.917 ± 0.002 | 0.454 ± 0.321 |
| y | 9 | 0.454 ± 0.274 | 0.511 ± 0.313 | 0.384 ± 0.024 | 0.880 ± 0.001 | 0.916 ± 0.001 | 0.457 ± 0.306 |
| y | 10 | 0.478 ± 0.283 | 0.695 ± 0.022 | 0.237 ± 0.169 | 0.881 ± 0.001 | 0.911 ± 0.002 | 0.662 ± 0.015 |

Despite the random crashes, increasing the number of steps increased the performance and reduced the Std for deeper stacks, compared to the original manuscript. While we still see instability for CoNLL04 and SciERC, it is reduced compared to the results shown in the original manuscript in Figure 5. Furthermore, the high variance almost disappears for ADE and ERFGC when training for a higher number of steps. Even more importantly, the results are identical for enEWT and SciDTB, with and without normalization, hinting at the fact that this is not inherently a problem of the model, but has to do with the nature of the datasets. More specifically, we also see some instability for ERFGC without score normalization at $L_\psi = 10$, and at $L_\psi = 9$ with score normalization. Similarly high variance can be seen for ADE with normalization at $L_\psi = 9$ (but not 10). Given that the high variances are produced by a few zero or near-zero scores, the instability seems to mostly be a product of the higher number of layers.

We have also analyzed the behavior of the gradients w.r.t. layer depth. For SciERC (highest Std at $L_\psi = 10$ in the original manuscript), the table below shows the head and tail, over 14k steps, of the gradient norm for each of the 10 layers:

| step | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.002 | 0.004 | 0.008 | 0.020 | 0.048 | 0.115 | 0.279 | 0.638 | 1.607 |
| ... | | | | | | | | | | |
| 13999 | 0.070 | 0.039 | 0.018 | 0.014 | 0.013 | 0.010 | 0.012 | 0.013 | 0.036 | 0.072 |

Since the gradients even out, we do not believe they are the issue at hand here. It is likely that the higher parameter count of 10 layers simply necessitates more training steps.
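
For completeness, a minimal way to log per-layer gradient norms after a backward pass might look like the sketch below; the stacked-linear stand-in and all names are placeholders, not the parser's actual modules.

```python
import torch
import torch.nn as nn

# Stand-in for an L_psi = 10 stacked encoder.
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(10)])
x = torch.randn(4, 32)
for layer in layers:
    x = torch.tanh(layer(x))
loss = x.pow(2).mean()
loss.backward()

# Norm of the gradient collected per layer, as in the table above.
grad_norms = [
    torch.cat([p.grad.flatten() for p in layer.parameters()]).norm().item()
    for layer in layers
]
print([round(g, 3) for g in grad_norms])
```
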

Formatting:

We commit to fixing the issues raised by the reviewer regarding lines 38 and 208, and to improving the readability of Figs. 6 and 7.

Final Decision

The paper presents results for link prediction, in particular for dependency parsing, after implementing score normalisation. The authors motivate the normalisation theoretically based on the architecture and swap out a corresponding module and its required parameters, arguing that the lack of normalisation resulted in overparameterisation. In the rebuttal, the authors showed that their trade-off also holds for other (non-linguistic) tasks. While the paper is heavily motivated by the dependency parsing task, it is clear that the observations are about network topology and are not limited to this task. While the paper aligns with well-established principles in deep learning, the reviewers do not point out any redundancy in this particular finding. The task itself, dependency parsing, is a central task in NLP. For researchers without access to compute resources beyond a laptop, an 85% reduction in parameters is of central importance.