Dependency Parsing is More Parameter-Efficient with Normalization
Abstract
Reviews and Discussion
This work studies the effect of normalization on the scores produced by biaffine functions in dependency parsing tasks. The authors demonstrate, both theoretically and empirically, that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high-variance inputs to the biaffine scoring function. Experimental results show that similar or better performance can be obtained while reducing the number of trained BiLSTM parameters by as much as 85%.
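For readers unfamiliar with the setup, below is a minimal PyTorch sketch of biaffine arc scoring with the score normalization under study (dividing scores by $\sqrt{d}$, as in Transformer attention). This is an illustration based on the description above; the class name, bias handling, and initialization are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Sketch of a Dozat & Manning-style biaffine arc scorer.

    With normalize=True, scores are divided by sqrt(hidden_size): the
    attention-style score normalization studied in the paper.
    """

    def __init__(self, hidden_size: int, normalize: bool = True):
        super().__init__()
        # The extra column acts as a bias term on the head side.
        self.W = nn.Parameter(torch.empty(hidden_size, hidden_size + 1))
        nn.init.xavier_uniform_(self.W)
        self.scale = hidden_size ** 0.5 if normalize else 1.0

    def forward(self, dep: torch.Tensor, head: torch.Tensor) -> torch.Tensor:
        # dep, head: (batch, seq_len, hidden_size) token representations
        ones = torch.ones(*head.shape[:-1], 1, device=head.device)
        head = torch.cat([head, ones], dim=-1)
        # scores[b, i, j]: score of token j being the head of token i
        return torch.einsum("bid,de,bje->bij", dep, self.W, head) / self.scale

# Without the scaling, high-variance scores make the softmax over candidate
# heads very sharp, which the paper links to the need for extra parameters.
h = torch.randn(2, 6, 256)
head_probs = BiaffineScorer(256, normalize=True)(h, h).softmax(dim=-1)
```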
Strengths and Weaknesses
Strengths:
- This paper explores a problem that is often overlooked. The biaffine function is a standard component, and few people pay attention to it and its relationship to parameter count.
- Comprehensive experiments.
Weaknesses:
- Although this research question is very detailed and easily overlooked, I don't think this contribution meets the acceptance bar of NeurIPS.
- Are there consistent experimental results on datasets in other languages? The experimental conclusions should be verified on more languages.
- What are the experimental results if a normalization layer is added?
Questions
See the weaknesses.
Limitations
yes
Final Justification
First, I do not think the contribution is well-suited to NeurIPS. Second, even in the *ACL community, the contribution is not strong enough. It might be better suited for a workshop oriented toward syntactic parsing.
Formatting Issues
N/A
Q.1: Although this research question is very detailed and easily overlooked, I don't think this contribution meets the acceptance bar of NeurIPS.
A.1: We agree with the reviewer that the topic of our contribution is indeed overlooked; thus, our work is novel in terms of improving dependency parsing as an application. Furthermore, our contribution involves analyzing normalization phenomena in architectures used broadly in graph inference tasks (i.e., modeling edges as fully connected weighted graphs via attention mechanisms such as biaffine transformations; see [1,2,3] for reference). To show this, we have repeated the experiment with Graph Attention Networks (GAT), stacking biaffine/GAT pairs to encode higher-order dependencies before the final biaffine layer. The results with this new architecture are consistent, as shown in the table below. Moreover, we have expanded the experiments to six multilingual settings, as reported in our answer to the reviewer's next comment.
| norm | GAT layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.526 | 0.517 | 0.123 | 0.610 | 0.727 | 0.536 |
| 0 | 1 | 0.493 | 0.509 | 0.039 | 0.583 | 0.731 | 0.599 |
| 0 | 2 | 0.485 | 0.509 | 0.029 | 0.540 | 0.710 | 0.611 |
| 0 | 3 | 0.481 | 0.487 | 0.000 | 0.511 | 0.700 | 0.585 |
| 1 | 0 | 0.574 | 0.556 | 0.156 | 0.671 | 0.778 | 0.607 |
| 1 | 1 | 0.550 | 0.587 | 0.148 | 0.696 | 0.809 | 0.639 |
| 1 | 2 | 0.563 | 0.534 | 0.128 | 0.657 | 0.795 | 0.637 |
| 1 | 3 | 0.514 | 0.543 | 0.071 | 0.596 | 0.770 | 0.616 |
Score normalization improves performance also when using GATs, showing that our method generalizes to other architectures besides Transformers and BiLSTMs. We have included these results in the paper.
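For clarity, a minimal sketch of how such a biaffine/GAT pair could be wired, with the (optionally normalized) biaffine scores acting as a soft adjacency over the fully connected token graph. This is our reading of the setup described above; function names and dimensions are illustrative assumptions, not the released code.

```python
import torch

def biaffine_scores(h: torch.Tensor, W: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """Pairwise edge scores s[b, i, j] = h_i^T W h_j, optionally scaled by sqrt(d)."""
    s = torch.einsum("bid,de,bje->bij", h, W, h)
    return s / (h.size(-1) ** 0.5) if normalize else s

def gat_step(h: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """One GAT-style aggregation over the fully connected token graph."""
    attn = scores.softmax(dim=-1)   # attention weights over candidate neighbors
    return torch.relu(attn @ h)     # weighted neighbor aggregation

# Hypothetical stack of biaffine/GAT pairs encoding higher-order context:
h = torch.randn(2, 5, 64)           # (batch, tokens, hidden)
W = torch.randn(64, 64)
for _ in range(2):                  # number of biaffine/GAT pairs
    h = gat_step(h, biaffine_scores(h, W, normalize=True))
final_edge_scores = biaffine_scores(h, W, normalize=True)
```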
In addition to our technical contribution, we believe that our work benefits the broad NeurIPS community by highlighting the importance of often-overlooked settings, such as normalization, in the training of neural network models. Moreover, we would like to respectfully point out that the NeurIPS acceptance guidelines describe a score of 5 as a work with "high impact on at least one sub-area of AI", meaning that broad impact is not strictly required for acceptance. We appreciate the reviewer's consideration in this regard.
[1] Kazi et al., "Differentiable Graph Module (DGM) for Graph Convolutional Networks," 2023.
[2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph Attention Networks," 2017.
[3] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs," 2018.
Q.2: Are there consistent experimental results on datasets in other languages? The experimental conclusions should be verified on more languages.
A.2: We thank the reviewer for pointing this out. We performed multilingual experiments using six non-English UD datasets (UD_Arabic-PADT, UD_Chinese-GSD, UD_Italian-ISDT, UD_Japanese-GSD, UD_Spanish-AnCora, UD_Wolof-WTB) and verified that our findings are consistent, as shown in the table below.
| score norm | layers | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.538 | 0.395 | 0.563 | 0.493 | 0.554 | 0.252 |
| 0 | 1 | 0.723 | 0.653 | 0.792 | 0.812 | 0.775 | 0.525 |
| 0 | 2 | 0.745 | 0.710 | 0.826 | 0.844 | 0.807 | 0.587 |
| 0 | 3 | 0.748 | 0.717 | 0.832 | 0.849 | 0.808 | 0.614 |
| 1 | 0 | 0.609 | 0.479 | 0.633 | 0.585 | 0.620 | 0.305 |
| 1 | 1 | 0.737 | 0.693 | 0.820 | 0.838 | 0.801 | 0.556 |
| 1 | 2 | 0.758 | 0.736 | 0.842 | 0.859 | 0.822 | 0.613 |
| 1 | 3 | 0.759 | 0.742 | 0.845 | 0.859 | 0.823 | 0.633 |
Q.3: What are the experimental results if a normalization layer is added?
A.3: Experimental results on the effect of adding normalization (LayerNorm) layers were already present in the original manuscript. We refer the reviewer to Appendix B.5 for our analysis. In Table 8, we show that adding LayerNorm layers between the BiLSTM layers is detrimental on average, over all datasets.
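For concreteness, here is a sketch of the Appendix B.5 variant as we understand it, with LayerNorm interleaved between BiLSTM layers; module names and sizes are our illustrative assumptions.

```python
import torch.nn as nn

class NormedBiLSTMStack(nn.Module):
    """BiLSTM stack with optional LayerNorm between layers (assumes even d)."""

    def __init__(self, d: int, n_layers: int, inter_layer_norm: bool = True):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
            for _ in range(n_layers)
        )
        self.norms = nn.ModuleList(
            nn.LayerNorm(d) if inter_layer_norm else nn.Identity()
            for _ in range(n_layers - 1)
        )

    def forward(self, x):
        for i, lstm in enumerate(self.layers):
            x, _ = lstm(x)
            if i < len(self.norms):
                x = self.norms[i](x)  # normalizes hidden states, not scores
        return x
```

Note the contrast: this variant normalizes the hidden states between recurrent layers, whereas the paper's method rescales the biaffine scores themselves; per Table 8, the former hurts on average while the latter helps.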
Thank you for your detailed rebuttal. While I appreciate the additional experimentation you have done with GATs and the multilingual datasets, I maintain my original scores.
My main concern remains the limited theoretical contribution. While this work demonstrates that normalization of biaffine score functions can reduce over-parameterization, the finding is of limited scope. The core insight, that normalization helps handle high-variance inputs to the softmax, aligns with well-established principles in deep learning.
The expanded experiments strengthen the empirical validation, but they do not fundamentally change the nature of the contribution. The 85% parameter reduction is impressive, yet it primarily benefits a specific parsing architecture rather than providing broader insights applicable across different domains or tasks.
While the work is technically solid, I believe the contribution falls short of the significance threshold typically expected for acceptance.
This paper investigates the impact of score normalization in biaffine dependency parsing models. The work argues that current dependency parsers using biaffine scoring are over-parameterized because they do NOT apply the score normalization typically used in attention mechanisms in Transformers. The paper provides theoretical justification through implicit regularization theory, showing that deeper networks reduce weight-matrix rank and, consequently, score variance. It demonstrates that normalizing biaffine scores can achieve comparable or even better performance with up to 85% fewer parameters.
Strengths and Weaknesses
Strengths:
- The connection between Transformer attention normalization and biaffine scoring is well-motivated.
- The paper includes experiments on 6 diverse datasets, covering both semantic (ADE, CoNLL04, SciERC, ERFGC) and syntactic (enEWT, SciDTB) dependency parsing. It achieves similar performance with an 85% parameter reduction, and includes systematic exploration of different layer depths, hyperparameters, architectural components, etc.
Weaknesses:
- While Claim 1 provides intuition, the proof is somewhat informal. The connection between rank reduction and variance reduction could be more rigorously established.
- The normalization benefits vary significantly across datasets. In addition, the main focus of the paper on BiLSTM-based architectures may limit applicability to modern Transformer-based parsers.
Questions
No.
Limitations
The paper includes very limited discussion of limitations. I would suggest adding some, such as the potential negative impact on datasets where current deep architectures are well-tuned.
Formatting Issues
No.
Q.1: While Claim 1 provides intuition, the proof is somewhat informal. The connection between rank reduction and variance reduction could be more rigorously established.
A.1: We thank the reviewer for pointing out that our proof (and claim) could be made more rigorous. We have restated the claim and proof formally with a more standard analysis of covariance using the trace. Please see below:
Claim 1 (Monotonic Increase of Output Variance with Rank)
Let $x \in \mathbb{R}^d$ be a random vector with covariance matrix $\Sigma_x$, and let $W \in \mathbb{R}^{m \times d}$ be a fixed matrix with singular value decomposition (SVD)
$$W = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top,$$
where $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$ and $r = \operatorname{rank}(W)$. Define the best rank-$k$ approximation of $W$ by truncated SVD as
$$W_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top.$$
Let $y_k = W_k x$ denote the image of $x$ under the rank-$k$ linear map. Then the total variance of $y_k$, as measured by the trace of its covariance matrix, increases monotonically with $k$:
$$\operatorname{tr}(\operatorname{Cov}(y_k)) \leq \operatorname{tr}(\operatorname{Cov}(y_{k+1})), \qquad 1 \leq k < r.$$
Proof. Let $\Sigma_x$ denote the covariance matrix of the random vector $x$. Since covariance matrices are symmetric and positive semi-definite (PSD), we have $\Sigma_x \succeq 0$. Let $W_k$ be the rank-$k$ truncated SVD of $W$. Then the covariance matrix of $y_k = W_k x$ is
$$\operatorname{Cov}(y_k) = W_k \Sigma_x W_k^\top.$$
To evaluate the total variance we compute the trace:
$$\operatorname{tr}(\operatorname{Cov}(y_k)) = \operatorname{tr}(W_k \Sigma_x W_k^\top).$$
Using the cyclic property of the trace and the symmetry of $\Sigma_x$, we get
$$\operatorname{tr}(W_k \Sigma_x W_k^\top) = \operatorname{tr}(\Sigma_x W_k^\top W_k).$$
Now define the matrix $M_k = W_k^\top W_k = \sum_{i=1}^{k} \sigma_i^2 v_i v_i^\top$, which is PSD. As $k$ increases, $W_k$ includes more terms in its truncated SVD, so we have
$$M_{k+1} = M_k + \sigma_{k+1}^2 v_{k+1} v_{k+1}^\top.$$
Thus:
$$M_{k+1} \succeq M_k.$$
Since $\Sigma_x \succeq 0$, and since the trace of a product of PSD matrices respects the Loewner order (i.e., if $A \succeq B$, then $\operatorname{tr}(CA) \geq \operatorname{tr}(CB)$ for all $C \succeq 0$), we conclude:
$$\operatorname{tr}(\Sigma_x M_{k+1}) \geq \operatorname{tr}(\Sigma_x M_k),$$
which implies:
$$\operatorname{tr}(\operatorname{Cov}(y_{k+1})) \geq \operatorname{tr}(\operatorname{Cov}(y_k)).$$
Therefore, as the rank $k$ increases, the total variance increases. $\square$
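As a quick sanity check of the restated claim (our illustration, not part of the paper), the following NumPy snippet verifies numerically that $\operatorname{tr}(\operatorname{Cov}(W_k x))$ is non-decreasing in the truncation rank $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_samples = 16, 12, 100_000

W = rng.normal(size=(m, d))
X = rng.normal(size=(n_samples, d)) @ rng.normal(size=(d, d))  # correlated x
Sigma_x = np.cov(X, rowvar=False)                              # empirical covariance

U, s, Vt = np.linalg.svd(W, full_matrices=False)
total_var = []
for k in range(1, len(s) + 1):
    W_k = (U[:, :k] * s[:k]) @ Vt[:k]          # best rank-k approximation of W
    total_var.append(np.trace(W_k @ Sigma_x @ W_k.T))

# Total output variance grows monotonically with the rank k, as claimed.
assert all(a <= b + 1e-9 for a, b in zip(total_var, total_var[1:]))
print(np.round(total_var, 2))
```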
Q.2.1: The normalization benefits vary significantly across datasets.
A.2.1: Thank you for raising this point. In order to show the generalizability of our results, we have carried out six additional experiments on multilingual syntactic dependency parsing, showing consistent results for a wide variety of languages.
| score norm | layers | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.538 | 0.395 | 0.563 | 0.493 | 0.554 | 0.252 |
| 0 | 1 | 0.723 | 0.653 | 0.792 | 0.812 | 0.775 | 0.525 |
| 0 | 2 | 0.745 | 0.710 | 0.826 | 0.844 | 0.807 | 0.587 |
| 0 | 3 | 0.748 | 0.717 | 0.832 | 0.849 | 0.808 | 0.614 |
| 1 | 0 | 0.609 | 0.479 | 0.633 | 0.585 | 0.620 | 0.305 |
| 1 | 1 | 0.737 | 0.693 | 0.820 | 0.838 | 0.801 | 0.556 |
| 1 | 2 | 0.758 | 0.736 | 0.842 | 0.859 | 0.822 | 0.613 |
| 1 | 3 | 0.759 | 0.742 | 0.845 | 0.859 | 0.823 | 0.633 |
Furthermore, we believe it is important to take into account that we found the difference in performance to be statistically significant on all datasets, as shown in Figure 3.
Q.2.2: In addition, the main focus of the paper on BiLSTM-based architectures may limit applicability to modern Transformer-based parsers.
A.2.2: We would like to point out that the parser with which we experiment is Transformer-based, with the Dozat and Manning biaffine layer still being the standard for dependency parsing. Even without using BiLSTM layers and fully fine-tuning the Transformer, we have found that normalization provides statistically significant benefits, as shown in Figure 4 for the SciERC dataset. In Appendix B.6, we also show this more generally on all four SemDP datasets with BERT-base, BERT-large, DeBERTa-base, and DeBERTa-large. The effect is particularly pronounced for DeBERTa-base.
This work presents theoretical evidence and empirical results for the influence of scaling the scores produced by biaffine transformations in dependency parsing tasks. The authors find that the score variance produced by a lack of score scaling hurts model performance when predicting edges and relations. Experimental results demonstrate that dependency parsing models equipped with normalization can obtain better performance with substantially fewer trained parameters.
Strengths and Weaknesses
Strengths:
- The biaffine score function is a common, default component in dependency parsing. The authors are careful and aware of a potential problem in it, and analyze the relation between normalization and parameter count from theoretical and experimental perspectives.
- This paper conducts extensive experiments to verify its claim.
Weaknesses:
- I wonder whether this finding is still effective in other languages, especially low-resource languages.
- This research question is too minor to have a significant impact on the broad research community. I think the main research content of this paper is not suitable for publication in NeurIPS. I recommend that the authors consider more targeted venues, so that researchers in the relevant sub-community can engage with this work.
- Please change eh, ed, rh, and rd in Figure 2 to mathematical vector format (i.e., $\mathbf{e}_h$, $\mathbf{e}_d$, $\mathbf{r}_h$, and $\mathbf{r}_d$), so they are consistent with the text and equations.
Questions
See the weaknesses
Limitations
yes
Final Justification
The work lags behind the current state of the art in dependency parsing.
Formatting Issues
none
Q.1: I wonder whether this finding is still effective in other languages, especially low-resource languages.
A.1: We thank the reviewer for suggesting multilingual settings. We have run the model on six non-English UD datasets (UD_Arabic-PADT, UD_Chinese-GSD, UD_Italian-ISDT, UD_Japanese-GSD, UD_Spanish-AnCora, UD_Wolof-WTB) and verified that our findings are consistent. The multilingual results are summarized below; we have added them to the paper.
| score norm | layers | UD_Arabic-PADT | UD_Chinese-GSD | UD_Italian-ISDT | UD_Japanese-GSD | UD_Spanish-AnCora | UD_Wolof-WTB |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.538 | 0.395 | 0.563 | 0.493 | 0.554 | 0.252 |
| 0 | 1 | 0.723 | 0.653 | 0.792 | 0.812 | 0.775 | 0.525 |
| 0 | 2 | 0.745 | 0.710 | 0.826 | 0.844 | 0.807 | 0.587 |
| 0 | 3 | 0.748 | 0.717 | 0.832 | 0.849 | 0.808 | 0.614 |
| 1 | 0 | 0.609 | 0.479 | 0.633 | 0.585 | 0.620 | 0.305 |
| 1 | 1 | 0.737 | 0.693 | 0.820 | 0.838 | 0.801 | 0.556 |
| 1 | 2 | 0.758 | 0.736 | 0.842 | 0.859 | 0.822 | 0.613 |
| 1 | 3 | 0.759 | 0.742 | 0.845 | 0.859 | 0.823 | 0.633 |
Q.2: This research question is too minor to have a significant impact on the broad research community. I think the main research content of this paper is not suitable for publication in NeurIPS. I recommend that the authors consider more targeted venues, so that researchers in the relevant sub-community can engage with this work.
A.2: While it is true that semantic and syntactic dependency parsing are specific to communities in NLP, our paper contributes towards analyzing normalization phenomena in architectures used broadly in graph inference tasks (modeling edges as fully connected weighted graphs via attention mechanisms such as biaffine transformations, see [1,2,3] for reference). To further demonstrate this, we have conducted an experiment using Graph Attention Networks (GAT) by adding a stack of biaffine/GAT pairs. The purpose of this is to encode higher-order dependencies before the final biaffine layer. The results with this new architecture are consistent, as shown in the table below.
| norm | GAT layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.526 | 0.517 | 0.123 | 0.610 | 0.727 | 0.536 |
| 0 | 1 | 0.493 | 0.509 | 0.039 | 0.583 | 0.731 | 0.599 |
| 0 | 2 | 0.485 | 0.509 | 0.029 | 0.540 | 0.710 | 0.611 |
| 0 | 3 | 0.481 | 0.487 | 0.000 | 0.511 | 0.700 | 0.585 |
| 1 | 0 | 0.574 | 0.556 | 0.156 | 0.671 | 0.778 | 0.607 |
| 1 | 1 | 0.550 | 0.587 | 0.148 | 0.696 | 0.809 | 0.639 |
| 1 | 2 | 0.563 | 0.534 | 0.128 | 0.657 | 0.795 | 0.637 |
| 1 | 3 | 0.514 | 0.543 | 0.071 | 0.596 | 0.770 | 0.616 |
Similar to BiLSTMs, normalization improves the scores in the case of GATs as well, showing that our method generalizes to other architectures besides Transformers and BiLSTMs. We have included these results in the paper.
In addition to our technical contribution, we believe that our work benefits the broad NeurIPS community by highlighting the importance of often-overlooked settings, such as normalization, in the training of neural network models. Moreover, we would like to respectfully point out that the NeurIPS acceptance guidelines describe a score of 5 as a work with "high impact on at least one sub-area of AI", meaning that broad impact is not strictly required for acceptance. We appreciate the reviewer's consideration in this regard.
[1] Kazi et al., "Differentiable Graph Module (DGM) for Graph Convolutional Networks," 2023.
[2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph Attention Networks," 2017.
[3] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs," 2018.
Q.3: Please change eh, ed, rh, and rd in Figure 2 to mathematical vector format (i.e., $\mathbf{e}_h$, $\mathbf{e}_d$, $\mathbf{r}_h$, and $\mathbf{r}_d$), so they are consistent with the text and equations.
A.3: We have amended the formatting oversight as requested.
It seems that normalizing biaffine scores for BiLSTMs gives a significant improvement in terms of performance and parameter efficiency. This claim is supported by running the dependency parsing task on multiple datasets, with several ablation studies. A possible reason for this is also discussed in Section 3.
Strengths and Weaknesses
Strengths
- one possible cause of this result is discussed sufficiently. Claim: more layers may keep the variance in check, so normalization is not much needed, as small singular values decay.
- many datasets utilized, covering dependency parsing of different kinds (recipes, news, drug adverse effects, etc.)
- normalizing shown to help with thorough experiments, with significance of outcomes measured satisfactorily
- parameter efficiency in one particular case shown to be up to 85% better with normalization, where one normalized layer is enough to match the performance of three layers without normalization
Weaknesses
- for now, the effect is only shown for dependency parsing tasks
- while a possible reason for the raw version performing okay with a higher number of layers is discussed, it is not known why in some cases the performance of the normalized version drops and exhibits high variance when the number of layers is increased (e.g. Appendix, line 489) [See Question related to this as well]
- the gain for tagging tasks is not that significant, but this is adequately discussed
Questions
- We see in some cases, e.g. in Figure 5, that the normalized version seems to drop in performance compared to the raw version as the number of layers is increased. Do you have any guesses as to why that might be?
Limitations
We don't see a separate section for limitations, but in my opinion limitations are adequately and honestly discussed. Some with other text (e.g. size of dataset and kind of dataset when introduced), some in conclusions (last paragraph), and some in appendix.
Final Justification
I thank authors for additional information provided.
I have read other reviews and I see the main concerns, and here are my thoughts about them:
- Results on other languages e.g. low resource ones
Although the authors provide some results on more languages, I believe that is a separate question. The paper is a good enough contribution even if only for high-resource languages.
- The contribution being simple
Yes perhaps, it is quite simple. I believe that is not grounds for rejection. And simple solutions should be published to benefit the wider ML / NLP community.
- The contribution being too niche and perhaps better suited for more niche venues
Maybe. I am not sure whether it falls outside the scope of NeurIPS.
Formatting Issues
- See line 38: you probably meant to write Std([something])?
- Line 208 says "UD". The terms UD and enEWT are sometimes used interchangeably; the table referred to uses enEWT.
- It might be better for some graphs, if possible, to use a false origin on the y-axis for better readability, e.g. in Figure 7. This is only a suggestion, if easily possible.
We thank the reviewer for their insightful review.
Q.1: for now, the effect is only shown for dependency parsing tasks
A.1: We acknowledge that our work targets semantic and syntactic dependency parsing tasks only. This was purposeful, because this variant (Dozat and Manning) of biaffine scoring is widespread on these tasks in particular. While we plan to extend our normalization analysis to graph inference broadly (e.g. using QM9 for small molecules, or citation networks), we believe that the current work presents a solid and focused contribution using one of the most common and SOTA architectures. Moreover, extending our analysis to non-linguistic tasks would involve many architectural variants (e.g. GNN types) that we feel are necessary to cover in a separate work built on the contribution of the current one. As an initial step towards using Graph Attention Networks (GAT), we have repeated the experiment by adding a stack of biaffine/GAT pairs to encode higher-order dependencies before the final biaffine layer. We achieve consistent results, as shown in the table below. Performance increases with one biaffine/GAT pair compared to none, but drops again for deeper stacks:
| norm | GAT layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 0 | 0.526 | 0.517 | 0.123 | 0.610 | 0.727 | 0.536 |
| n | 1 | 0.493 | 0.509 | 0.039 | 0.583 | 0.731 | 0.599 |
| n | 2 | 0.485 | 0.509 | 0.029 | 0.540 | 0.710 | 0.611 |
| n | 3 | 0.481 | 0.487 | 0.000 | 0.511 | 0.700 | 0.585 |
| y | 0 | 0.574 | 0.556 | 0.156 | 0.671 | 0.778 | 0.607 |
| y | 1 | 0.550 | 0.587 | 0.148 | 0.696 | 0.809 | 0.639 |
| y | 2 | 0.563 | 0.534 | 0.128 | 0.657 | 0.795 | 0.637 |
| y | 3 | 0.514 | 0.543 | 0.071 | 0.596 | 0.770 | 0.616 |
The normalization improves the scores in the case of GATs as well, showing that our method generalizes to other architectures besides Transformers and BiLSTMs. We have included these results in the paper.
Q.2: while a possible reason for the raw version performing okay with a higher number of layers is discussed, it is not known why in some cases the performance of the normalized version drops and exhibits high variance when the number of layers is increased
A.2: We have repeated the experiment with the deeper stacks over three seeds and increased the number of training steps to 20,000. All of the models made it to at least 6k steps (but not all of them made it to 8k or further, due to early stopping). We show the results at 2k, 4k, and 6k steps below (rows with no high variances removed due to the character limit):
2000 steps:
| norm | layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.606 | 0.731 | 0.260 | 0.811 | 0.898 | 0.698 |
| n | 8 | 0.594 | 0.712 | 0.248 | 0.783 | 0.882 | 0.692 |
| n | 10 | 0.565 | 0.701 | 0.233 | 0.718 | 0.839 | 0.441 |
| y | 9 | 0.215 | 0.229 | 0.260 | 0.791 | 0.883 | 0.450 |
| y | 10 | 0.386 | 0.622 | 0.168 | 0.769 | 0.879 | 0.648 |
4000 steps:
| norm | layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.551 | 0.719 | 0.284 | 0.833 | 0.907 | 0.700 |
| n | 8 | 0.507 | 0.711 | 0.271 | 0.813 | 0.895 | 0.694 |
| n | 10 | 0.588 | 0.701 | 0.262 | 0.755 | 0.865 | 0.442 |
| y | 9 | 0.279 | 0.244 | 0.289 | 0.813 | 0.894 | 0.457 |
| y | 10 | 0.399 | 0.651 | 0.165 | 0.797 | 0.890 | 0.656 |
6000 steps:
| norm | layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.581 | 0.712 | 0.307 | 0.846 | 0.910 | 0.702 |
| n | 8 | 0.466 | 0.708 | 0.292 | 0.829 | 0.901 | 0.695 |
| n | 10 | 0.592 | 0.549 | 0.288 | 0.779 | 0.877 | 0.448 |
| y | 9 | 0.329 | 0.242 | 0.310 | 0.827 | 0.899 | 0.455 |
| y | 10 | 0.410 | 0.669 | 0.187 | 0.813 | 0.896 | 0.658 |
The results show that once a model fails to find a good trajectory at the start, it likely remains on a poor trajectory. For example, the model with 8 layers and norm = n does fine on CoNLL04 at 2,000 steps; however, one of the seeds then fails randomly at 4,000 steps and the performance remains poor past that point. For ADE, we can see that model performance with 10 layers and norm = n is fine at 2k and 4k steps, but suddenly drops with higher variance at 6k steps. Once again, this is because the model suddenly breaks and then remains broken for a specific seed. The table below does not show some of the poor performances because it reports the results of each seed at the best validation performance, up to 20k steps:
| norm | layers | CoNLL04 | ADE | SciERC | enEWT | SciDTB | ERFGC |
|---|---|---|---|---|---|---|---|
| n | 7 | 0.668 | 0.722 | 0.380 | 0.894 | 0.929 | 0.706 |
| n | 8 | 0.625 | 0.707 | 0.392 | 0.893 | 0.926 | 0.699 |
| n | 10 | 0.635 | 0.720 | 0.373 | 0.879 | 0.917 | 0.454 |
| y | 9 | 0.454 | 0.511 | 0.384 | 0.880 | 0.916 | 0.457 |
| y | 10 | 0.478 | 0.695 | 0.237 | 0.881 | 0.911 | 0.662 |
Despite the random crashes, increasing the number of steps increased the performance and reduced the Std for deeper stacks, compared to the original manuscript. While we still see instability for CoNLL04 and SciERC, it is reduced compared to the results shown in the original manuscript in Figure 5. Furthermore, the high variance almost disappears for ADE and ERFGC when training for a higher number of steps. Even more importantly, the results are identical for enEWT and SciDTB, with and without normalization, hinting at the fact that this is not inherently a problem of the model, but has to do with the nature of the datasets. More specifically, we also see some instability for ERFGC at 10 layers without score normalization, and at 9 layers with score normalization. Similarly high variance can be seen for ADE with normalization at 9 layers (but not 10). Given that the high variances are produced by a few zero or near-zero scores, the instability seems to mostly be a product of the higher number of layers.
We have also analyzed the behavior of the gradients w.r.t. the depth of the layers. For SciERC (highest Std at 10 layers in the original manuscript), the table below shows the head and tail, over 14k steps, of the gradient norm for each of the 10 layers:
| step | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.002 | 0.004 | 0.008 | 0.020 | 0.048 | 0.115 | 0.279 | 0.638 | 1.607 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13999 | 0.070 | 0.039 | 0.018 | 0.014 | 0.013 | 0.010 | 0.012 | 0.013 | 0.036 | 0.072 |
Since the gradient norms even out, we reckon they are not the issue at hand here. It is likely that the higher parameter count of 10 layers simply necessitates more training steps.
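For reference, a hedged sketch of how such per-layer gradient norms can be collected in PyTorch (our illustration of the analysis, not the authors' script; the parameter-name pattern is an assumption):

```python
import torch

def layer_grad_norms(model: torch.nn.Module, n_layers: int) -> list[float]:
    """L2 norm of the gradient of each stacked layer, taken after loss.backward().

    Assumes (hypothetically) that the layer index appears in parameter
    names as 'layers.<i>.'; adapt the pattern to the actual module layout.
    """
    norms = []
    for i in range(n_layers):
        sq = sum(
            p.grad.pow(2).sum().item()
            for name, p in model.named_parameters()
            if f"layers.{i}." in name and p.grad is not None
        )
        norms.append(sq ** 0.5)
    return norms
```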
Formatting:
We commit to fixing the issues raised by the reviewer w.r.t. line 38 and line 208, and to improving the readability of Figs. 6 and 7.
The paper presents results on link prediction, in particular for dependency parsing, after implementing score normalisation. The authors motivate the normalisation theoretically, based on the architecture, and exchange a corresponding module and its required parameters, arguing that the lack of normalisation resulted in overparameterisation. In the rebuttal, the authors showed that their trade-off is also valid for other (non-linguistic) tasks. While the paper is heavily motivated by the dependency parsing task, it is clear that the observations are about network topology and not limited to this task. While the paper aligns with well-established principles in deep learning, the reviewers do not point out any redundancy in this particular finding. The task itself, dependency parsing, is central to NLP. For researchers without access to compute resources beyond a laptop, an 85% reduction in parameters is of central importance.