PaperHub
Rating: 6.6/10 (Poster · 5 reviewers · min 3, max 8, std 2.0)
Individual ratings: 6, 8, 3, 8, 8
Average confidence: 4.0
ICLR 2024

Ensemble Distillation for Unsupervised Constituency Parsing

OpenReview | PDF
Submitted: 2023-09-23 · Updated: 2024-03-09
TL;DR

The paper proposes an ensemble method and multi-teacher distillation approach for unsupervised constituency parsing, demonstrating robustness and effectiveness.

Abstract

Keywords
Constituency Parsing · Unsupervised Grammar Induction · Knowledge Distillation

Reviews & Discussion

Review
Rating: 6

The paper considers ensembling unsupervised (unlabeled, binary) constituency parsers. The main technical piece is an MBR decoding algorithm that finds the highest-F1 tree with respect to a set of candidate trees. Experiments show that this ensembling method is more effective than simply selecting a max-intra-F1 candidate tree or training a student model on the union of candidate trees (union distillation).
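For intuition, here is a minimal sketch of the hit-count view behind this MBR objective (our illustration, not the paper's code; spans are assumed to be end-exclusive (start, end) word-index pairs). Each candidate tree votes for its spans, and since every binary tree over n words contributes exactly n - 1 nontrivial spans, the tree with the highest total vote count also has the highest average F1 against the candidates.

```python
from collections import Counter

def hit_counts(candidate_trees):
    """Tally how often each (start, end) span occurs across candidate parses."""
    counts = Counter()
    for spans in candidate_trees:  # each parse is a set of (i, j) spans
        counts.update(spans)
    return counts

# Three hypothetical binary parses of a 4-word sentence; each contributes
# n - 1 = 3 nontrivial spans, so the total hit count tracks average F1.
t1 = {(0, 4), (0, 2), (2, 4)}   # ((w0 w1) (w2 w3))
t2 = {(0, 4), (0, 3), (0, 2)}   # (((w0 w1) w2) w3)
t3 = {(0, 4), (0, 3), (1, 3)}   # ((w0 (w1 w2)) w3)
print(hit_counts([t1, t2, t3]))
# (0, 4) gets 3 votes; (0, 2) and (0, 3) get 2; (2, 4) and (1, 3) get 1.
```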

Strengths

  • Simple but insightful observation that it is possible to "generate" a tree from candidate trees
  • Derivation of the hit-count maximization algorithm
  • Strong results

Weaknesses

  • Somewhat narrow scope (ensembling constituency trees)
  • Some details missing (see the questions)

Questions

Would "Our ensemble (X teacher across runs)" in Table 2 use candidate trees from different models, or do they all end up from the same model (e.g., ConTest for X="best")? It's good to know the answer to this question because one of the main claims of the paper is that it's important to exploit the large qualitative differences between different methods (Table 1). The fact that we get the best result by just using outputs from the same model seems to refute that claim (i.e., it's all variance reduction).

Comment

Thanks for the review.

Weakness 1: Somewhat narrow scope (ensembling constituency trees)

Unsupervised constituency parsing, in fact, fits ICLR themes well, as it concerns learning the representations of human language. In the literature, a number of impactful papers were published in previous editions of ICLR, such as Shen et al., 2018, 2019.

In addition, our study has general machine learning implications, especially for multi-teacher distillation. It shows that straightforward union distillation from multiple teachers (as was done in previous work) may not yield improved performance, whereas our ensemble-then-distill approach is able to alleviate the over-smoothing issue of traditional methods. We’re happy to explore multi-teacher distillation for different data types (such as sequences and graphs), as mentioned in our Future Work section.

Weakness 2: Some details missing (see the questions)

Questions: Would "Our ensemble (X teacher across runs)" in Table 2 use candidate trees from different models, or do they all end up from the same model (e.g., ConTest for X="best")? It's good to know the answer to this question because one of the main claims of the paper is that it's important to exploit the large qualitative differences between different methods (Table 1). The fact that we get the best result by just using outputs from the same model seems to refute that claim (i.e., it's all variance reduction).

Thanks for the detailed question. By “best/worst teacher across runs,” we mean that we still choose exactly one run per model, but the chosen run need not be the same across models. A simple example:

Model1: Run1=85, Run2=90, Run3=95

Model2: Run1=65, Run2=60, Run3=55

Then, the best teachers across runs will be Model1-Run3 and Model2-Run1. The worst teachers across runs will be Model1-Run1 and Model2-Run3.
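To make the selection rule concrete, a tiny sketch with hypothetical scores mirroring the example above (variable names are ours, purely for illustration):

```python
# Hypothetical run-level F1 scores: two models, three runs each.
scores = {
    "Model1": {"Run1": 85, "Run2": 90, "Run3": 95},
    "Model2": {"Run1": 65, "Run2": 60, "Run3": 55},
}

# Pick each model's best (or worst) run independently of the other models.
best = {m: max(runs, key=runs.get) for m, runs in scores.items()}
worst = {m: min(runs, key=runs.get) for m, runs in scores.items()}
print(best)   # {'Model1': 'Run3', 'Model2': 'Run1'}
print(worst)  # {'Model1': 'Run1', 'Model2': 'Run3'}
```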

We’ve clarified this in the revision (third paragraph in Sec 3.3).

Review
Rating: 8

This paper proposes a consistency-based decoding method for unsupervised constituency parsing, which can also be formulated as minimum Bayes risk decoding. Experiments demonstrate significant improvement over existing methods.

Strengths

  • Significant contribution to unsupervised constituency parsing, including a generative MBR process and consistently improved results.

  • Comprehensive analyses are carefully conducted and presented.

  • The paper is very well written and easy to follow.

I also wanted to note that the statements and conclusions in this paper are remarkably honest. Most of the claims are very well supported by related work and experimental results. While this is generally not considered a strength, such a presentation style should be commended and encouraged in recent years.

Weaknesses

  • MBR-style or consistency-based methods have also been applied to parsing for decoding or model selection, but the authors failed to recognize and discuss them. To name a few: Smith and Smith (2007) and Zhang et al. (2020) used MBR decoding to improve dependency parsing; Shi et al. (2019) adapted an agreement-based model selection process for distantly supervised constituency parsing.

  • The motivation is not completely convincing. Recent trends in NLP demonstrate that explicit parses might not be crucial or even necessary for many user-facing applications (e.g., GPT models do not really use explicit language structures), which contradicts the first sentence of this paper (“Constituency parsing is a core task in natural language processing”). Traditionally, such structures served as a backbone for many NLP models, and their prediction was therefore referred to as core NLP. I am not sure parsing is as important as this paper suggests. Please consider revising or including more justification.

Minor points on weaknesses

  • The references in this paper, especially to conventional linguistic literature, need some work:

    • Section 1: Carnie (2007) and Fromkin et al. (2003) are introductory books. Both should be changed to Chomsky (1957), Syntactic Structures.
    • Spitkovsky et al. (2013) is worth a mention in the related work section.
  • Page 1: low correlation among different unsupervised parsers: Williams et al. (2018) discovered a similar issue within the same model architectures (Table 1, right settings). This is worth a discussion.

  • Not really a weakness for a machine learning conference submission: since the topic of this paper is highly linguistic, I would like to see a detailed analysis of what patterns are fixed. For example, do NPs/PPs with rare words receive more fixes than those with frequent words, or the opposite, or is the difference not significant? Do VPs with transitive verb heads receive more fixes than those with intransitive verb heads, or the opposite, or is it insignificant? Does the student extract any constituent that does not receive any vote from the teacher models, due to fixes on shorter spans and CYK?

I am being conservative in my initial evaluation and am happy to increase my rating if most of the above issues are fixed.

Questions

  • Which split of PTB did you use to generate the statistics in Table 1?
  • Table 2: why is the +RNNG/+URNNG oracle performance different from the basic one (83.3)?
Comment

We thank the reviewer for the support, especially for recognizing the scientific rigor of our study by mentioning “the statements and conclusions in this paper are remarkably honest. Most of the claims are very well supported by related work and experimental results.”

Weakness 1 (MBR-style or consistency-based literature review)

Thanks for sharing the literature. We have included the suggested references, as well as some of their references, in the revision.

Weakness 2 (Motivation of core NLP)

We’d like to clarify that, by saying “a core task,” we mean parsing has traditionally been referred to as a core NLP task. In the revision, we call parsing “a well-established task.”

Nevertheless, we would like to point out the significance of unsupervised parsing as a curious task of language structure discovery. It verifies linguistic theory, showing that linguistically defined constituents can naturally emerge in an unsupervised way. In our related work, we also discussed how unsupervised parsing may inspire the structure discovery of motion-sensor data (Peng et al., 2011).

Minor points on weaknesses 1 (conventional linguistic literature references)

We’re grateful for the suggestions and have included them in the revision.

Minor points on weaknesses 2 (Related work about the low correlation discovery)

Thanks for the suggestion. The mentioned work is discussed and cited in the revision. Williams et al., 2018 showed low correlations among early latent-tree models that were claimed to induce task-specific tree structures, whereas our paper shows low correlation among unsupervised parsing models that are aimed at discovering linguistically plausible tree structures. Further, our paper shows that such low correlation suggests different expertise of unsupervised parsers, which can be utilized by model ensembles for performance improvement.

Minor points on weaknesses 3: Detailed linguistic analysis of what patterns are fixed

In the performance-by-type analysis, we’ve shown that our approach achieves consistently high performance across all constituency types. We’re further presenting a case study in the appendix, suggesting that our ensemble indeed performs voting for local structures and fixes teachers’ predictions.

Question 1: Which split of PTB did you use to generate the statistics in Table 1?

We used the standard split, namely, Section 23 of the PTB for testing.

Question 2: why is the +RNNG/+URNNG oracle performance different from the basic one (83.3)?

In Table 2, the +RNNG column of the Oracle row (Row 15) is, in fact, a supervised RNNG trained on the binarized PTB training set (binarized ground truth). +URNNG further tunes this RNNG in an unsupervised manner. Their performance is lower than the Oracle, which is the performance upper bound.

Review
Rating: 3

This paper proposes a novel strategy following an ensemble-then-distill paradigm for the unsupervised constituency parsing task, which aims to hierarchically structure sentences without relying on linguistically annotated data. The proposed approach first ensembles existing unsupervised parsers based on the notion of “tree averaging” and then conducts distillation to create a student model. This technique efficiently alleviates the over-smoothing issue that frequently arises in multi-teacher distillation. Experimental results indicate that the ensemble-then-distill method outperforms existing approaches in both effectiveness and robustness.

Strengths

The major contributions include:

  1.  A new notion of tree averaging and a corresponding search algorithm (a CYK variant).

  2.  An ensemble-then-distill approach that trains a student parser from an ensemble of teachers.

  3.  Student-model inference that is 18x faster than the ensemble method.

  4.  A hypothesis that different unsupervised parsers capture different aspects of language structure, together with its experimental verification.

Weaknesses

  1. Lack of clarification regarding the methodology design.
  • The average tree is derived as the one with the highest total F1 score against the different teachers. Have the authors tried other ways of calculating the similarity between trees? Perhaps a fair comparison is needed to further indicate the effectiveness of the proposed tree-averaging method.
  • The authors did not provide a detailed explanation for choosing the seven unsupervised parsers introduced in Section 3.2 as teacher models. For instance, why did the authors select ContextDistort as one of the teachers despite its relatively inferior performance and inference efficiency?
  2. The authors state in the Introduction that combining different parsers may leverage their different expertise. The authors attempt to verify this statement in Section 3.4 by comparing two settings: an ensemble of three runs of the same model and an ensemble of three heterogeneous models. I wonder if it would be more appropriate to choose the model with the highest performance in the former setting, so that an additional boost due to different expertise could be further validated (e.g., Neural PCFG for Group 2).
  3. The writing can be improved. There are some typos and unclear descriptions. Please refer to the comments for details.

Comments

  1. Minor comments on writing: (1) Paragraph #1 in Introduction: ...to explore unsupervised methods as it eliminates... -> ...to explore unsupervised methods as they eliminate... (2) Paragraph #1 in Section 3.1: ... on the widely used the Penn Treebank -> ... on the widely-used Penn Treebank

Questions

Despite these merits, there are some points that need further clarification, and some suggestions.

  1.  The ensemble method demonstrates its effectiveness on PTB. However, by 2020, the F1 score of the CRF parser on PTB variants had already been above 90 (Zhang et al., 2020). There is indeed a performance boost in comparison to the oracle score (the highest possible F1 score of binarized ground-truth trees). But it is not appropriate to claim “largely bridging the gap between supervised and unsupervised constituency parsing” on Page 6.
    
  2.  In Results on SUSANNE, the authors claim that “This is a realistic experiment to examine the models’ performance in an unseen low-resource domain.” However, SUSANNE is an English dataset. In the CoNLL Shared Task, there are treebanks for low-resource languages (Zeman et al., 2017). It would be better if the authors could demonstrate the effectiveness of the approach on some of these low-resource language datasets.
    

  3.  In Table 3 on Page 7, a PTB-supervised model is used for comparison. Which PTB-supervised model is used?

References:

Zeman, D., Popel, M., Straka, M., Hajic, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., Tyers, F., Badmaeva, E., Gokirmak, M., Nedoluzhko, A., Cinkova, S., Hajic Jr., J., Hlavacova, J., Kettnerová, V., Uresova, Z., … Li, J. (2017). CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 1–19. https://doi.org/10.18653/v1/K17-3001

Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 4046–4053. https://doi.org/10.24963/ijcai.2020/560
    
Comment

We thank the reviewer for recognizing a number of our key contributions, including a new notion of tree averaging, the ensemble-then-distill approach, the efficient inference, and an intriguing phenomenon.

Weakness 1. Lack of clarification regarding the methodology design

  • (F1 score) We chose the F1 score because F1 is the standard measure for constituency parsing, as explicitly mentioned in the second paragraph of Sec 2.2. Therefore, our similarity score is well motivated. We would be grateful if the reviewer could recommend alternatives that are better justified. (A minimal sketch of this span-F1 similarity appears after this list.)

  • (Choice of teacher models) We chose the seven teachers because they are classic or state-of-the-art unsupervised parsers (mentioned in Sec. 3.2, first line) with publicly available code. Specifically, we chose ContextDistort because it is a very recent study (ACL2023) and largely differs from previous approaches in its methodology. Note that in the analysis of the number of teachers (Fig. 2), we have experimented with various combinations of the teachers, where the results are consistent.
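For concreteness, here is a minimal sketch of the unlabeled span-F1 similarity referenced above (the set-of-spans representation and the function name are our assumptions for illustration, not the paper's exact code):

```python
def span_f1(pred_spans, gold_spans):
    """Unlabeled span F1 between two parses, each given as a set of (i, j) spans."""
    if not pred_spans or not gold_spans:
        return 0.0
    overlap = len(pred_spans & gold_spans)
    precision = overlap / len(pred_spans)
    recall = overlap / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Any alternative span-overlap measure the reviewer has in mind could be swapped in at this point for the fair comparison requested.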

Weakness 2: “I wonder if it is more appropriate to choose the model with highest performance in the former setting so that it can be further validated there exists additional boost due to different expertise (e.g., Neural PCFG for Group 2).”

In Figure 1, four of the seven groups (namely, Groups 1, 4, 6, and 7) meet the criterion that the reviewer mentioned. Hence, it is possible to draw the conclusion that the reviewer is seeking from a subset of our experimentation. Moreover, we do not believe it is essential to use the best-performing model of the latter setting for the multiple runs in the former, control setting: as we can see from the figure, the aforementioned four groups do not behave differently from the rest.

Weakness 3 (two typos in the paper)

Thanks for catching two typos in our paper! We’ve fixed them in the revision.

Question 1: The ensemble method demonstrates its effectiveness on PTB. However, by 2020, the F1 score of the CRF parser on PTB variants had already been above 90 (Zhang et al., 2020). There is indeed a performance boost in comparison to the oracle score (the highest possible F1 score of binarized ground-truth trees). But it is not appropriate to claim “largely bridging the gap between supervised and unsupervised constituency parsing” on Page 6.

Thanks for pointing this out. We have revised our claim as “largely bridging the gap between supervised and unsupervised binary constituency parsing”.

Question 2: In Results on SUSANNE, the authors claim that “This is a realistic experiment to examine the models’ performance in an unseen low-resource domain.” However, SUSANNE is an English dataset. In the CoNLL Shared Task, there are treebanks for low-resource languages (Zeman et al., 2017). It would be better if the authors could demonstrate the effectiveness of the approach on some of these low-resource language datasets.

Here, our claim about the SUSANNE experiment is domain shift, instead of low-resource language. As shown in Table 3, models trained on PTB underperform in other domains in the same language, showing that unsupervised parsing remains challenging in low-resource domains, even for English.

As mentioned in the response to Reviewer yYrJ, a key difficulty of a non-English ensemble is finding unsupervised teacher parsers, which are less studied in the literature. We’re happy to address this direction as future work.

Question 3: In Table 3 on Page 7, PTB-supervised model is used for comparison. Which PTB-supervised model is used?

The PTB-supervised models are the RNNG and URNNG, shown by the column headers.


Overall, the reviewer identified a number of important contributions in our work, but gave an outrageous score of 3, missing the key rationales of our model design and key experimental results that are already mentioned in the paper, and overly emphasizing two typos as a weakness of the writing. By contrast, our contributions, thorough experimentation, and clear writing are well recognized by all the other reviewers.

In our author response, we have answered all the questions raised by this reviewer. We urge the reviewer to revisit our paper and adjust the score accordingly.

Review
Rating: 8

This paper proposes an ensembling approach for the task of unsupervised constituency parsing. Having trained a group of various prior models, the authors use a dynamic program to find the “average tree” of their predictions. While this approach works well on its own, it can be further used for distillation into an RNNG, which runs more efficiently and is, in some settings, more accurate. They also analyze to what extent the improvements in performance from ensembling are down to smoothing vs. combining expertise.

Strengths

  • The insight that different models are weakly correlated despite similar F1 is interesting and well motivates the approach
  • The proposed dynamic program is intuitive and well explained
  • Experiments are strong and thorough
  • The analysis of gains from denoising vs. difference in expertise is well conducted

Weaknesses

  • The fact that the F1 gains from distillation do not carry over to the out-of-domain setting is a drawback and somewhat underexplored
  • There is a lack of qualitative analysis of the types of behaviors that different model types exhibit, and of how ensembling actually combines them. Some of this is done in the Appendix, but it would be nice to see specific examples in the main paper, especially since that analysis is with respect to constituency labels, which the model isn't actually being evaluated on.

Questions

  • How does regular RNNG perform, and why not use it as a teacher?
  • Regarding the experiment in Figure 1, do you see similar results if you measure the gains from the distilled ensemble? That would be useful to see alongside
Comment

Thank you for recognizing the contributions of our work and strongly supporting our paper.

Weakness 1 (distillation boost not carrying over under domain shift)

Thanks for the insight! We acknowledge that distillation does not help our ensemble, or even the supervised model, in the SUSANNE experiment. To be honest, this was a bit unexpected to us as well, but we have stated it honestly in the paper (to the left of the caption of Table 3). We’re happy to explore domain adaptation of unsupervised parsing more in future work.

Weakness 2: There is a lack of qualitative analysis of the types of behaviors that different model types exhibit, and how ensembling actually combines those. Some of this is done in the Appendix but it would be nice to see specific examples in the main paper, especially since that analysis is wrt constituency labels which the model isn’t actually being evaluated on.

Thanks for the suggestion. We have included a case study in the revision (which inevitably overflows to the appendix due to the volume and substance of our paper).

Question 1: How does regular RNNG perform, and why not use it as a teacher?

RNNG does not get trained well from scratch in an unsupervised manner, which is also reported in previous work [Kim et al., 2019a; Cao et al., 2020] and mentioned in Footnote 4 on Page 6. As a result, we did not use RNNG/URNNG as a teacher.

Question 2: Regarding the experiment in Figure 1, do you see similar results if you measure the gains from the distilled ensemble? That would be useful to see alongside

Thanks for the suggestion. Distillation analysis is expensive, as it requires performing inference for all teachers on the training data, training RNNG, and then refining with URNNG. For Figure 1, it requires 21 teachers' inference over the training set and training 14 RNNG/URNNG models, which may be unaffordable to us.

Nevertheless, we expect the phenomenon to hold for RNNG/URNNG, because during our development, we observed similar patterns among RNNG, URNNG, and their ensemble teacher. The F1 scores are also similar, with RNNG slightly lower and URNNG slightly higher, consistent with all PTB experiments.

Comment

Thanks for the detailed clarifications! Considering the overall scope of the paper and its contributions, I don't feel that I can raise my score any higher, but I appreciate your response.

Comment

That's totally understandable! Should there be any additional questions, we would be happy to answer them (regardless of the score).

Review
Rating: 8

This work proposes a method for combining the outputs of unsupervised parsers in a manner similar to MBR decoding, but differs in considering all possible trees. The proposed method simply assigns a score to every span: a hit count, i.e., the number of times the constituent appears in the outputs of multiple unsupervised parsers. It then runs CKY to derive the maximum-scoring tree under the hit-count score. Experiments on PTB and SUSANNE present gains over SOTA baselines.
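As a rough illustration of the described procedure (our sketch, not the authors' code; the unlabeled-binary setting, names, and span representation are assumptions), one can score every span by its hit count and run a CKY-style dynamic program to recover the highest-scoring binary tree:

```python
def mbr_tree(n, hits):
    """CKY-style search for the binary tree over n words whose spans
    maximize the total hit count (votes aggregated from teacher parses).

    hits: dict mapping (i, j) -> vote count; missing spans count as 0.
    Returns (best_total_score, set_of_spans). Runs in O(n^3).
    """
    best, back = {}, {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if length == 1:
                best[(i, j)] = hits.get((i, j), 0)
                continue
            # Pick the split point that maximizes the two subtrees' scores.
            k = max(range(i + 1, j), key=lambda s: best[(i, s)] + best[(s, j)])
            best[(i, j)] = hits.get((i, j), 0) + best[(i, k)] + best[(k, j)]
            back[(i, j)] = k

    def collect(i, j):  # read the tree back out of the split table
        if j - i == 1:
            return {(i, j)}
        k = back[(i, j)]
        return {(i, j)} | collect(i, k) | collect(k, j)

    return best[(0, n)], collect(0, n)

# Votes from three hypothetical teachers over a 4-word sentence.
hits = {(0, 4): 3, (0, 2): 2, (0, 3): 2, (2, 4): 1, (1, 3): 1}
score, spans = mbr_tree(4, hits)
print(score)          # 7
print(sorted(spans))  # [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
```

Here the consensus happens to coincide with one teacher's tree, but in general the maximizer can be a tree that no single teacher produced, which is the “generative” aspect noted in the other reviews.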

Strengths

  • The proposed method is a very simple way to combine multiple outputs from unsupervised parsing, and it might have an impact on other system-combination settings, e.g., NER with CRFs. The ensemble by hit count is sound, and its merit is demonstrated in the experiments, especially in comparison with MBR, which can consider only spans that appear in the multiple systems' outputs.

  • The experiments are well designed, and the effect of the proposed method is demonstrated empirically. This work also presents knowledge distillation using RNNG and URNNG, so it might have potential for practical application. The analysis, which compares multiple diverse systems, is also convincing.

Weaknesses

  • The comparison covers only English; it would be better to evaluate the model on other languages, e.g., Chinese, to further strengthen this submission.

Questions

  • I'd like to know the impact of input length, e.g., whether the proposed method does better on longer inputs or not.
Comment

We thank the reviewer for the great insight and strong support.

Weakness: The comparison covers only English; it would be better to evaluate the model on other languages, e.g., Chinese, to further strengthen this submission.

Thanks for pointing out that our approach currently covers only the English language. A key difficulty of a non-English ensemble is finding unsupervised teacher parsers, which are less studied in the literature. Nevertheless, our ensemble approach brings new opportunities for unsupervised multilingual and non-English parsing (such as transferring and building ensembles with the structural knowledge of different languages). We’re happy to explore this direction in the future, as mentioned in our Future Work section.

Question: I'd like to know the impact of input length, e.g., whether the proposed method does better on longer inputs or not.

Thanks for the suggestion! We actually had the length analysis in our development but didn’t include it in the submission because our paper had already overflowed to the appendix. The finding is similar to the analysis of constituency labels: our ensemble maintains high performance across different lengths, compared with teacher models.

We now show the results in Appendix B.2.

Comment

Dear reviewers,

Thank you again for your reviews. During the reviewer-author discussion period, we have addressed all the concerns and questions raised by the reviewers and revised our paper accordingly. Main improvements include fixing two typos and certain expressions, adding a length analysis, and adding a case study. Both additional analyses shed light on the underlying mechanism of our proposed approach.

As we are approaching the end of the reviewer-author discussion period, we would be grateful if the reviewers could take a look at our response and revision. Please let us know if you have more questions.

Thanks!

AC Meta-Review

This paper presents a new pipeline for unsupervised constituency parsing: (1) train an ensemble of grammar induction systems; (2) perform unlabeled-span MBR decoding with their outputs to find a consensus tree; (3) distill by training a student model on the consensus trees. The results are positive, demonstrating state-of-the-art unsupervised parsing results on PTB. Reviewers are generally in favor of acceptance, with one dissenting opinion. The main strength that reviewers saw was the effectiveness of the approach -- unsupervised constituency parsing is extremely challenging, and the approach described in this paper is relatively simple yet achieves state-of-the-art results on PTB, historically the predominant benchmark for constituency parsing. Reviewers did point out several weaknesses, however. First, it was pointed out that the paper only considers experiments on English, limiting the generality of the takeaways. Further, the experiments on SUSANNE did not follow the pattern observed on PTB, also suggesting limits to generalization from these experiments. Finally, one stated contribution of the paper is the novel dynamic program for consensus decoding; a reviewer pointed out that the algorithm is closely related to MBR decoding procedures from the classical parsing literature -- for example, running CKY on bracket posteriors or, as a special case, Max-Rule-Sum from Petrov and Klein (2007) -- something the paper acknowledges, but perhaps not as clearly as would be desirable. The authors have updated the paper to cite more of the classical MBR parsing literature, but the paper would be strengthened by making this connection even clearer.

Why not a higher score

The generality of these results may be limited given that only English is studied. The novelty of the proposed dynamic program is limited given the existence of very closely-related algorithms in the classical parsing literature.

Why not a lower score

The results are very strong, establishing a new state-of-the-art on PTB. The method is relatively simple and well motivated.

Final Decision

Accept (poster)