Towards Faithful Explanations: Boosting Rationalization with Shortcuts Discovery
Reviews and Discussion
This paper studies selective rationalization. Existing methods suffer to some extent from spurious correlations (i.e., shortcuts). The authors propose shortcuts-fused selective rationalization (SSR) to mitigate spurious correlations. More specifically, they employ semi-supervised rationalization: given an annotated dataset of rationales and labels, they train SSR on it. Then, they train an unsupervised rationalization method on the same data and use the previous model to identify spurious tokens. This new knowledge is then transferred to the unsupervised setup. Since the main method relies on annotated data, the authors propose two data augmentation techniques to mitigate the small amount of available data.
The method where one exploits a supervised rationalization model to identify spurious rationale tokens from an unsupervised model is interesting, but relies on a "large" amount of available rationales. Overall, a supervised rationalization model has to be trained to improve an unsupervised one, which greatly limits the applicability of the method, even though a data augmentation approach is proposed. I would be curious whether transferring the knowledge from one task to another could be possible to some extent (not necessarily from movie-dataset-1 to movie-dataset-2).
The experiment section is lacking unsupervised baselines and standard datasets used in selective rationalization [1-6, to cite a few, but more are missing] (these should also be included in the related work section). In terms of datasets: beers, hotels, amazon, and the other tasks of ERASER. Moreover, I would highly encourage the authors to conduct a human evaluation of the produced rationales. The relationship between the number of augmented data points and task/rationale performance is currently unclear. I would appreciate a graph showing how the performance evolves according to the number of added data.
[1] Bao et al., 2018. Deriving machine attention from human rationales. EMNLP.
[2] Chan et al., 2022. UNIREX: A Unified Learning Framework for Language Model Rationale Extraction. ICML.
[3] Antognini et al., 2021. Multi-Dimensional Explanation of Target Variables from Documents. AAAI.
[4] Chang et al., 2019. A Game Theoretic Approach to Class-wise Selective Rationalization. NeurIPS.
[5] Antognini and Faltings, 2021. Rationalization through Concepts. ACL.
[6] Yu et al., 2019. Rethinking Cooperative Rationalization: Introspective Extraction and Complement Control. EMNLP.
POST-REBUTTAL: Thank you for your additional detailed experiments. I am satisfied with your answer and will increase my score.
Strengths
- Interesting framework to leverage supervised and unsupervised rationalization models
- The performance (although not the same configuration each time) is close to supervised baselines
Weaknesses
- the clarity of the paper could be improved, especially section 3
- weak experiment section: more baselines, datasets, and analysis would be required
- lack of human evaluation
Questions
- How would Sup-RAT perform with data augmentation?
- Could you also report metrics regarding comprehensiveness and sufficiency to assess the improvement of the proposed approach to decrease spurious correlation?
Thank you for your time and insightful suggestions! According to your comments, we conduct additional experiments and provide the responses as follows:
Comment 1: I would be curious whether transferring the knowledge from one task to another could be possible to some extent (not necessarily from movie-dataset-1 to movie-dataset-2).
This is an interesting question! We conduct experiments to illustrate this. As shown in the table, we first train SSR and AT-BMC (the state-of-the-art (SOTA) supervised rationalization approach) on the BoolQ dataset, and then evaluate them on Movie and MultiRC.
From the experimental results, we can see that neither SSR nor AT-BMC achieves good results.
The reason may be that the data distributions are too different.
| Transfer | SSR Task | SSR Token-F1 | AT-BMC Task | AT-BMC Token-F1 |
|---|---|---|---|---|
| BoolQ→Movie | 48.7 | 19.4 | 50.3 | 19.3 |
| BoolQ→MultiRC | 46.0 | 20.7 | 45.2 | 28.5 |
Comment 2: The relationship between the number of augmented data points and task/rationale performance is currently unclear. I would appreciate a graph showing how the performance evolves according to the number of added data.
Here, we compare different percentages (0-25%) of random and semantic DA on the Evidence Inference dataset.
From the experimental results, we can observe that both task and rationale performance improve as the amount of added data increases. The rationale performance improves more markedly, with a significant gain once the added data reaches 15% of the total.
| Method | Metric | 0% | 5% | 10% | 15% | 20% | 25% |
|---|---|---|---|---|---|---|---|
| + random DA | Task | 46.8±0.3 | 47.3±0.5 | 47.7±0.4 | 48.5±0.6 | 49.2±0.4 | 46.0±0.1 |
| + random DA | Token-F1 | 26.8±0.2 | 28.5±0.3 | 29.0±0.5 | 30.3±0.2 | 30.7±0.7 | 33.1±0.2 |
| + semantic DA | Task | 46.8±0.3 | 46.6±0.1 | 47.1±0.3 | 47.6±0.7 | 48.0±0.7 | 48.7±0.2 |
| + semantic DA | Token-F1 | 26.8±0.2 | 28.9±0.4 | 29.3±0.5 | 31.9±0.4 | 32.1±0.5 | 33.5±0.4 |
Comment 3: The clarity of the paper could be improved, especially Section 3.
Thank you for your suggestion; we clarify the details of the method below. Besides, we have added an illustration of the algorithm in the revised version (Appendix A.1 and A.2).
Injecting Shortcuts into Prediction.
Based on the identified shortcuts, we hope that the model can disentangle the shortcut features from the input features during training, thus alleviating the problem of using shortcuts in the data for task prediction. Specifically:
1. The predictor takes the input $x$ to yield task results.
2. We let the predictor take the shortcut tokens $s$ and align its output with a uniform class distribution:
$$\mathcal{L}_{unif} = \mathrm{KL}\big(p(y \mid s) \,\|\, u\big),$$
where $\mathrm{KL}$ represents the Kullback–Leibler divergence, $C$ is the total number of classes, and $u = (1/C, \dots, 1/C)$ denotes the uniform class distribution. Based on the original loss $\mathcal{L}_{task}$, we add this "uniform" constraint to ensure the predictor identifies the shortcut features as meaningless features and disentangles them from the input features.
3. The feature information learned in the supervised phase can be transferred into the unsupervised phase through the shared encoder.
4. We denote SSR with this strategy as SSR_unif, and the objective of SSR_unif can be defined as the sum of the losses: $\mathcal{L}_{SSR\text{-}unif} = \mathcal{L}_{task} + \mathcal{L}_{unif}$.
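A minimal PyTorch sketch of how such a uniform constraint could be implemented; the module names and the KL direction (we use KL(p‖u)) are our assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def uniform_kl_loss(logits: torch.Tensor) -> torch.Tensor:
    """Push the predictor's distribution on shortcut tokens towards
    the uniform distribution over C classes: KL(p || u)."""
    num_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)                # log p(y | shortcuts)
    uniform = torch.full_like(log_p, 1.0 / num_classes)  # u = (1/C, ..., 1/C)
    # F.kl_div(input=log_q, target=p) computes KL(p || q); passing the
    # log-uniform as input and p as target yields KL(p || u).
    return F.kl_div(uniform.log(), log_p.exp(), reduction="batchmean")

# task_logits = predictor(input_ids)          # step 1: prediction on the input
# shortcut_logits = predictor(shortcut_ids)   # step 2: prediction on shortcuts only
# loss = F.cross_entropy(task_logits, labels) + uniform_kl_loss(shortcut_logits)
```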
Virtual Shortcuts Representations.
Since it is difficult to annotate rationales at scale and thereby obtain shortcut tokens to improve the performance, we propose a virtual shortcuts representations strategy (SSR_virt) with the shortcut knowledge transferred from the supervised phase as guidance.
Specifically, SSR_virt contains two phases (i.e., the supervised and unsupervised phases).
In the supervised phase:
1. We first adopt another predictor to predict task results based on the shortcuts $s$, and ensure the encoder captures sufficient shortcut representations by minimizing a cross-entropy loss $\mathcal{L}_{s}$ on these shortcut-based predictions.
2. Then, we employ an additional shortcut imitator $f_v$ which takes the full input $x$ (rather than $s$) as input and learns to align with and mimic the shortcut representation $E(s)$ by minimizing the squared Euclidean distance of these two representations:
$$\mathcal{L}_{mse} = \big\| f_v(x) - E(s) \big\|_2^2,$$
where the shortcut imitator and the encoder share parameters.
In the unsupervised phase:
1. We keep $f_v$ frozen and employ it to generate virtual shortcuts representations by taking the unlabeled input $x$ as input.
2. After that, to encourage the model to remove the effect of shortcuts on task predictions, we push the prediction on the virtual shortcuts representation to match a uniform distribution by calculating:
$$\mathcal{L}_{virt} = \mathrm{KL}\big(p(y \mid f_v(x)) \,\|\, u\big),$$
where $u = (1/C, \dots, 1/C)$.
3. We set the two predictors to share parameters to transfer the shortcut information into the task predictor.
Formally, the final objective of SSR_virt is $\mathcal{L}_{SSR\text{-}virt} = \mathcal{L}_{task} + \mathcal{L}_{s} + \mathcal{L}_{mse} + \mathcal{L}_{virt}$.
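The following PyTorch sketch illustrates the two phases of SSR_virt as described above; all names are ours, and the stop-gradient on the encoder target is an assumption made for the sketch:

```python
import torch
import torch.nn.functional as F

def imitation_loss(encoder, imitator, input_ids, shortcut_ids):
    """Supervised phase: the imitator learns to reproduce the encoder's
    representation of the discovered shortcuts from the full input alone."""
    with torch.no_grad():                # assumed stop-gradient on the target
        target = encoder(shortcut_ids)   # representation of real shortcuts
    virtual = imitator(input_ids)        # the imitator only sees the full text
    return F.mse_loss(virtual, target)   # squared Euclidean alignment

def virtual_uniform_loss(imitator, predictor_head, input_ids):
    """Unsupervised phase: freeze the imitator, treat its output as a
    'virtual' shortcut representation, and push its prediction to uniform."""
    for p in imitator.parameters():
        p.requires_grad_(False)          # keep the imitator frozen
    logits = predictor_head(imitator(input_ids))
    num_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_p, 1.0 / num_classes)
    return F.kl_div(uniform.log(), log_p.exp(), reduction="batchmean")
```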
Comment 4: The experiment section is lacking unsupervised baselines and standard datasets used in selective rationalization. In terms of datasets: beers, hotels, amazon, and the other tasks of ERASER.
Thank you very much for your suggestion!
First, our method is a semi-supervised method that requires partially labeled rationales. Since there are no labeled rationales in the Beer and Hotel datasets, these datasets are not well suited for our task. To this end, we evaluate our method on FEVER (a dataset in ERASER [3]) and add some new baselines. Specifically, we add CAR [1] and 3Players [2] as our unsupervised baselines. Besides, we already compared with the UNIREX method you mentioned in the original manuscript. Please refer to Table 1 for details.
| Method (FEVER) | Task | Token-F1 |
|---|---|---|
| Vanilla Un-RAT | 71.3±0.4 | 25.4±0.7 |
| IB | 84.7±0.0 | 42.7±0.0 |
| INVRAT | 83.6±1.8 | 41.4±1.4 |
| Inter-RAT | 85.1±0.5 | 43.0±0.8 |
| MCD | 84.4±0.6 | 44.6±0.2 |
| Vanilla Semi-RAT | 82.6±0.6 | 40.7±0.8 |
| IB (25% rationales) | 88.8±0.0 | 63.9±0.0 |
| WSEE | 84.3±0.3 | 44.9±0.5 |
| ST-RAT | 89.0±0.0 | 39.0±0.0 |
| Vanilla Sup-RAT | 83.6±1.4 | 68.9±0.9 |
| Pipeline | 87.7±0.0 | 81.2±0.0 |
| UNIREX | 81.1±0.8 | 70.9±0.5 |
| AT-BMC | 82.3±0.3 | 71.1±0.6 |
| SSR_unif | 86.8±0.9 | 46.6±0.2 |
| SSR_virt | 87.1±0.4 | 47.0±0.5 |
| Method | Movie | | MultiRC | | BoolQ | | Evidence Inference | | FEVER | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 |
| CAR | 84.8±0.2 | 28.3±0.5 | 60.1±0.4 | 26.6±0.8 | 63.9±0.7 | 19.3±0.4 | 45.8±0.3 | 8.8±0.6 | 83.8±0.5 | 40.9±0.7 |
| 3Player | 85.6±0.6 | 29.1±0.3 | 59.3±0.9 | 27.8±0.5 | 64.5±0.2 | 19.6±0.1 | 44.9±0.6 | 11.9±0.7 | 84.0±0.4 | 40.4±0.1 |
Comment 5: How would Sup-RAT perform with data augmentation?
We have compared with baselines using the same data augmentation methods in Table 2 and Appendix C.2. Below, we run Sup-RAT with data augmentation and also add the results to the revised paper. From the results, we can observe that our SSR achieves competitive results, especially in Token-F1, with the same amount of data.
| Method (random DA) | Movie | | MultiRC | | BoolQ | | Evidence | |
|---|---|---|---|---|---|---|---|---|
| | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 |
| Sup-RAT | 93.0±0.4 | 39.1±0.3 | 64.4±0.6 | 60.6±0.2 | 62.1±0.4 | 51.9±0.7 | 52.8±0.2 | 18.5±0.3 |
| SSR_unif | 90.7±0.3 | 34.5±0.1 | 63.6±0.5 | 56.1±0.3 | 61.3±0.7 | 48.3±0.5 | 46.0±0.1 | 33.1±0.2 |
| SSR_virt | 92.8±0.2 | 36.7±0.2 | 65.4±0.2 | 44.3±0.4 | 58.3±0.6 | 47.7±0.3 | 46.5±0.3 | 32.4±0.2 |
| Method (semantic DA) | Movie | | MultiRC | | BoolQ | | Evidence | |
|---|---|---|---|---|---|---|---|---|
| | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 | Task | Token-F1 |
| Sup-RAT | 92.9±0.3 | 39.9±0.3 | 65.1±0.4 | 59.9±0.7 | 62.2±0.3 | 52.3±0.5 | 53.1±0.1 | 19.9±0.3 |
| SSR_unif | 90.7±0.2 | 35.6±0.2 | 64.7±0.7 | 42.7±0.4 | 58.0±0.3 | 50.2±0.3 | 48.7±0.2 | 33.5±0.4 |
| SSR_virt | 87.6±0.3 | 36.9±0.1 | 66.2±0.5 | 49.8±0.4 | 61.1±0.3 | 48.8±0.2 | 46.5±0.4 | 31.1±0.2 |
Comment 6: Could you also report metrics regarding comprehensiveness and sufficiency to assess the improvement of the proposed approach in decreasing spurious correlations?
Thank you for bringing new metrics to us! We use the comprehensiveness and sufficiency metrics from ERASER [3] and report the corresponding results in the table below. As shown in the table, SSR_unif and SSR_virt still perform better than the baselines.
| Model | IMDB suff (↓) | IMDB comp (↑) | SST-2 suff (↓) | SST-2 comp (↑) |
|---|---|---|---|---|
| Vanilla Un-RAT | 0.22 | 0.33 | 0.38 | 0.18 |
| Vanilla Semi-RAT | 0.15 | 0.40 | 0.24 | 0.23 |
| WSEE | 0.13 | 0.42 | 0.20 | 0.26 |
| SSR_unif | 0.10 | 0.41 | 0.14 | 0.32 |
| SSR_virt | 0.07 | 0.45 | 0.11 | 0.35 |
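For reference, a minimal sketch of how these two ERASER metrics can be computed, assuming a classifier that maps token ids to class logits; zeroing out masked ids is a simplification (actual implementations usually substitute pad or mask tokens):

```python
import torch

@torch.no_grad()
def suff_comp(model, input_ids, rationale_mask):
    """suff = p(y|x) - p(y|r)    (lower is better)
    comp = p(y|x) - p(y|x\\r)    (higher is better)"""
    p_full = model(input_ids).softmax(-1)                         # p(y | x)
    y = p_full.argmax(-1)                                         # predicted class
    p_rat = model(input_ids * rationale_mask).softmax(-1)         # p(y | rationale)
    p_rest = model(input_ids * (1 - rationale_mask)).softmax(-1)  # p(y | x \ rationale)
    idx = torch.arange(y.size(0))
    suff = (p_full[idx, y] - p_rat[idx, y]).mean().item()
    comp = (p_full[idx, y] - p_rest[idx, y]).mean().item()
    return suff, comp
```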
References
[1] Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. A game theoretic approach to class-wise selective rationalization. In NeurIPS 2019.
[2] Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. Rethinking cooperative rationalization: Introspective extraction and complement control. In EMNLP 2019.
[3] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In ACL 2020.
Comment 7: Lack of human evaluation
Thank you for pointing this out; we add a human evaluation. Specifically, we randomly select 100 samples from the Movie dataset and compare SSR with Inter_RAT and WSEE. From the observations in the table, we find SSR outperforms Inter_RAT and WSEE in all metrics, illustrating the effectiveness of SSR.
Besides, in our original manuscript, we used ChatGPT for the subjective evaluation (Appendix C.3).
From the results, we find that human evaluation and ChatGPT evaluation (Appendix C.3) yield a similar partial order among the models. This further illustrates the effectiveness of our method.
| Model | Usefulness | Completeness | Fluency |
|---|---|---|---|
| Inter_RAT | 3.69 | 3.53 | 3.88 |
| WSEE | 3.82 | 3.78 | 4.05 |
| SSR | 3.90 | 3.88 | 4.20 |
Detailed standards for human evaluation:
Usefulness
Q: Do you think the selected rationales can be useful for explaining the predicted labels?
• 5: Exactly. Selected rationales are useful for me to get the correct label.
• 4: Highly useful. Although several tokens have no relevance to the correct label, most selected tokens are useful to explain the labels.
• 3: Half of them are useful. About half of the tokens are useful for getting labels.
• 2: Almost useless. Almost all of the tokens are useless.
• 1: No use. The selected rationales are useless for identifying labels.
Completeness
Q: Do you think the selected rationales are enough for explaining the predicted labels?
• 5: Exactly. Selected rationales are enough for me to get the correct label.
• 4: Highly complete. Several tokens related to the label are missing.
• 3: Half complete. There are still some important tokens that have not been selected, and they are in nearly the same number as the selected tokens.
• 2: Somewhat complete. The selected tokens are not enough.
• 1: Nonsense. None of the important tokens is selected.
Fluency
Q: Do you think the selected rationales are fluent?
• 5: Very fluent.
• 4: Highly fluent.
• 3: Partially fluent.
• 2: Very unfluent.
• 1: Nonsense.
Comment 8: Missing related work
Thank you for your suggestion; we have added the mentioned related work to the revised version of the paper.
The paper "Boosting selective rationalization with shortcuts discovery" proposes an extension to selective text rationalisation methods by using so-called shortcuts in analysis and prediction for text. Here, rationalisation is an attempt to find fragments that influence the final classification. The authors note that frequently, in unsupervised approaches, algorithms latch onto so-called shortcuts, which, while they may be strongly, but spuriously, correlated with the final classification, do not in any way explain the real reasons for a given classification. They therefore suggest a combination of supervised algorithms, where the true rationales (input elements influencing the classifications) are predefined by experts, and unsupervised methods, where the rationales are searched for. The proposal is thus to exclude those found unsupervised rationalizations that are not defined by experts. By excluding these unnecessary shortcuts, the proposed SSR algorithm achieves results that are similar to SOTA approaches, beating many other approaches.
The use of ChatGPT in appendix C.2 where it, essentially, selects the proposed approach over other methods, is nice and might be entertaining, but it does not introduce anything to the problem at hand. I would remove it, if I were you. But you, naturally, may do as you please.
The work on this subject is clearly very much needed these days. On the other hand, I think that this paper does not introduce new ideas. The use, or rather the exclusion of the “shortcuts” in a model, does not introduce enough new knowledge to push the model prediction understanding much.
Strengths
The authors consider an important problem of rationalisation, understood as the selection of the parts of the classified input that have the greatest impact on the classification of the problem. The issue considered is one of natural language processing. The solution is described in great detail in the form of a mathematical derivation, which is a strength, but the excessive detail sometimes makes the article hard to read. It has the advantage of attempting to combine supervised and unsupervised approaches in order to exclude from the found rationalisations those fragments that only have spurious correlations with the output, but do not explain anything.
Weaknesses
- The concept of a shortcut itself is very vague, and the definition and proposed selection algorithm (as described above, by discarding the undefined) is too simplistic. The authors use a single example of such a spurious correlation, which does not explain much of the decision, throughout the whole article.
- The entire article is written in a way that is difficult to understand. Lots of equations, with several variables with stacked indices, reduce the readability and the clearness of what the authors want to achieve.
- There is no clearly stated hypothesis at the beginning of the text. The approach may be obvious to the author, but readers will not understand and will abandon reading before the end.
- The authors introduce a number of loss functions which can be used in different configurations, with no clear intuition which should be used and why.
- The authors introduce "data augmentation" (DA) approaches into their model. However, they seem to have forgotten the MixedDA solution, which is in the tables but not in the description (Section 3.3). The models with and without augmentation are compared, and in some cases of algorithms or data one type of augmentation gives better results, but these results are not consistent (see Table 1). It seems to me that the augmentation ideas do not really matter; the differences are small, in any case.
- The text introduces a great deal of formal notation, both when describing existing methods and the authors' own proposal. This does not make it easy to read, as much of it does not explain the next steps in any way, such as the definition of the Gumbel-softmax on page 3.
Questions
- The definition and suggested algorithm for selection (as described above, by rejecting the undefined) is too simplistic. The authors use only one example ('received a lukewarm approach' in the film review) throughout the article. Doesn't such a solution reduce the proposal to a supervised approach? Please use another example of a shortcut.
- The entire article is written in a way that is difficult to understand. There is no clearly stated hypothesis at the beginning of the text. The approach may be obvious to the authors, but readers will not understand and will abandon reading before the end. Could you clearly state your hypothesis in the introduction?
- Is the whole of section 2 your proposition, or the definition of possible solutions used now? It is not clear.
- A number of cost functions are given that are to be utilized in different configurations. Since the background models are quite complex (Transformer, both as encoder and encoder/predictor, as well as other generative models) these loss functions tend to be complex too. Some might be removed with more intuition on the more important in exchange.
- What actually is the “shortcut imitator” (pages 5 and 6 and later), what is it used for?
- Please explain what is the MixedDA augmentation in subsection 3.3.
- In the results tables, those with the best value are bold-faced as the best. But the mean differences may be as low as 0.1%, which is not, in any way, statistically significant. The authors should perform some statistical analysis and group the algorithms into groups of statistical equivalence (see e.g. Demsar, Statistical comparisons of classifiers over multiple data sets, JMLR 7, 2006; open software for that approach is available).
Less essential:
- Difficulty in reading may come from poor English. I suggest the involvement of a native speaker.
- Instead of very many formulas, drawing diagrams of the methods should be shown, which would explain much more. Also, nothing is contributed by the detailed descriptions of all cost functions in the description of the general methods as well as the own proposal.
- In Table 3 the result 90.7 for SSR_unif with DA is chosen as best, even though the WSEE with DA mean of 91.0 seems better. Is that perhaps a typing error?
We appreciate your comments! To address your concerns, below we justify the details of our proposed method and conduct more experiments.
Comment 1: The definition and suggested algorithm for selection are too simplistic.
Thank you for your comments; below we present the details of the definition and the suggested algorithm for selection.

Definition: Given the text input $x = (x_1, \dots, x_n)$, the goal of selection is to first generate a mask variable $m = (m_1, \dots, m_n) \in \{0, 1\}^n$, where $m_i$ indicates whether the $i$-th token is a part of the rationale. Finally, the selection function selects the rationales as $r = m \odot x$. Briefly, the goal of the selection is to extract a subsequence from the original text to be utilized as the rationale supporting the prediction results.

Algorithm: In order to decide whether each token is a rationale token, we reduce rationale extraction to a token-level binary classification task, i.e., predicting the probability that a token is a rationale token. Specifically, we map each token to the probability of being selected as part of the rationale: $p_i = \sigma(w^{\top} e_i)$, where $e_i \in \mathbb{R}^d$ is the $d$-dimensional encoding of token $x_i$ produced by an encoder, and $\sigma$ is the sigmoid function.
Then, in order to train the model to be able to extract rationales, there are different training approaches for unsupervised and supervised rationalization:
For unsupervised rationalization, due to the lack of labeled rationales, the selection function must be cascaded with the predictor, and the rationale learning signal must come from comparing the predictor's output with the ground-truth task label.
For supervised rationalization, since the labeled rationale is known, we can use the token-level binary cross-entropy (BCE) cost function to train supervised rationalization.
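As a minimal sketch of the selector described above (the module and variable names are ours, not the paper's):

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Token-level binary classifier: maps each token encoding e_i to
    the probability p_i of being selected as part of the rationale."""
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                # e.g. a BERT-style encoder
        self.head = nn.Linear(hidden_dim, 1)  # implements w^T e_i

    def forward(self, input_ids):
        h = self.encoder(input_ids)           # (batch, seq_len, hidden_dim)
        return torch.sigmoid(self.head(h)).squeeze(-1)

# Supervised rationalization: token-level BCE against annotated rationale masks.
# probs = selector(input_ids)
# loss = nn.functional.binary_cross_entropy(probs, gold_mask.float())
```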
Comment 2: The authors use only one example ('received a lukewarm approach' in the film review) throughout the article. Doesn't such a solution reduce the proposal to a supervised approach? Please use another example of a shortcut.
First, the shortcut does not reduce the proposal to a supervised approach. Specifically, the example we give in the paper is a real example from the Movie dataset, where the shortcut 'received a lukewarm approach' is obtained by our shortcut discovery strategy, not by a priori human annotation. Besides, it is noted that our shortcut discovery strategy relies on gold rationale annotations. However, since rationale annotations are difficult to obtain for most tasks, shortcuts cannot reduce the proposal to a supervised approach in most cases. Therefore, in our SSR, we use a semi-supervised approach. Finally, we also visualize some of the shortcuts obtained by our shortcut discovery strategy in Appendix C.5, and we further illustrate one of them here:
We visualize another example on Movies (the label is positive), where the italic tokens represent the real rationales, and the bolded ones are the predicted rationales:
| Model | Visualized example | Predicted label |
|---|---|---|
| Vanilla Un-RAT | Mozart is a famous musician and amadeus is a biographical film about him , amadeus is a true work of art . it is one of those few movies of the 80 ' s that will be known for its class , its style , and its intelligence. why is this such a good film... | positive |
| SSR | Mozart is a famous musician and amadeus is a biographical film about him , amadeus is a true work of art . it is one of those few movies of the 80 ' s that will be known for its class , its style , and its intelligence. why is this such a good film... | positive |
From the results, we can find that although both Vanilla Un-RAT and SSR correctly predict the label as positive, Vanilla Un-RAT still extracts shortcuts as rationales. Specifically, "Mozart is a famous musician" is the shortcut. Although Mozart was a great musician, this has no relevance to how good his biographical film is. SSR avoids these shortcuts, while Vanilla Un-RAT extracts them as rationales.
Comment 3: Could you clearly state your hypothesis in the introduction?
Thank you very much for your suggestion; stating our assumptions in the introduction is indeed beneficial for readers. Therefore, we have revised the introduction (the blue text) in the revised version and uploaded it to ICLR. Please review it, and if there are any problems, please let us know.
Comment 4: Is the whole of Section 2 your proposition, or the definition of possible solutions used now? It is not clear.
Thank you again for your suggestions. Based on them, we have reorganized Section 2 and updated it in the revised version.
Comment 5: A number of cost functions are given that are to be utilized in different configurations. Since the background models are quite complex (Transformer, both as encoder and encoder/predictor, as well as other generative models), these loss functions tend to be complex too. Some might be removed with more intuition on the more important ones in exchange.
In rationalization studies, deep learning models such as BERT and other Transformer-based methods are widely used in order to obtain better performance. To make a fair comparison with these methods, we also use BERT as the encoder/predictor in our paper. Although this increases the complexity of the model, it yields performance improvements. In the future, we will further investigate how to reduce the complexity of the model while maintaining the performance.
In addition, our loss functions can be divided into two parts. The first part is the traditional supervised/semi-supervised/unsupervised losses, such as Eqs. 2, 3, and 4. These losses cannot be simplified.
The second part is the newly added loss functions for SSR. Specifically, for Strategy 1, compared to the traditional loss, we only add Eq. 6 to ensure that the predictor identifies the shortcut features as meaningless features, thus mitigating the effect of shortcuts on the model's predictions. The KL divergence in Eq. 6 is very easy to implement and does not increase the complexity of the loss function.
For Strategy 2, we introduce a shortcut imitator, which assumes the role of learning shortcuts (in the supervised phase) and of shortcut simulation (in the unsupervised phase). First, in the supervised phase, we use Eq. 7 to capture the features of the shortcut; essentially, Eq. 7 is a cross-entropy loss like the standard task loss. Afterward, in order to learn the shortcut knowledge, we introduce the shortcut imitator to fit the distribution of shortcuts by using the original text as input with Eq. 8 (a simple MSE loss). Finally, in the unsupervised phase, since the shortcuts are unknown, we use the shortcut imitator to obtain a potential shortcut representation and map it to a uniform distribution using Eq. 9 to identify the shortcut features as meaningless ones; this operation is similar to Eq. 6. Compared with the traditional method, we have added three loss functions, but our results are also significantly improved, which shows that all these loss functions are important for SSR. In the future, we will explore new methods to reduce the number of loss functions and simplify the model.
Thank you for your suggestions!
Comment 6: What actually is the "shortcut imitator" (pages 5 and 6 and later), and what is it used for?
In the supervised phase, we can get potential shortcuts using the shortcut discovery strategy. However, in the unsupervised phase, it is difficult to obtain potential shortcuts because we lack human-labeled rationale tokens. To this end, we introduce the shortcut imitator, which is expected to learn the representation of shortcuts in the supervised phase and transfer the learned shortcut information to rationale extraction in the unsupervised phase.
Comment 7: Please explain what the MixedDA augmentation in Subsection 3.3 is.
We are very sorry that we neglected to describe the MixedDA augmentation. MixedDA refers to mixing the data obtained from random DA with the data obtained from semantic DA to achieve data augmentation.
Comment 8: In the results tables, those with the best value are bold-faced as the best. But the mean differences may be as low as 0.1%, which is not statistically significant in any way. The authors should perform some statistical analysis and group the algorithms into groups of statistical equivalence (see e.g. Demsar, Statistical comparisons of classifiers over multiple data sets, JMLR 7, 2006; open software for that approach is available).
From a statistical point of view, there is no significant difference between SSR and its data-augmented variants. However, in this work, we are more interested in how effective the models are in practice. For example, SSR+random DA outperforms SSR on most metrics, even though the statistical difference is not significant. In practical applications, we prefer SSR+random DA to SSR, because in some scenarios where decision making is important and the amount of data is large, even a minor improvement may bring huge economic returns. Therefore, we prefer models with excellent extraction and prediction effectiveness in practice. In the future, we will also explore new counterfactual data augmentation methods to further improve the performance of SSR+DA.
Comment 9: The text introduces a great deal of formal notation, both when describing existing methods and the authors' own proposal. This does not make it easy to read, as much of it does not explain the next steps in any way, such as the definition of Gumbel-softmax on page 3.
Gumbel-Softmax is a commonly used technique for handling discrete data in generative models and optimization problems. It combines the Gumbel distribution and the Softmax function to sample from a discrete probability distribution. The key idea is to introduce noise from the Gumbel distribution and then transform this noise into a sample from a discrete distribution using the Softmax function.
The process of Gumbel-Softmax can be summarized as follows:
Sampling noise from the Gumbel distribution: First, a noise vector $g$ is sampled from the Gumbel(0, 1) distribution, where 0 is the location parameter and 1 is the scale parameter. This noise vector introduces randomness into the sampling process.
Computing the Gumbel-Softmax sample: Next, the noise vector is added to the log-probabilities of the discrete distribution. Then, the Softmax function is applied to obtain a sample from the discrete distribution. Specifically, for a discrete distribution with probabilities $\pi_1, \dots, \pi_k$, the Gumbel-Softmax sample $y$ is calculated as follows:
$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{k} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_i \sim \mathrm{Gumbel}(0, 1),$$
where $\tau$ is the temperature parameter that controls the smoothness of the sample. Higher temperature values result in smoother samples, while lower temperature values tend to produce more one-hot-like vectors that represent discrete values. By adjusting the temperature parameter $\tau$, the randomness and smoothness of the Gumbel-Softmax samples can be controlled. As $\tau$ approaches 0, the sample tends to a one-hot vector, where only one element is 1 and the rest are 0, resembling a hard categorical sample. As $\tau$ approaches positive infinity, the sample tends to a uniform distribution, where all elements have equal probabilities.
The benefit of Gumbel-Softmax is that it provides a differentiable approximation for sampling discrete variables, making it compatible with optimization algorithms such as gradient descent. This property makes it applicable in unsupervised rationalization to sample rationale tokens.
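PyTorch ships this relaxation as `torch.nn.functional.gumbel_softmax`; a short usage sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 128, 2)  # per-token (not-select, select) logits

# Soft, differentiable sample; tau controls the smoothness.
soft_sample = F.gumbel_softmax(logits, tau=0.5, hard=False)

# hard=True returns one-hot samples in the forward pass while routing
# gradients through the soft sample (straight-through estimator).
hard_mask = F.gumbel_softmax(logits, tau=0.5, hard=True)[..., 1]
```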
Comment 10: Difficulty in reading may come from poor English. I suggest the involvement of a native speaker. & Instead of very many formulas, diagrams of the methods should be shown, which would explain much more. Also, nothing is contributed by the detailed descriptions of all cost functions in the description of the general methods as well as the own proposal.
Thank you very much for your suggestion. We will invite native speakers to help us polish the paper, and will further draw diagrams of the methods to explain the process of SSR.
Comment 11: In Table 3 the result 90.7 for SSR_unif with DA is chosen as best, even though that of WSEE with DA of 91.0 mean seems better. That is perhaps a typing error, or is it?
Thank you, this is a typing error. We have corrected it in the revised version.
Comment 12: The use of ChatGPT in Appendix C.2, where it, essentially, selects the proposed approach over other methods, is nice and might be entertaining, but it does not introduce anything to the problem at hand. I would remove it, if I were you.
Thank you very much for your suggestion. Our goal in using ChatGPT is to use it as a replacement for human subjective evaluation of the extracted rationales. Below, we show our added human evaluation experiment:
Specifically, we randomly select 100 samples from the Movie dataset and compare SSR with Inter_RAT and WSEE. From the observations in the table, we find SSR outperforms Inter_RAT and WSEE in all metrics, illustrating the effectiveness of SSR.
Meanwhile, we find that human evaluation and ChatGPT evaluation (Appendix C.3) yield a similar partial order among the models. This further illustrates the effectiveness of our method.
| Model | Usefulness | Completeness | Fluency |
|---|---|---|---|
| Inter_RAT | 3.69 | 3.53 | 3.88 |
| WSEE | 3.82 | 3.78 | 4.05 |
| SSR | 3.90 | 3.88 | 4.20 |
Detailed standards for human evaluation:
Usefulness
Q: Do you think the selected rationales can be useful for explaining the predicted labels?
• 5: Exactly. Selected rationales are useful for me to get the correct label.
• 4: Highly useful. Although several tokens have no relevance to the correct label, most selected tokens are useful to explain the labels.
• 3: Half of them are useful. About half of the tokens are useful for getting labels.
• 2: Almost useless. Almost all of the tokens are useless.
• 1: No use. The selected rationales are useless for identifying labels.
Completeness
Q: Do you think the selected rationales are enough for explaining the predicted labels?
• 5: Exactly. Selected rationales are enough for me to get the correct label.
• 4: Highly complete. Several tokens related to the label are missing.
• 3: Half complete. There are still some important tokens that have not been selected, and they are in nearly the same number as the selected tokens.
• 2: Somewhat complete. The selected tokens are not enough.
• 1: Nonsense. None of the important tokens is selected.
Fluency
Q: Do you think the selected rationales are fluent?
• 5: Very fluent.
• 4: Highly fluent.
• 3: Partially fluent.
• 2: Very unfluent.
• 1: Nonsense.
Dear authors,
thank you for all the detailed replies to my comments. Generally, I am satisfied; here are some more questions on your comments:
Comments 1-4: generally I am satisfied, as it is true that the shortcuts do not reduce the problem to a supervised solution. The intro is better organised now.
Comment 5: the whole construction of Eqs. (2)-(6) seems viable now. My comment was more connected with Eqs. (7)-(8) and the concept of a "shortcut imitator", which is still not clear as long as one reads the text without your comment above. Perhaps add the explanation of the idea to the text, at the same time making use of the diagrams. Also, make the diagrams larger, since some are unreadable, e.g. Figs. 2 and 3, at the cost of some spurious and too detailed definitions.
Comment 8: If there are no significant differences, then simply say so, instead of bold-facing your algorithm's results in the tables. The methods I have suggested could divide the approaches into groups of statistically equal methods. And, as I wrote, these statistical methods are simple, with open software available.
Comment 9: Perhaps my question was not clear, mea culpa, as the Gumbel-softmax is well known. But it is great you have added it in the appendix.
Comment 12: It is very good that you have now extended the ChatGPT part. It might be true that using ChatGPT is more viable than using some small group of people to validate. What was the number of people in the human evaluation group that you added? It would be welcome to include this as well.
I am somewhat satisfied with your corrections and shall add some points to my evaluation.
The paper addresses selective rationalization, an NLP problem where the aim is to find a piece of text from the input, called a "rationale", that directly justifies the label (selector problem), and then use the rationale as input for classification (predictor problem). Unlike "rationales", there are pieces of text called "shortcuts" that can result in correct predictions but are not proper justifications of the label. The exact difference between rationales and shortcuts is not clearly defined in the paper, so it seems that the distinction is problem-specific and ultimately something that is left to the practitioner. Previous selective rationalization methods can be categorized into supervised, where the rationales are provided during training, and unsupervised, where they have to be inferred during the training process. The paper proposes a semi-supervised approach (an extension of a couple of recent papers) where a selector model trained in the unsupervised way is applied to smaller supervised data to identify the shortcuts and retrain the selector model. In addition, two data augmentation methods are proposed to enrich the shortcut discovery. The proposed approach is evaluated on 4 datasets. The model outperforms previous SOTA unsupervised and semi-supervised methods on both rationale prediction and label prediction tasks. The performance is also comparable to the supervised baselines.
Strengths
- The paper introduces innovative techniques to prevent the model from erroneously considering shortcuts as rationales during predictions. Since the objective is to answer the question “Which text span leads to the final conclusion”, this approach can be beneficial when there are misleading shortcuts in the text. Such predictions can help us analyze the reasoning ability of large language models (LLM).
- The authors conducted comprehensive experiments on more than 10 variants of the proposed approach. Informative discussions are also presented. Tests are performed on four datasets from diverse domains and results are compared with SOTA baselines from all 3 groups of previous methods. Such rich experiments clearly show how each add-on piece affects the overall model performance. The limitations and potential reasons for certain outcomes are also deeply analyzed and discussed.
- The data augmentation opens up a new approach to enriching labeled data in this area. Instead of using LLMs to augment new instances, which can be inefficient in both computing resources and cost, the authors propose to use random/similar tokens to replace shortcuts. Such simple approaches, especially the random one, surprisingly yield promising results.
Weaknesses
- The task of finding shortcuts remains ambiguous. It's unclear whether shortcuts always exist in the text. If they aren’t, what would the model predict? The usage of this system seems limited.
- There's a need for a more robust analysis of prior methodologies, particularly the unsupervised ones. The author asserts that unsupervised methods frequently identify shortcuts as inefficient rationales and provides examples. Yet, the reader would be curious about how often that happens, and if the model already yields good results, why do we have to give up shortcuts? I suggest a stronger argument for why finding a good rationale is significant, such as it can be helpful for other reasoning tasks.
- The manuscript's writing style can be perplexing in several sections, notably in the Methodology (Section 3). Mathematical expressions should be consistent, straightforward, and clear, and should only be included when indispensable. For example, instead of writing an algorithm for the semantic data augmentation, the reader might be more interested in seeing how specifically you do the semantically similar word retrieval in Appendix A.
- It is hard to understand why sharing selector parameters in both supervised/unsupervised phases is not the default setting. Since the claim is that this is a specific setting in the proposed approach, it would be important to learn why the previous methods are not doing so.
Questions
- Elaborating on data augmentation might be beneficial. For instance, details about the number of augmented instances introduced and the ideal quantity would be insightful.
- The term datastore can be confusing when discussing random data augmentation since this word is also used when describing the semantic key-value pair in semantic data augmentation.
Thank you for your insightful suggestions! According to your comments, below we clarify the misunderstandings, and conduct more experiments.
Comment 1: It's unclear whether shortcuts always exist in the text. If they aren’t, what would the model predict?
We argue that shortcuts almost always exist in texts, for example entity bias [1,2] and statistical bias [3]. Previous studies [4,5,6,7] on shortcuts in texts also support our viewpoint. Moreover, from the causal graph we constructed in Appendix D, we observe that since there always exists a backdoor path S↔Z→Y which makes the shortcut Z and the label Y spuriously correlated, we conclude that shortcuts always exist in the text. Finally, one could assume that some datasets have been processed so that they contain no significant shortcuts, enabling the model to learn the true causal relationships within the data and function as a debiased model. However, this assumption is overly idealistic when applied to real-world datasets.
Comment 2: The author asserts that unsupervised methods frequently identify shortcuts as inefficient rationales and provides examples. Yet, the reader would be curious about how often that happens.
As in the answer to comment 1, shortcuts commonly occur in text. Meanwhile, [5] has shown that the vanilla rationalization criterion in unsupervised methods is prone to highlighting spurious correlations (shortcuts) between the input features and the output as valid explanations. Therefore, we argue that unsupervised methods will frequently identify shortcuts as inefficient rationales.
Comment 3: If the model already yields good results, why do we have to give up shortcuts? I suggest a stronger argument for why finding a good rationale is significant, such as that it can be helpful for other reasoning tasks.
This is a valuable question! If the shortcut helps us get better results, do we still need to remove it? Our answer is yes. Specifically, when faced with test data drawn from the same distribution as the training data, the shortcut helps us obtain accurate results. However, since shortcuts do not reflect the real causal relationship between the data and labels, when the test data is distributed differently from the training data (i.e., the shortcut does not exist in the test data), a model that relies on shortcuts will have poor prediction results. For example, in the following table (Table 3 in the original manuscript), Vanilla Un-RAT achieves promising results when we test it on the identically distributed dataset IMDB, but it achieves poor predictive results on the out-of-distribution data SST-2. Thus, we can conclude that when the model relies on shortcuts in the data for training and prediction, the model fails in the face of new data distributions, further limiting its application.
| Methods | IMDB(ID) | SST-2(OOD) |
|---|---|---|
| Vanilla Un-RAT | 85.3±0.2 | 45.3±8.1 |
| SSR_unif | 90.3±0.2 | 79.4±0.3 |
| SSR_virt | 89.9±0.2 | 79.9±0.4 |
Furthermore, finding a good rationale is significant:
- Extracting a good rationale can help us avoid extracting shortcuts as rationales, which can help the model focus on the real causal relationship between the data and labels during training and prediction. It can increase the accuracy and robustness of the model when facing the test data with different distributions.
- Extracting a good rationale can improve the interpretability of the prediction results, which can be applied in some high-risk domains, such as law [4].
- Rationalization can help other reasoning tasks. For example, rationalization is used as a knowledge extractor in [8] to extract the most responsible features for the predictions. After that, [8] expands the extractive rationales using commonsense resources to generate natural language explanations and give the final prediction.
References
[1] Yongchun Zhu, Qiang Sheng, Juan Cao, Shuokai Li, Danding Wang, Fuzhen Zhuang. Generalizing to the Future: Mitigating Entity Bias in Fake News Detection. In SIGIR2022.
[2] Fei Wang, Wenjie Mo, Yiwei Wang, Wenxuan Zhou, and Muhao Chen. A Causal View of Entity Bias in (Large) Language Models. arXiv, 2023.
[3] Xiaobao Wu, Chunping Li, Yishu Miao. Discovering Topics in Long-tailed Corpora with Causal Intervention. In ACL2021.
[4] Linan Yue, Qi Liu, Li Wang, Yanqing An, Yichao Du, Zhenya Huang. Interventional Rationalization. In EMNLP2023.
[5] Shiyu Chang, Yang Zhang, Mo Yu and Tommi Jaakkola. Invariant Rationalization In ICML2020.
[6] Wei Liu and Jun Wang, Haozhao Wang, Ruixuan Li, Zhiying Deng, YuanKai Zhang, and Yang Qiu. D-Separation for Causal Self-Explanation. In NeurIPS2023.
[7] Du M, Manjunatha V, Jain R, et al. Towards Interpreting and Mitigating Shortcut Learning Behavior of NLU models. In NAACL2021.
[8] Bodhisattwa Prasad Majumder, Oana-Maria Camburu, Thomas Lukasiewicz, Julian McAuley. Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations. In ICML2022.
Comment 4: Instead of writing an algorithm for the semantic data augmentation, the reader might be more interested in seeing how specifically you do the semantically similar word retrieval in Appendix A.
Given a word $w$ with its semantic representation $h_w$, and a database to be retrieved from, our goal is to retrieve the word in the database that is semantically closest to $w$.
In the retrieval process, we first calculate the L2 distance between the semantic representation of the query word and that of each word in the database. We consider that the smaller the L2 distance, the semantically closer one input is to another. Finally, we choose the word with the smallest L2 distance to the query word as the semantically similar word. In the specific code implementation, we employ FAISS [1] to achieve this retrieval goal.
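A minimal sketch of this nearest-neighbour lookup with FAISS; the dimension and the random vectors standing in for real word representations are placeholders:

```python
import numpy as np
import faiss

d = 768                                             # representation dimension
keys = np.random.rand(10000, d).astype("float32")   # datastore word vectors
index = faiss.IndexFlatL2(d)                        # exact L2 nearest-neighbour index
index.add(keys)

query = np.random.rand(1, d).astype("float32")      # vector of the query word
distances, ids = index.search(query, 1)             # id of the semantically closest word
```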
Comment 5: The term datastore can be confusing when discussing random data augmentation, since this word is also used when describing the semantic key-value pairs in semantic data augmentation.
Thanks for your suggestion! We revise the description of the Random Data Augmentation method to the following:
As we have identified the potential shortcuts, we can replace these shortcut tokens with other tokens sampled randomly from the datastore, which contains all tokens of the labeled and unlabeled training data.
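For concreteness, a minimal sketch of this replacement step, with all names ours rather than from the paper:

```python
import random

def random_da(tokens, shortcut_positions, datastore_tokens, seed=0):
    """Replace the identified shortcut tokens with tokens drawn
    uniformly at random from the datastore."""
    rng = random.Random(seed)
    augmented = list(tokens)
    for i in shortcut_positions:
        augmented[i] = rng.choice(datastore_tokens)
    return augmented
```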
Comment 6: It is hard to understand why sharing selector parameters in both supervised/unsupervised phases is not the default setting. Since the claim is that this is a specific setting in the proposed approach, it would be important to learn why the previous methods are not doing so.
We emphasize the sharing of selector parameters in the supervised/unsupervised phases in order to allow unsupervised rationalization and supervised rationalization to better learn from each other. The same setup is used for Vanilla Semi-RAT and WSEE in our replication of them.
Comment 7: Elaborating on data augmentation might be beneficial. For instance, details about the number of augmented instances introduced and the ideal quantity would be insightful.
In Table 1, for random data augmentation, we augment 25% of the data in the original dataset. For example, in Movie, there are 1600 training data. Based on this, we augment 400 instances (i.e., 1600*0.25=400). Similarly, for semantic data augmentation, we augment the data with 25% of the original dataset. In mixed data augmentation, we mix the data obtained from random DA with the data obtained from semantic DA, which is equivalent to adding 50% of the data of the original dataset.
In Figure 4, we test the performance of SSR with data augmentation on the MultiRC dataset. In this experiment, we set the percentage of data augmentation to 5%, 10%, 15% and 25% respectively. We find that SSR with data augmentation performs best when the data augmentation percentage is 15%.
Then, we additionally investigate SSR with data augmentation on the Evidence Inference dataset. From the experimental results, we observe that SSR with data augmentation performs best on the Evidence Inference dataset when the data augmentation percentage is 25%.
| Method | Metric | 0% | 5% | 10% | 15% | 20% | 25% |
|---|---|---|---|---|---|---|---|
| + random DA | Task | 46.8±0.3 | 47.3±0.5 | 47.7±0.4 | 48.5±0.6 | 49.2±0.4 | 46.0±0.1 |
| + random DA | Token-F1 | 26.8±0.2 | 28.5±0.3 | 29.0±0.5 | 30.3±0.2 | 30.7±0.7 | 33.1±0.2 |
| + semantic DA | Task | 46.8±0.3 | 46.6±0.1 | 47.1±0.3 | 47.6±0.7 | 48.0±0.7 | 48.7±0.2 |
| + semantic DA | Token-F1 | 26.8±0.2 | 28.9±0.4 | 29.3±0.5 | 31.9±0.4 | 32.1±0.5 | 33.5±0.4 |
From the above experiments, it is found that the optimal data augmentation percentage is different for different datasets. Since our data augmentation method relies on human labeled rationales and most semi-supervised methods assume that the labeled rationales percentage is 25%, we also use 25% as our percentage in our practical implementation.
References
[1] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. In IEEE Transactions on Big Data 2019.
We sincerely appreciate all reviewers' time and efforts in reviewing our paper. We would like to thank all reviewers for providing many insightful and valuable suggestions. Here is a summary of our updates:
Clarification: We justify the motivation (2gye, cE5E) and technical details (cE5E, LMAN), and add key algorithms for better clarity (LMAN).
More Experiments: We add new baselines, datasets, and metrics (LMAN) to validate the effectiveness of SSR. Also, we conduct a human evaluation of the extracted rationales (LMAN).
We've highlighted the updates (blue text) in the revised manuscript. We hope our responses clarify all your confusion and alleviate all concerns. We thank all reviewers for their time again. Looking forward to your reply!
Hey bro! First of all, congratulations on such a good job!
I noticed that you employed the shared encoder trick for the selector and predictor. This technique is non-trivial and originates from our prior work, specifically the paper titled "FR." In our publication, we theoretically analyzed how the encoder for the selector serves as a beneficial regularizer for the predictor. Our experiments demonstrated a significant enhancement in the performance of unsupervised rationale extraction with this particular trick, a finding further corroborated by experiments in another ICML 2023 paper (https://arxiv.org/abs/2306.14115).
While we appreciate your citation of our work, it seems that there is a lack of substantial discussion regarding this method in your paper. We would like to kindly request that you consider incorporating more details about the origin and effectiveness of this approach in your discussion. Additionally, it would be valuable if you could include an ablation study to showcase the impact on results when this trick is omitted.
We believe that such additions would contribute to a more comprehensive understanding of the methodology and its implications. Thank you for your attention to this matter, and we look forward to the possibility of furthering the discourse on this shared research.
Thanks for your reply. I now know that your implemented Vanilla Semi-RAT also uses this setup, so I think my question has been answered. However, I would like to take this opportunity to point out the special contributions of our work "FR".
While I acknowledge that ST-RAT also utilizes a shared encoder, it should be noted that it does not fall under the category of rationalization. In the realm of rationalization, the classifier's input should be a rationale, which ST-RAT does not adhere to. In fact, ST-RAT cannot even be classified in the XAI domain. Its classification and rationale extraction are two entirely independent tasks. One could interpret ST-RAT as performing two separate tasks—text classification (although NLU in the original paper) and NER—simultaneously through the concept of multi-task learning, leveraging two sets of annotated data for training (classification and NER). The performance improvement resulting from multi-task learning in this case is a straightforward consequence of simultaneously utilizing labeled data for both tasks.
In (unsupervised) selective rationalization, the extraction of rationales and the prediction of the classifier form a pipeline task. The only available labels are for classification, and there has been no prior analysis on how sharing encoders in a pipeline (beyond the scope of the rationalization pipeline) enhances performance or the mechanisms by which performance improvement occurs. Clearly, the pipeline task is distinct from multi-task learning.
We would be grateful if you could discuss its effectiveness when you employ this trick. For instance, you might consider phrasing such as "Liu et al. (2022) developed a unified encoder to induce a better predictor by accessing valuable information blocked by the selector," as seen in https://arxiv.org/abs/2306.14115.
Thank you for your comment! We are glad that our answer addresses your question! The shared encoder is an effective trick in rationalization. We will discuss the effectiveness of your work "FR" [1] further in the future.
References
[1] Wei Liu, Haozhao Wang, Jun Wang, Ruixuan Li, Chao Yue, YuanKai Zhang. FR: Folded rationalization with a unified encoder. In NeurIPS2022.
Thanks for your public comment. The shared encoder trick for the selector and predictor is a classical operation, and a similar empirical trick was used in ST-RAT [2] before FR [1]. Since our method is a semi-supervised method and ST-RAT is our important semi-supervised baseline, we also use the shared encoder in SSR. Similarly, the same setup is used for Vanilla Semi-RAT and WSEE in the replication of them.
Finally, we argue that the shared encoder is a common trick, especially after you theoretically proved that it works. However, this is not the central point of this paper. Our goal in this paper is to address the problem that models may exploit shortcuts in the data to make predictions and compose rationales. To solve this problem, we propose a shortcuts-fused method that incorporates shortcuts explicitly via a shortcut discovery approach.
References
[1] Wei Liu, Haozhao Wang, Jun Wang, Ruixuan Li, Chao Yue, YuanKai Zhang. FR: Folded rationalization with a unified encoder. In NeurIPS2022.
[2] Meghana Moorthy Bhat, Alessandro Sordoni, and Subhabrata Mukherjee. Self-training with few-shot rationalization: Teacher explanations aid student in few-shot nlu. In EMNLP2021.
Dear Reviewers,
Thanks again for your insightful comments and valuable suggestions, which are of great help to improve our work. We sincerely hope that our rebuttal has properly addressed your concerns. If not, please let us know your further concerns, and we will continue actively responding to your comments and improving our paper. We are looking forward to your further responses and comments.
Best,
Authors
Dear reviewers,
Thanks again for your valuable and constructive suggestions! We have revised our paper following your suggestions and added more experiments to make our contributions clearer. Do you have further comments on our paper and responses? If our responses have addressed your concerns, could you please kindly consider increasing the overall score?
Best,
Authors
Dear Reviewers and Authors,
The discussion phase ends soon. Please check all the comments, questions, and responses and react appropriately.
Thank you!
Best, AC for Paper #9216
Hi, sorry to bother you, but I have some questions about the experimental details. I noticed in your paper that you used INVRAT as a baseline and got considerably better results than several other methods. This method involves invariant learning, which requires partitioning the dataset into two different environments. In the field of invariant learning, the way to partition the environments depends on the dataset, and different datasets may require different partitioning methods. This detail is critically important. However, you seem to have left this detail out. Would you be willing to share how you partitioned the environments for each dataset?
I see that you have submitted the code for your own method. If possible, could you also share the code for your implementation of INVRAT? It would help a lot for anyone wanting to follow your work.
For the following reasons, I believe the results in this paper are unreasonable:
The authors employed experimental settings significantly different from those of previous papers, yet directly copied and compared against the results reported in those papers. This study focuses on semi-supervised tasks and compares four semi-supervised methods, but at least two of those comparisons are highly unfair. The authors copied the results of IB and ST-RAT from their original papers despite using completely different sparsity levels: this paper applies sparsity levels of {0.1, 0.2, 0.2, 0.08} on the four datasets, which look like carefully chosen values, whereas IB used {0.4, 0.25, 0.2, 0.2}. The significant difference in the proportion of selected rationale tokens makes the comparison unfair.
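For context, the sparsity level here is the target fraction of tokens a model may select as its rationale. A common way to enforce it is a penalty of roughly the following form (a sketch under our own assumptions, not the exact loss of SSR or IB):

```python
import torch


def sparsity_penalty(mask: torch.Tensor, alpha: float, lam: float = 1.0) -> torch.Tensor:
    # mask: (B, T) soft token-selection probabilities in [0, 1].
    # Penalize deviation of the selected fraction from the target alpha.
    # A smaller alpha forces much shorter rationales, which is why results
    # obtained under different alpha values are not directly comparable.
    selected_fraction = mask.mean(dim=1)
    return lam * (selected_fraction - alpha).abs().mean()
```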
The authors claim to outperform all semi-supervised methods, but the only semi-supervised method they compare against under reasonable conditions is WSEE, a rather old method (EMNLP 2020). They also claim superiority over advanced unsupervised methods, yet under the authors' own problem definition, any unsupervised method can be trained in a semi-supervised manner; the authors did not compare against these advanced methods under fair conditions.
For the above reasons, only two weak baselines, Vanilla Semi-RAT and WSEE, were fairly compared against SSR.
Most importantly, I suspect fraudulent behavior on the part of the authors. They claim to have implemented INVRAT (a method based on invariant learning) on four real-world datasets, which is practically impossible. Implementing INVRAT requires splitting the dataset into two environments with different degrees of spurious correlation. In the original INVRAT paper, the authors used multi-aspect classification datasets, dividing the environments based on the degree of correlation between the labels of different aspects, or used synthetic datasets already split into environments with varying degrees of spurious correlation. The datasets in this paper meet neither condition and cannot be divided into different environments. Furthermore, the authors provided no implementation details for their version of INVRAT. The reported results forced me to examine the details of the INVRAT implementation over and over, and led me to conclude that implementing INVRAT on these datasets was never possible. I therefore suspect that the authors fabricated the INVRAT results.
I would strongly urge the author to disclose all details and avoid any attempt to mislead readers unfamiliar with this area of research, which could lead them to waste research time on incorrect conclusions.
Hi,
I've been studying rationalization lately as well, but I don't agree with your comments on INVRAT. From the original INVRAT paper, I find that the environment partitioning on the beer dataset is achieved through linear prediction error. This suggests that environments can be inferred from the data in some way. I can share the method I use to implement INVRAT:
1. Train an RNP and obtain the non-rationale representation of all samples (the non-rationale can be used for environment partitioning [1, 2]).
2. Cluster all the non-rationale representations using k-means (see [2] for the use of clustering for environment inference).
3. Assume each cluster of samples belongs to a single environment.
Of course, there are many methods for environment inference, and the one above is relatively simple; a minimal code sketch of it is given below. You could also try methods that infer environments via variational inference, such as [3, 4].
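A minimal sketch of the three steps above (hypothetical code: `get_nonrationale_embedding` stands in for whatever pooled representation of the tokens left unselected by a trained RNP is extracted in step 1):

```python
import numpy as np
from sklearn.cluster import KMeans


def infer_environments(texts, get_nonrationale_embedding, n_envs: int = 2):
    # Step 1: non-rationale representation of every sample, taken from
    # a pre-trained RNP selector (assumed available).
    reps = np.stack([get_nonrationale_embedding(t) for t in texts])
    # Step 2: cluster the non-rationale representations with k-means.
    km = KMeans(n_clusters=n_envs, n_init=10, random_state=0).fit(reps)
    # Step 3: treat each cluster as one environment; the returned ids can
    # then feed an invariance penalty such as INVRAT's.
    return km.labels_
```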
Finally, I am also currently submitting to ICLR. I notice that your first comment was posted on 06 Dec 2023, which falls within the period when authors are blocked from replying: according to ICLR rules, from 05 Dec 2023 authors cannot upload comments to OpenReview. In my experience, the authors will only be able to reply to your comments on OpenReview after the decision on the paper is out.
References
[1] Xiaoyu Tan, Yong Lin, Shengyu Zhu, Chao Qu, Xihe Qiu, Yinghui Xu, Peng Cui, Yuan Qi. Provably invariant learning without domain information. In ICML2023.
[2] Haoyang Li, Ziwei Zhang, Xin Wang, Wenwu Zhu. Learning invariant graph representations for out-of-distribution generalization. In NeurIPS2022.
[3] Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, Junchi Yan. Learning substructure invariance for out-of-distribution molecular representations. In NeurIPS2022.
[4] Chenxiao Yang, Qitian Wu, Qingsong Wen, Zhiqiang Zhou, Liang Sun, Junchi Yan. Towards out-of-distribution sequential event prediction: a causal treatment. In NeurIPS2022.
It seems that you are still attempting to fool readers. You claim that on the beer dataset environments can be obtained via "linear prediction error", and conclude that "the environments can be inferred from the data in some way", but you never mention how this "linear prediction error" is derived.
The results of this paper prompted me to thoroughly review the implementation details of INVRAT. In the beer dataset, each data point contains descriptions of multiple aspects, each with its own label. INVRAT assumes that spurious correlations in this dataset arise from correlations between the different aspects. It therefore trains a predictor that predicts the target aspect's labels from the labels of the other aspects, and partitions the data into environments according to the resulting linear prediction errors. Evidently, this approach is only feasible for datasets with multiple labels. As I said in my previous comments, the datasets used in this paper (SSR) do not meet this condition.
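To make the partitioning scheme concrete, here is a hedged sketch of the multi-aspect recipe described above (illustrative code, not the INVRAT reference implementation; the median split is a simplification):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def split_by_linear_error(other_aspect_labels: np.ndarray, target_labels: np.ndarray):
    # other_aspect_labels: (N, K) labels of the non-target aspects.
    # target_labels:       (N,)   labels of the target aspect.
    lin = LinearRegression().fit(other_aspect_labels, target_labels)
    err = np.abs(lin.predict(other_aspect_labels) - target_labels)
    # Samples whose target label is well predicted from the other aspects
    # exhibit strong inter-aspect correlation (the suspected shortcut);
    # splitting on the error yields two environments with different
    # degrees of spurious correlation.
    return (err > np.median(err)).astype(int)  # environment id (0 or 1) per sample
```

Note that this only works when each sample carries labels for multiple aspects, which is exactly the condition at issue here.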
Regarding the method you mentioned, it is fundamentally flawed. Firstly, a fundamental consensus in invariant learning is that it is almost impossible to partition a mixed dataset into different environments without additional information. Even where specific techniques exist, they rest on strong assumptions that are neither provided by INVRAT nor mentioned in this paper. Additionally, the first step of your proposed method is fraught with difficulties: how do you obtain a non-rationale representation? I assume you mean the portion left unselected by the selector, but how can you determine that the selected part is the true causal rationale rather than spurious correlation, and that the unselected part is truly non-rationale?
In any case, the authors of this paper did not mention how they implemented INVRAT in the paper, suggesting that they considered implementation details trivial and simply reproduced it on different datasets using the methods provided by INVRAT. However, such reproduction seems impossible.
It seems that your view of INVRAT's implementation is too narrow. The reason potential environments can be inferred via linear prediction errors is that, in the beer dataset, each data point contains descriptions of multiple aspects of the beer, each with its own label. This is also consistent with your statement that "a fundamental consensus is that it is almost impossible to partition a mixed dataset into different environments without additional information". I agree with you on all of the above.
However, I would point out that [1] has shown it is possible to train an environment-partitioning policy based on attributes that are independent of the targets, and the non-rationale (the portion left unselected by the selector) is independent of the targets. [2] takes a similar approach to inferring environments. Regarding the first step of my suggestion (train an RNP and obtain the non-rationale representation of all samples): if you are familiar with the rationalization literature, you know that RNP is trained by empirical risk minimization. Under this setup, the rationale extracted by RNP will inevitably contain information about both the true rationale and the shortcut, while the non-rationale will contain mostly task-irrelevant information; a well-trained RNP can basically achieve this. That is my basic assumption for obtaining the non-rationale. I cannot guarantee that the extracted non-rationale contains none of the rationale's information, but you likewise cannot guarantee that the environments obtained via linear prediction errors are accurate. Every model faces some loss of accuracy.
I think you are tied too closely to the original implementation of INVRAT. INVRAT is an excellent paper, but it was published in 2020, when research on invariant learning still largely assumed that the environments were known. For example, the original INVRAT likewise constructed an artificial IMDB dataset (where the environments were known) to validate its effect. As research has progressed, however, many methods for inferring environments have emerged [1, 2], which allows INVRAT to work on a wider range of datasets.
Per ICLR policy, the authors cannot respond to our comments at this time, which is unfortunate. I suggest that the authors publish their INVRAT replication code after the anonymity period ends.
Finally, I still insist that you can reproduce INVRAT in the way I mentioned.
Your views on the non-rationale suggest a fundamental lack of knowledge of probability and statistics. Stop using misleading references to refute me: if you are so confident in your approach, why not post your experiments and code directly?
If you are representing the authors, I am willing to provide additional evidence. If not, engaging in theoretical discussions with you would be a futile endeavor.
The paper addresses the problem of eliminating spurious correlations in the rationalization of text classification. Rationalization is the task of identifying the small part of the input that is relevant to the decision made. The authors introduce a new method that identifies so-called shortcuts, which may constitute the "spurious" part of a rationale.
The problem considered is certainly important, and the proposed method appears to be an interesting contribution. Nevertheless, the reviewers raised many critical remarks concerning the writing, the clarity of presentation, and the experimental studies. The resubmitted version of the paper corrects many of the issues present in the initial submission, with better motivation and problem definition and extended empirical studies.
Why not a higher score
The reviewers converged, all rating the paper a 6, but no one is championing it.
Why not a lower score
The paper is borderline, with additional issues raised in public comments. Nevertheless, the rebuttal and the revised version of the paper brought all scores to 6.
Accept (poster)
Hello everyone, I have found some compelling evidence of fraudulent behaviour by the authors of this paper, which interested parties can view at https://github.com/yuelinan/Codes-of-Inter-RAT/issues/1.
Here is a brief summary of the evidence. I found that another paper by the authors, "Interventional Rationalization", also reports INVRAT results on datasets that do not meet the necessary conditions for environment partitioning, and likewise lacks any description of how the environments were partitioned. I consulted the authors, who did not provide a specific method for partitioning environments but merely stated that they used the approach from paper [1], "Learning Invariant Graph Representations for Out-of-Distribution Generalization" (https://openreview.net/forum?id=acKK8MQe2xc).
Recently, I discovered that "Interventional Rationalization" was first submitted to ICLR in September 2022, whereas paper [1] was published in November 2022, which clearly exposes the flaw in the authors' defense.
Dear readers,
It's clear that this commenter is making no sense. If you are interested, you can see from our conversation at https://github.com/yuelinan/Codes-of-Inter-RAT/issues/1 that I have responded to all of his questions and have open-sourced the code.
Furthermore, in response to his query, I clearly stated: "I would like to state that it is my conclusion (since the publication of NeurIPS 2023) to use clustering methods to classify environments and to implement INVRAT based on this. Until then, I used my empirical clustering methods to classify environments. So when you ask a question in December 2023, am I going to use my 2022 experience to answer you? I think it is more responsible to answer you using the current conclusions I have already reached."
Besides, during the ICLR review, this commenter intentionally posted at a stage when the authors were unable to respond to public comments, in an attempt to influence the AC's decision on the paper. I consider this malicious harassment by a peer.
I'll keep publicizing the discussion on GitHub.