PaperHub
Score: 6.1 / 10
ICML 2025 · Poster · 4 reviewers
Ratings: 3, 3, 4, 3 (lowest 3, highest 4, std 0.4)

ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24

Abstract

Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with $[MASK]$ tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of $[MASK]$ tokens on MLMs. Analytical studies show that masking tokens can introduce the ***corrupted semantics*** problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $[MASK]$ tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement, and effectively reduces the semantic multimodality commonly observed in MLMs.
Keywords
Masked Language Models, Pre-trained Models, Language Models, Text Modeling, SMILES Modeling

Reviews and Discussion

Official Review (Rating: 3)

This paper investigates the role of mask tokens in Masked Language Models (MLMs). The authors first provide an empirical examination of the effect of the mask token through two perspectives: corrupted tokens and unreal tokens. Additionally, the authors propose a new algorithm, EXLM, to further enhance performance. The experimental evaluation demonstrates the effectiveness of the proposed method.

Questions for Authors

  1. As indicated in Section 3.3 (Core Impact of Corrupted Semantics: Multimodality), Line 212, the corrupted context may correspond to multiple possible semantics. Can authors explain how EXLM addresses this issue?

Claims and Evidence

  1. The authors claim that both unreal tokens and corrupted tokens arising from the mask token can affect performance. While the impact of corrupted semantics has been studied in previous literature (Line 65), the effect of unreal tokens does not appear to be significant for model performance (Fig. 3). This, in turn, underscores the importance of prior work. Given this, what is the authors' main contribution in analyzing unreal tokens? Can the authors elaborate on this?

  2. The connection between Section 3 and Section 4 is weak. The proposed method EXLM seems not motivated by the analysis given in Section 3. What is the motivation behind the proposed method, EXLM? At the end of Section 3, the authors appear to focus on studying the optimal mask ratio, but this aspect is not discussed in Section 4.

Methods and Evaluation Criteria

  1. The baseline approaches seem too weak as they were proposed a few years ago.

  2. The proposed method, EXLM, incorporates 2D RoPE, which further enhances model performance, as shown in Table 2. Therefore, the performance gain is not solely attributable to EXLM, making it difficult to assess the true effectiveness of the approach.

Theoretical Claims

NA

Experimental Design and Analyses

No Question

Supplementary Material

I didn't review the supplementary material.

Relation to Broader Scientific Literature

Please see Claims And Evidence

Essential References Not Discussed

NA

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

We appreciate the insightful suggestions from Reviewer j1MR. In the following sections, we will address all your concerns. These discussions will also be incorporated into the final camera-ready version of the paper. Any further comments are welcome!


Q1: What is the authors' main contribution in analyzing unreal tokens?

Thank you for the question. The repeated MLM experiments are primarily designed to analyze the key factors that influence the performance of MLM models and the underlying reasons. Toward this goal, this section presents the following two main contributions:

  1. We demonstrate that the issue of unreal tokens, which has received significant attention in previous works, is not the primary factor affecting MLM performance.
  2. We show that corrupted semantics and the resulting multimodality phenomenon are more critical in influencing the performance of MLMs.

In summary, this analytical section offers a new perspective for understanding MLMs by highlighting the previously overlooked importance of corrupted semantics. It also suggests that future developments of MLMs should focus more on addressing the problem of multimodality.


Q2: The connection between Section 3 and Section 4 is weak.

Thank you for the suggestion. We have revised Sections 3 and 4 to better highlight their connection. These changes will be included in the final camera-ready version. Specifically:

  1. The last paragraph of Section 3.3 has been updated to emphasize multimodality as a key factor affecting MLM performance. An effective model should thus be able to mitigate the impact of semantic multimodality.
  2. Section 4.1 has been reorganized to align with Section 3’s analysis. To address multimodality, ExLM introduces two mechanisms:
    • Intra-token Multimodality: Each missing token may have a diverse set of plausible candidates. Thus, we propose States Expansion to build a larger semantic space, enabling richer token predictions.
    • Inter-token Multimodality: The meaning of a token is intricately dependent on the semantics of its surrounding tokens. Thus, we introduce a Dependency Capture mechanism using a transition matrix to model semantic dependencies across tokens.

Q3: How does ExLM model multiple possible semantics?

We introduce a state expansion mechanism in ExLM, where each [MASK] token is associated with multiple expanded states, each representing a different possible semantic choice for that token.
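Purely for illustration, here is a minimal sketch (our own, not the authors' implementation) of what such an input-level expansion could look like. The placeholder state names `[M0]`, `[M1]`, ... and the `k_states` parameter are hypothetical; ExLM's actual expansion operates on hidden states inside the model.

```python
# Hedged sketch of a States Expansion step: each [MASK] position in the input
# is replaced by k_states expanded placeholder states, giving the encoder more
# capacity to represent alternative semantic choices for that position.
# Token names and the expansion factor are illustrative assumptions.

MASK = "[MASK]"

def expand_mask_states(tokens, k_states=4):
    """Expand every [MASK] token into k_states expanded states.

    Returns the expanded sequence and, for each original position, the span
    of expanded-state indices it maps to (useful later for alignment).
    """
    expanded, spans = [], []
    for tok in tokens:
        start = len(expanded)
        if tok == MASK:
            # One placeholder state per possible semantic choice.
            expanded.extend([f"[M{i}]" for i in range(k_states)])
        else:
            expanded.append(tok)
        spans.append((start, len(expanded)))
    return expanded, spans

if __name__ == "__main__":
    toks = ["The", MASK, "sat", "on", "the", MASK, "."]
    exp, spans = expand_mask_states(toks, k_states=3)
    print(exp)    # ['The', '[M0]', '[M1]', '[M2]', 'sat', 'on', 'the', '[M0]', '[M1]', '[M2]', '.']
    print(spans)  # [(0, 1), (1, 4), (4, 5), (5, 6), (6, 7), (7, 10), (10, 11)]
```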


Q4: The baseline approaches seem too weak as they were proposed a few years ago.

Thank you for the valuable suggestion. We have added comparisons with a more recent baseline model, 3ML_{self} [Liao et al., 2022], in Table 3. The results show that our ExLM model outperforms this stronger baseline, as detailed below:

| Model | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | RTE | MRPC | STS-B | MEAN |
|---|---|---|---|---|---|---|---|---|---|
| TUPE [Ke et al., 2021] | 86.2/86.2 | 91.3 | 92.2 | 93.3 | 63.6 | 73.6 | 89.9 | 89.2 | 84.9 |
| 3ML_{self} [Liao et al., 2022] | 84.8/84.9 | 91.1 | 91.4 | 92.9 | 61.4 | 81.2 | 89.2 | 90.1 | 85.2 |
| ExLM | 86.9/86.7 | 92.0 | 93.1 | 93.9 | 64.6 | 78.8 | 89.6 | 90.5 | 86.2 |

In addition, we further evaluated the ExLM model on the SuperGLUE benchmark [Wang et al., 2019]. The following table shows that ExLM also achieves significant improvements over several strong baselines, further confirming the effectiveness of our approach:

| Model | BoolQ (Acc.) | CB (Acc.) | COPA (Acc.) | MultiRC (F1) |
|---|---|---|---|---|
| BERT | 74.4 | 83.9 | 63.0 | 68.1 |
| Token Drop [Hou et al., 2022] | 73.0 | 83.9 | 64.0 | 67.7 |
| SCTD [Zhong et al., 2023] | 73.8 | 87.5 | 68.0 | 68.9 |
| ExLM | 76.7 | 88.0 | 69.1 | 71.3 |

Q5: The proposed method, ExLM, incorporates 2D RoPE, which further enhances model performance, making it difficult to assess the true effectiveness of the approach.

Thank you for pointing this out. In our experiments, all compared MLM baselines, including Vanilla MLM and Vanilla MLM++, adopt the same backbone architecture as ExLM, which uses RoPE as the default positional encoding. Therefore, the performance comparisons are made on a fair basis and can indeed assess the true effectiveness of the proposed method.

References:

[Hou et al., 2022] Hou, Le, et al. "Token Dropping for Efficient BERT Pretraining." ACL 2022.
[Zhong et al., 2023] Zhong, Qihuang, et al. "Revisiting Token Dropping Strategy in Efficient BERT Pretraining." ACL 2023.
[Wang et al., 2019] Wang, Alex, et al. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." NeurIPS 2019.
[Liao et al., 2022] Liao, Baohao, et al. "Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token." EMNLP 2022.
[Ke et al., 2021] Ke, Guolin, Di He, and Tie-Yan Liu. "Rethinking Positional Encoding in Language Pre-training." ICLR 2021.

Official Review (Rating: 3)

This paper studies the semantic corruption issue in masked language modeling (MLM). To motivate the study, the authors design an experiment (repeated MLM) to show the relationship and significance of the corrupted semantics caused by masking. ExLM is then proposed as a solution to this problem. In this LM, multiple [MASK] tokens are assigned to each masked position. The hidden states are then aligned using a transition matrix before the final losses are computed.

Questions for Authors

Please see above.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

  1. The first part of the experiments shows a clear performance gain in learning language representations.

  2. However, the second part of the experiments is evaluated on a selection of GLUE and SQuAD2 dev sets. In my experience, these dev sets can be biased relative to the official test sets, and results on them are unstable. Considering that the performance gains shown in Table 3 are not very pronounced (e.g., 64.3 to 64.6 on CoLA), I need to point out the potential risk that this part of the evaluation can be problematic.

  3. The repeated MLM results are reported on MNLI, a frequently used dataset for NLU. Even though the results are not pronounced, I trust them. The problem is that experimenting on MNLI alone is not sufficient to support the claims. Results on at least one more dataset (e.g., QQP, RTE) are needed to fully justify the claims (e.g., that corrupted semantics and unreal tokens matter).

Theoretical Claims

I did not check the proof in Appendix.

Experimental Design and Analyses

The experiments are run across a number of random seeds. They are fair.

Supplementary Material

No supplementary material provided.

Relation to Broader Scientific Literature

The method proposed in the paper can be useful for training MLMs.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Weakness:

  1. The paper states that the ExLM method is efficient in modeling semantics. I agree with that, but I also care about the computational efficiency of ExLM compared to MLM and other MLM variants, e.g., time cost and memory cost. This point is very important for the practical use of the method, and it does not seem to be discussed in the paper. If this information is already provided somewhere, please point me to it.

  2. The details of the repeated MLM experiments are not clear to me. The authors mention that they train a number of MLMs with different repetition times k and mask ratios p. The results in Figures 3-5 are on the MNLI dataset. I am not sure whether they pre-train the MLMs (e.g., on Wiki) and then fine-tune on MNLI, or just fine-tune on MNLI. In my opinion, these two approaches can lead to very different results in the analysis.

Other Comments or Suggestions

Please see above.

Author Response

We express our gratitude to Reviewer 3GXq for the suggestions. In the following, we address all your concerns regarding the evaluation of NLU tasks, additional experiments on more tasks, the efficiency analysis of ExLM, and the details of repeated MLM experiments. These discussions will also be incorporated into the final camera-ready version of the paper. We hope our responses help clarify the paper, and we welcome any further comments or suggestions!


Q1: These dev sets are potentially biased to the official test sets, and the results are unstable.

A: Thank you for your insightful comment. We have evaluated the stability of the ExLM model on the GLUE benchmark by reporting the standard deviation of performance across multiple runs. The results, summarized in the table below, demonstrate that ExLM exhibits good stability overall:

Standard Deviation on the GLUE Benchmark:

| Task | ExLM Performance |
|---|---|
| MNLI-m | 86.9 ± 0.14 |
| QQP | 92.0 ± 0.09 |
| QNLI | 93.1 ± 0.43 |
| SST-2 | 93.9 ± 0.50 |
| CoLA | 64.6 ± 0.81 |
| RTE | 78.8 ± 1.80 |
| MRPC | 89.6 ± 1.00 |
| STS-B | 90.5 ± 0.72 |
| MEAN | 86.2 |

Furthermore, to further verify the effectiveness of ExLM, we also evaluated its performance on the SuperGLUE benchmark [Wang et al., 2019]. As shown in the table below, ExLM significantly outperforms several strong baseline models, further supporting the validity of our proposed approach:

| Model | BoolQ (Acc.) | CB (Acc.) | COPA (Acc.) | MultiRC (F1) |
|---|---|---|---|---|
| BERT [Devlin et al., 2019] | 74.4 | 83.9 | 63.0 | 68.1 |
| Token Drop [Hou et al., 2022] | 73.0 | 83.9 | 64.0 | 67.7 |
| SCTD [Zhong et al., 2023] | 73.8 | 87.5 | 68.0 | 68.9 |
| ExLM | 76.7 | 88.0 | 69.1 | 71.3 |

Q2: At least one more dataset result (e.g. QQP, RTE) is needed to fully justify the claims.

A: Thank you for the suggestion. In fact, the repeated MLM results on QQP and RTE have already been included in Appendix F. As stated in Section 3.2 of the main text: “We also provide the results of this experiment on more tasks in Appendix F.”


Q3: The computation efficiency of ExLM compared to MLM and other MLM variants, e.g., time cost, memory cost.

A: Thank you for your comment. As mentioned in Section 5.3 of the main text, “Efficiency and entropy analysis of EXLM are provided in Appendix Q”. Specifically, we present a detailed discussion of ExLM's training time cost in Appendix Q.2, including a comparison with MLM models. Moreover, memory cost analysis is provided in Appendix I.4, which is also referenced in Section 4.3 of the main text: “The efficiency analysis is provided in Appendix I.”


Q4: The details of repeated MLM experiments are not clear. I am not sure whether they pre-train the MLM (e.g., on Wiki) and then fine-tune on MNLI, or just fine-tune on MNLI.

A: Thank you for your comment. In each group of repeated MLM experiments, we pre-trained separate MLM models using different values of k and p, and then fine-tuned and evaluated them on downstream tasks using the same settings. As stated in Appendix B.1 (Pre-training Configuration): “In the Repeated MLM experiment, we train a series of MLMs with different p and k parameters.”
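For readers unfamiliar with the setup, below is a hedged sketch of one plausible repeated-MLM corruption scheme, in which each token is duplicated k times and each copy is masked independently with probability p. This is only our reading of the reviewer's description (repetition times k, mask ratios p); the paper's exact recipe is the one specified in its Appendix B.1.

```python
import random

# Hedged sketch of a possible "repeated MLM" corruption scheme: every token is
# repeated k times and each copy is masked independently with probability p.
# This construction is an assumption for illustration, not the paper's
# verbatim procedure.

MASK = "[MASK]"

def repeated_mlm_corrupt(tokens, k=2, p=0.15, seed=0):
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        for _ in range(k):
            if rng.random() < p:
                corrupted.append(MASK)   # masked copy: prediction target
                targets.append(tok)
            else:
                corrupted.append(tok)    # visible copy
                targets.append(None)     # no loss on unmasked positions
    return corrupted, targets

# Under this scheme a token's semantics are fully hidden only when all k of
# its copies are masked, which happens with probability p**k, so k and p
# jointly control how much of the original context is actually corrupted.
```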


References:

[Devlin et al., 2019] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019.

[Hou et al., 2022] Hou, Le, et al. "Token Dropping for Efficient BERT Pretraining." ACL 2022.

[Zhong et al., 2023] Zhong, Qihuang, et al. "Revisiting Token Dropping Strategy in Efficient BERT Pretraining." ACL 2023.

[Wang et al., 2019] Wang, Alex, et al. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." NeurIPS 2019.

Reviewer Comment

Thanks for the authors' feedback and the additional numbers. I have updated my score accordingly.

However, I still have reservations about the potential gain of the method, which is mostly evaluated on the GLUE dev sets. There are too many papers doing the same thing and achieving similar performance gains. On the other hand, the performance gain is still not pronounced to me compared with, for example, simply training longer.

I am not sure whether the paper reaches the bar of ICML.

Author Comment

Thank you very much for your thoughtful feedback and valuable suggestions.

We would like to highlight that, in the original version of the paper, we have compared ExLM and the standard MLM under both SMILES modeling and natural language modeling settings. Furthermore, we have evaluated ExLM across two distinct types of benchmarks—text understanding (e.g., GLUE) and molecular property prediction (e.g., MoleculeNet). In these tasks, ExLM consistently outperforms vanilla MLM, demonstrating stable and significant improvements, which strongly supports the effectiveness of ExLM.

Looking forward, we plan to extend ExLM to protein sequence modeling. ExLM’s enhanced ability to capture long-range semantic dependencies (such as co-evolutionary signals) can be particularly beneficial in protein-related tasks. We also believe that this line of work—enhancing semantic dependency modeling—holds great potential in the broader AI for Science domain.

Once again, we truly appreciate your comments and suggestions. We hope you will continue to follow our work, and we would be happy to discuss further if you have any other questions.

Official Review (Rating: 4)

This paper presents a new way of utilizing [MASK] tokens in masked language models. It first analyzes the semantic aspects of the [MASK] token and then proposes ExLM, wherein multiple [MASK] tokens are introduced during pre-training. The authors then propose to utilize a learned transition matrix to obtain multiple semantic alignments, which makes the model semantically richer and lets it perform better than previously proposed models on a suite of tasks.

Questions for Authors

  1. "Corrupted semantics" is an obviously introduced noise in that masked language modeling (MLM) is a kind of denoising process. The authors claim that it can "...negatively affect the model's ability to learn accurate semantic representation...". It is not clear why this claim should be non-obvious. Having a large semantic corruption (large p^k) will obviously lead to performance degradation. Therefore, it's value is chosen carefully. But how is it a problem?

  2. In appendix H, why is the index of the last state chosen for the final loss? What if the last state is never reached? Is there a null transition? This needs more details.

  3. In the ablation, how is ExLM modeled without the transition matrix? Is it like CTC? Needs more details.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Not rigorously checked.

Experimental Design and Analyses

Yes. Experimental designs are sound.

Supplementary Material

I reviewed the proposed dynamic programming alignment strategy.

Relation to Broader Scientific Literature

The paper is of broad interest in understanding and building new language models.

Essential References Not Discussed

@inproceedings{namazifar2021warped, title={Warped language models for noise robust language understanding}, author={Namazifar, Mahdi and Tur, Gokhan and Hakkani-T{\"u}r, Dilek}, booktitle={2021 IEEE Spoken Language Technology Workshop (SLT)}, pages={981--988}, year={2021}, organization={IEEE} }

The above citation proposed different noising techniques for ASR error robustness. It is somewhat related to this work.

Other Strengths and Weaknesses

Strength:

The ExLM formulation is very interesting and exciting to me. The semantic richness acquired by the model as a result of this formulation can be very useful.

Weakness:

The paper can benefit from a more detailed explanation of the alignment algorithm. I feel like the explanation should be in the main text rather than in the appendix.

Also, the flow of ideas from section 3 to section 4 can be done in a much better way.

Other Comments or Suggestions

  1. It is unclear how the discussion on the impact of [MASK] (section 3) is essential for ExLM. The findings in section 3 seem fairly obvious. The fact that semantic corruption is the key factor for the learning process is hardly surprising (figure 4).

  2. The relation between section 3 and section 4 is not clear.

  3. The authors need to explain the alignment process, i.e., the forward-backward algorithm, in the main text.

Author Response

Thank you for your valuable suggestions. In the following, we provide detailed responses to all your concerns regarding the missing reference, model details, writing organization, and implementation issues. These discussions will also be incorporated into the final camera-ready version of our paper. We sincerely welcome any further feedback you may have.


Q1: Missing reference about the noising techniques for ASR error robustness

A: Thank you for your suggestion. We have now added the missing citation to the draft of our paper, and we will also ensure that it is properly included in the final camera-ready version.


Q2: The explanation of algorithm details should be in the main text

A: Thank you for the feedback. We have reorganized the structure of the paper accordingly. In earlier versions, some important methodological details were placed in the appendix due to space limitations. However, since the camera-ready version allows for an additional page, we have moved more details about ExLM—such as the dynamic programming procedure and the forward-backward algorithm—into the main body of the paper.


Q3: The relation between Section 3 and Section 4 is not clear

A: Thank you for the insightful comment. We have revised the writing of Section 4.1 to more clearly highlight the connection between Section 3 and Section 4. Specifically, the analysis in Section 3 demonstrates that multimodality is a key factor impacting the performance of MLMs. To enhance ExLM’s capacity for handling multimodality, we propose two mechanisms tailored to address two distinct levels of multimodality:

  1. Intra-token Multimodality: The possible candidates for each masked token can be highly diverse. To address this, we introduce a States Expansion mechanism to construct a larger semantic space, allowing the model to learn richer and more diverse semantic information.
  2. Inter-token Multimodality: The meaning of a token is intricately related to the meanings of surrounding tokens. To capture this, we incorporate a Dependency Capture mechanism, where a transition matrix is used to explicitly model semantic dependencies between different states.

This design clearly establishes the relevance of the findings in Section 3 to the architectural choices in Section 4. These improvements have been added to the camera-ready version.


Q4: The impact of corrupted semantics in MLM is obvious and well-managed, so it’s unclear why it’s framed as a significant issue

A: Thank you for raising this important point. We agree that "corrupted semantics" are intentionally introduced as a form of noise in the MLM denoising process, and such noise is a legitimate component of the training strategy. Indeed, our goal is not to eliminate this noise—on the contrary, we explicitly mention in the paper that a certain level of corrupted semantics can benefit training.

However, the central issue we aim to address is the unintended side effects this noise may have on model behavior. Specifically, the purpose of the denoising process is to train the model to recover clean data from noisy inputs. While noise facilitates learning, it also modifies the model's input distribution, which may interfere with the model’s ability to learn accurate representations.

Therefore, a well-designed denoising process should maintain a reasonable noise level while minimizing the adverse effects of noise on model behavior. Our study identifies one such undesirable effect—multimodality—arising from corrupted semantics. We aim to mitigate the impact of multimodality without eliminating the beneficial aspects of noise, which is the motivation behind our focus on corrupted semantics in this work.


Q5: Why is the index of the last state chosen for the final loss? What if the last state is never reached?

A: Thank you for the question. The reason we choose the index of the last state (i.e., [EOS]) for the final loss is that we enforce a structural constraint: every valid decoding path in the DAG must terminate at the last state. This design eliminates the need to enumerate all possible ending states when summing the probabilities of valid paths, thereby reducing computational complexity.

Moreover, the DAG in ExLM is constructed in a way that guarantees the reachability of the last state from all other nodes. Thus, the last state is always reachable, and the validity of the loss computation is ensured. We will include these details in the camera-ready version.
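To make the structural constraint concrete, below is a minimal sketch of a forward-style dynamic program over expanded states that sums the probabilities of all paths forced to terminate in the last state. The tensor names (`emit_logp`, `trans_logp`), their shapes, and the use of dense transitions are illustrative assumptions rather than the authors' exact formulation.

```python
import torch

def forward_logprob(emit_logp, trans_logp):
    """Log-probability of all paths through a state DAG that end in the last state.

    emit_logp:  (T, S) log-probability of the target token at step t being
                produced by expanded state s (illustrative shape).
    trans_logp: (S, S) log transition scores between expanded states, with
                unreachable edges set to -inf so only valid DAG paths count.
    Returns a scalar: log-sum over all valid paths ending at state S-1.
    """
    T, S = emit_logp.shape
    # alpha[s] = log-prob of all partial paths ending in state s after step t.
    alpha = torch.full((S,), float("-inf"))
    alpha[0] = emit_logp[0, 0]            # all paths start at the first state
    for t in range(1, T):
        # logsumexp over predecessors, then add the emission at step t.
        alpha = torch.logsumexp(alpha.unsqueeze(1) + trans_logp, dim=0) + emit_logp[t]
    return alpha[-1]                      # paths are forced to end at the last state

# Dropping the learned transition term (e.g. using uniform scores over allowed
# edges) reduces this to a CTC-style marginalisation over alignments, which is
# the transition-free ablation discussed in the response to Q6 below.
```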


Q6: How is ExLM modeled without the transition matrix? Is it like CTC?

A: Yes, your understanding is exactly correct. When the transition matrix is not used, ExLM is trained in a manner similar to CTC (Connectionist Temporal Classification), where the model considers all possible alignments between the expanded states and the target tokens. More implementation details on this setup will be added to the camera-ready version of the paper.

Reviewer Comment

Thanks for the clarifications and modifications. I will keep my score unchanged at 4 and suggest acceptance of the paper for its principled problem formulation and analysis.

Author Comment

Thank you very much for reading our rebuttal and for your encouraging words! We sincerely appreciate your insightful comments—they have been truly inspiring and instrumental in improving our paper. Your response means a great deal to us, and we warmly welcome any further communication.

Authors

Official Review (Rating: 3)

This work presents a deeper analysis of the effectiveness of the mask token in MLM pre-training. The authors argue that the conventional use of [MASK] tokens can lead to a "corrupted semantics problem", where the masked context may become ambiguous and admit multiple interpretations. To highlight this issue, the authors conduct a series of analytical studies, namely Repeated MLM, showing that the corrupted semantics problem is a more significant factor than the unreal token problem in affecting the performance of MLMs on downstream tasks. To address this challenge, the paper introduces a novel pre-trained model, ExLM. The key idea of ExLM is to expand the [MASK] tokens in the input context into multiple hidden states, which allows the model to cover a larger semantic space. By doing so, ExLM aims to increase the context capacity and capture richer semantic information, thereby reducing the ambiguity introduced by the masked tokens and the semantic multimodality in token prediction. Experiments on various NLU tasks demonstrate the effectiveness of ExLM.

Questions for Authors

Can you report the std of the results in Table 1 and Table 3? Some performance gains in these tables are quite marginal, so the std could help clarify the significance of the results.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes.

Experimental Design and Analyses

The soundness of the evaluation experiments raises some concerns. First, the authors do not explain why they choose SMILES as an evaluation task. The patterns in molecular data could be quite different from those in textual data, so there should be some explanation of why the authors choose this task and why the proposed method can work there (as the preliminary analyses are mainly conducted on MNLI, which is an NLU task). Second, I think more results on NLU tasks are needed to demonstrate the effectiveness of ExLM, because part of the performance gains shown in Table 3 are not significant enough.

Supplementary Material

No.

Relation to Broader Scientific Literature

This work contributes to the prior analysis about the effectiveness of mask tokens.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

The main weakness lies in the experiments.

Other Comments or Suggestions

No.

Author Response

Thank you for your thoughtful and constructive feedback. We sincerely appreciate your valuable suggestions. Below, we address all your concerns regarding the SMILES tasks, the standard deviation of model performance, and additional NLU tasks. The relevant results will be incorporated into the final camera-ready version of our paper. We hope our responses help clarify the work, and we welcome any further comments.


Q1: Why do they take SMILES as the evaluation task?

A: Thank you for the question. The reason for selecting SMILES as the evaluation task is that masked language modeling (MLM) on SMILES also encounters issues similar to those observed in natural language modeling. We conducted Repeated MLM experiments on the BACE task (a SMILES classification task), and the results are shown in the table below. These results exhibit a trend consistent with Repeated MLM experiments on natural language data, indicating that semantic corruption significantly affects SMILES modeling:

| k \ P | 2.25% | 15% | 38.7% | 62.2% | 78.9% |
|---|---|---|---|---|---|
| 1 | 57.6 | 73.5 | 70.4 | 68.6 | 63.0 |
| 2 | – | 69.5 | 72.7 | 69.6 | 65.3 |
| 4 | – | – | 70.1 | 72.0 | 67.2 |
| 8 | – | – | – | 69.9 | 70.7 |

These findings can be attributed to the presence of strong semantic dependencies in SMILES representations. For instance, forming valid molecular functional groups often requires precise coordination among multiple atoms, which conventional MLM fails to capture due to its inability to model such semantic associations effectively.


Q2: The standard deviation of the results in Table 1 and Table 3

A: Thank you for the helpful suggestion. We have computed the standard deviations of ExLM’s performance on both the GLUE and MoleculeNet benchmarks. The results are summarized in the tables below. Overall, ExLM demonstrates good stability across these benchmarks:

Standard deviation on MoleculeNet benchmark:

| Task | ExLM Performance |
|---|---|
| BACE | 79.6 ± 0.8 |
| BBBP | 72.8 ± 1.2 |
| Tox21 | 78.2 ± 0.1 |
| SIDER | 64.5 ± 0.7 |
| MUV | 78.8 ± 0.6 |
| ClinTox | 91.6 ± 1.8 |
| ToxCast | 66.9 ± 0.3 |
| Mean | 76.1 |

Standard deviation on GLUE benchmark:

| Task | ExLM Performance |
|---|---|
| MNLI-m | 86.9 ± 0.14 |
| QQP | 92.0 ± 0.09 |
| QNLI | 93.1 ± 0.43 |
| SST-2 | 93.9 ± 0.50 |
| CoLA | 64.6 ± 0.81 |
| RTE | 78.8 ± 1.80 |
| MRPC | 89.6 ± 1.00 |
| STS-B | 90.5 ± 0.72 |
| MEAN | 86.2 |

Q3: There should be more results on NLU tasks to demonstrate the effectiveness of ExLM

A: Thank you for the excellent suggestion. We further evaluated the performance of ExLM on the SuperGLUE benchmark [Wang et al., 2019]. As shown in the table below, ExLM outperforms several baseline models by a notable margin, which demonstrates the effectiveness of our method:

| Model | BoolQ (Acc.) | CB (Acc.) | COPA (Acc.) | MultiRC (F1) |
|---|---|---|---|---|
| BERT [Devlin et al., 2019] | 74.4 | 83.9 | 63.0 | 68.1 |
| Token Drop [Hou et al., 2022] | 73.0 | 83.9 | 64.0 | 67.7 |
| SCTD [Zhong et al., 2023] | 73.8 | 87.5 | 68.0 | 68.9 |
| ExLM | 76.7 | 88.0 | 69.1 | 71.3 |

We hope these additional results provide more clarity and demonstrate the strong performance of our proposed model. Please let us know if further information is needed!

References:

[Devlin et al., 2019] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019.

[Hou et al., 2022] Hou, Le, et al. "Token Dropping for Efficient BERT Pretraining." ACL 2022.

[Zhong et al., 2023] Zhong, Qihuang, et al. "Revisiting Token Dropping Strategy in Efficient BERT Pretraining." ACL 2023.

[Wang et al., 2019] Wang, Alex, et al. "SuperGLUE: A stickier benchmark for general-purpose language understanding systems." NeurIPS 2019.

Final Decision

This paper tackles a key limitation in masked language modeling (MLM): the semantic distortion introduced by [MASK] tokens during pretraining. The authors identify two sources of degradation, unreal tokens and corrupted semantics, and demonstrate via a novel "Repeated MLM" experiment that corrupted semantics is the more critical bottleneck.

To address this, they propose EXLM, a masked language model variant that:

  • Expands each [MASK] into multiple hidden states (States Expansion),
  • Models semantic dependencies between them via a transition matrix (Dependency Capture),
  • Uses a dynamic programming-based state alignment during training.

The model shows consistent gains across textual and molecular domains (SMILES), outperforming multiple baselines across several benchmarks.

Strength

  • Novelty: Strong analytical framing of [MASK] token effects, especially the disentangling of corrupted semantics from unreal tokens.
  • Applicability: Authors show value of this approach across multiple domains, enhancing generalizability.
  • Experiments: Systematic evaluations with ablation studies, entropy analysis, and case studies.

Weakness

  • Scalability: While the expansion of [MASK] tokens enriches semantics, it increases compute costs (multiple hidden states per mask), especially for large-scale models.
  • Transferability: The method is evaluated on standard benchmarks, but implications for multilingual or low-resource settings are unexplored.

Considering the feedback from the reviewers, the discussions, and the AC assessment, this paper is recommended as an Accept.