PaperHub
Rating: 6.3 / 10 (Poster, 4 reviewers; min 6, max 7, std dev 0.4)
Individual ratings: 6, 6, 6, 7
Confidence: 3.8 · Correctness: 3.3 · Contribution: 2.3 · Presentation: 2.8
NeurIPS 2024

Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

Existing DNA tokenization methods borrowed from NLP are poorly suited for DNA's unique properties. We introduce MxDNA, a framework that allows the model to learn an effective DNA tokenization strategy, showing robust performance on various tasks.

Abstract

Keywords
genomics, tokenization, foundation models

Reviews and Discussion

Review
Rating: 6

The paper proposes an alternative method for tokenization in genomic foundation models. Existing approaches adopt methods from natural language processing (NLP) and apply those to tokenize genomic sequences. While these tokenization methods have been validated by human knowledge, there is no basis for their use as is for tokenization in genomics.

The proposed approach, MxDNA, attempts to learn a tokenization strategy with three properties – (a) discontinuity (b) overlapping tokens (c) ambiguity. These are considered to be inherent to genomic sequences, and consequently, an appropriate tokenization strategy should reflect these properties. The tokenization module comprises basic unit recognition and assembly, and forms one block of the entire encoder pipeline. It is sandwiched between transformer encoder blocks to learn global relationships, and transformer encoder blocks to refine the learned token embeddings.

The authors close out the paper by comparing the performance of MxDNA against existing foundation models, including DNABERT, DNABERT-2, Nucleotide Transformer and HyenaDNA, on the Genomic and Nucleotide Transformer benchmarks. The authors have also conducted ablation studies on the various tokenization schemes in the MxDNA framework, and the components of the tokenization scheme used in MxDNA.

Strengths

The key innovation in MxDNA is the learned tokenization scheme, the motivation for which is clear and logical – there is no reason to expect that NLP tokenization schemes should work well for genomic sequences. Moreover, the MxDNA tokenization method appears to exhibit desirable properties. Consequently, it is easy to understand the rationale behind the design choices made by the authors in this paper.

In addition to this, the authors have used a varied selection of recent and commonly-used genomic foundational models in their experiments for the purposes of benchmarking. Pretraining on data that is used to train some of these other models, and performing the experiments on existing benchmarking data, also lends greater credence to the results in the paper.

Weaknesses

The paper does not read well as a stand-alone manuscript. Many details are omitted from the main paper. The included descriptions of the components of MxDNA are also difficult to comprehend. This applies to the pseudocode in the appendix too; Algorithms 2 and 3 use many undefined/poorly defined function calls and hence do not measurably aid a reader in understanding the operations they are describing. Instead, the authors should look to write the pseudocode such that key aspects of the operations are apparent, as the reader can always look to their released code for a full implementation. More generally, an effort should be made to make the main paper self-contained, such that the reader can understand the overview of the MxDNA architecture and its building blocks, with the appendix being used to elaborate on the details of the various steps.

I am also unconvinced by the use of an ambiguous tokenization strategy, introduced via jitter noise in MxDNA. While I can understand the need for tokenization to depend on the context, this should mean that a fixed subsequence maps to different tokens in different contexts; a fixed sequence in a fixed context, on the other hand, should map to tokens in a fixed manner.

Lastly, given the compute demands of foundation models, the absence of any discussion on computational complexity (training time, FLOPs, etc.) is a little worrying. For instance, the authors of DNABERT-2 show that large models may have a similar number of FLOPs to their smaller counterparts.

Questions

Questions:

  1. Are the “discontinuous” and “overlapping” properties not somewhat contradictory, i.e., discontinuous tokens would not overlap? Perhaps, a better way of stating this is that the sequence tokens need not align with the genomic sequence.
  2. The standard deviations in the results are based on three experimental runs. Are these values statistically meaningful? This is important when trying to make substantive conclusions from the included results. In its current form, several data points in the paper fall within the 1-SD error bars of the top 2 results.
  3. Building upon Question 2, the standard deviations for MxDNA appear to be far lower than any of the competing foundation models. Why is this so? If anything, the noise in the tokenization of MxDNA should increase the variance in its performance, thereby leading to higher values of standard deviation.
  4. Do any of the tasks in either benchmark require nucleotide-level resolution? This may unfairly bias the overall results towards models that perform tokenization at the single-nucleotide level, instead of k-mers or BPE.

Suggestions:

  1. From the description of the pipeline in Section 3.3, it is unclear if the tokenization scheme is learned separately. My understanding is that this too is learned during the pretraining of MxDNA, but this is not apparent from the description.

  2. In the ablation study of Table 3, only k-mers of length 6 are used. It is possible, and in fact likely, that different values of k are appropriate for different tasks. Hence, a different value of k might lead to a higher value on the benchmarks than single nucleotide tokenization.

  3. Grammatical errors:

        a. Line 65: “only pretrained” → “only being pretrained”
        b. Line 141: “Ambiguous property” → “Ambiguity property”
        c. Line 611: “tokenizer used here are” → “tokenizer used here is”
    

Limitations

The authors highlight the limitations of MxDNA as follows:

  1. They make the claim that a better method for tokenization may help in the discovery of biologically meaningful units, but state that the evaluation of the tokens learned by MxDNA has not been biologically validated.
  2. They also state that the range of MxDNA is limited due to its use of quadratic self-attention.
Comment

S1: Tokenization Scheme Learning:

  • The tokenization scheme in MxDNA is indeed learned end-to-end with the model's other components during both pretraining and fine-tuning phases. This integrated learning approach is fundamental to the model's design and effectiveness. We will clarify this process in Section 3.3 to ensure it is apparent how integral the tokenization scheme is to the overall model architecture.

S2: Ablation Study on k-mer Length:

  • Our decision to use k-mers of length 6 was based on common practices and computational constraints. However, we acknowledge that different tasks might benefit from varying k-mer lengths. Thus, we provide results for k = 1, 3, 4, 6 using overlapping tokenization on the Nucleotide Transformer Benchmarks. Some representative results, together with the averages, are presented as follows:
| Tokenization Method | H3K4me1 | H3K9ac | H4ac | promoter_no_tata | Nucleotide Transformer Benchmarks Avg. |
| --- | --- | --- | --- | --- | --- |
| K = 1 | 51.66 | 60.63 | 59.25 | 97.04 | 75.01 |
| K = 3 | 49.57 | 61.21 | 60.16 | 96.99 | 74.27 |
| K = 4 | 48.71 | 59.51 | 59.10 | 97.12 | 74.24 |
| K = 6 | 50.11 | 64.70 | 56.49 | 96.84 | 74.35 |
| Max (3, 4, 6) | 50.11 | 64.70 | 60.16 | 97.12 | 75.33 |
  • The experimental results for k-mer lengths of 3, 4, and 6 reveal that different k-mer sizes exhibit distinct advantages across various tasks.
  • The Max (3, 4, 6) row is the best performance across the k-mer lengths (the maximum on each individual dataset). This supports your point that different tasks might benefit from varying k-mer lengths, and the maximum performance can be higher than single nucleotide tokenization (an illustrative sketch of the k-mer tokenization variants is given below).
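For readers unfamiliar with the two conventions, a minimal sketch of overlapping versus non-overlapping k-mer tokenization (our own illustration, not the authors' code):

```python
def kmer_tokenize(seq: str, k: int, overlapping: bool = True) -> list:
    """Split a DNA sequence into k-mer tokens.

    Overlapping k-mers slide by one nucleotide (DNABERT-style);
    non-overlapping k-mers slide by k.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

# kmer_tokenize("ACGTAC", 3)                    -> ['ACG', 'CGT', 'GTA', 'TAC']
# kmer_tokenize("ACGTAC", 3, overlapping=False) -> ['ACG', 'TAC']
```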

S3: Grammatical Corrections:

  • Thanks very much for reading our paper so carefully and pointing out the grammatical errors. We will correct them in the final version of the paper.

[r1] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.
[r2] Zhou, Zhihan, et al. "DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes." The Twelfth International Conference on Learning Representations.
[r3] Dalla-Torre, Hugo, et al. "The nucleotide transformer: Building and evaluating robust foundation models for human genomics." BioRxiv (2023): 2023-01.
[r4] Ji, Yanrong, et al. "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome." Bioinformatics 37.15 (2021): 2112-2120.
[r5] Nguyen, Eric, et al. "Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution." Advances in neural information processing systems 36 (2024).
[r6] Vu, Mai Ha, et al. "Linguistically inspired roadmap for building biologically reliable protein language models." Nature Machine Intelligence 5.5 (2023): 485-496.

Comment

Thank you for providing such a detailed response.

I believe that the clarification on the jitter noise should be explicitly stated in the manuscript.

However, I am inclined to stand by my assessment as I believe the writing of the paper (both the main paper and appendix) needs to be substantially improved to enable readers to understand and potentially, utilize the novel tokenization scheme presented.

Comment

Thank you for your valuable feedback!

We have revised the method section of the main paper to include a higher-level description and have posted it in the global rebuttal. Additionally, we have updated the appendix, refining the pseudocode based on your suggestions by removing undefined or poorly defined function calls and clarifying key aspects of the operations.

We hope these revisions make both the main paper and the appendix easier to understand. If there are still any clarity issues, please let us know, and we would greatly appreciate your continued feedback.

Below is the revised description of the methods in the Appendix:

Revised Pseudo code

A.2.1 Non-Maximum Suppression

This pseudocode describes the selection process for optimal regions based on scores, ensuring no overlap, and using kernel sizes to guide the selection.

The input consists of: positions (possible nucleotide positions), kernel sizes (possible kernel sizes), and scores (a score for each (position, kernel size) pair indicating the existence of a basic unit of the given size at the given position). The output is the positions (in nucleotide coordinates) of the selected basic units with their corresponding kernel sizes.


Algorithm 1 Detailed Non-Maximum Suppression for Basic Unit Placement

  1. procedure NMS(positions $P = [1, 2, \ldots, l]$, kernel sizes $L \in \mathbb{N}^n$, scores $S \in \mathbb{R}^{l \times n}$)
  2.     Sort all $(P_i, L_j)$ pairs by $S_{ij}$ in descending order, where $i \in [1, 2, \ldots, l]$ and $j \in [1, 2, \ldots, n]$.
  3.     Initialize an output array with zeros $M \in \mathbb{N}^l$.
  4.     for each $(P_i, L_j)$ pair in the sorted pairs do
  5.         Calculate the start and end of the region at $P_i$ with width $L_j$.
  6.         if the region is not overlapped with any region in $M$ then
  7.             $M_{P_i} \leftarrow L_j$
  8.         end if
  9.     end for
  10.    return $M$.
  11. end procedure
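For concreteness, here is a minimal NumPy sketch of the greedy selection in Algorithm 1; the function and variable names are our own illustration and the authors' released implementation may differ.

```python
import numpy as np

def nms_1d(scores: np.ndarray, kernel_sizes: np.ndarray) -> np.ndarray:
    """Greedy 1D non-maximum suppression over (position, kernel size) pairs.

    scores:       (l, n), score for a unit of length kernel_sizes[j] centred at position i.
    kernel_sizes: (n,), candidate basic-unit lengths.
    Returns M:    (l,), M[i] = length of the selected unit centred at i, or 0 if none.
    """
    l, n = scores.shape
    M = np.zeros(l, dtype=int)
    occupied = np.zeros(l, dtype=bool)              # nucleotides already covered by a unit
    for flat in np.argsort(-scores, axis=None):     # highest score first
        i, j = divmod(int(flat), n)
        k = int(kernel_sizes[j])
        start = max(0, min(i - k // 2, l - k))      # clip the window to the sequence
        if not occupied[start:start + k].any():     # keep only non-overlapping regions
            M[i] = k
            occupied[start:start + k] = True
    return M
```

For example, `nms_1d(np.random.rand(512, 3), np.array([3, 6, 9]))` returns a length-512 mask whose non-zero entries mark the centres of the selected units.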

A.2.2 Sparse Mixture of Convolution Experts


This pseudocode outlines the selective activation of convolutions at positions determined by Non-Maximum Suppression, using corresponding kernel sizes.

The input consists of: input (embeddings of nucleotides), positions (in nucleotide coordinates) of selected basic units with their corresponding kernel sizes. The output is the embeddings of the selected basic units.

Algorithm 2 Detailed Sparse Convolution

  1. procedure Sparse Convolution(input $X \in \mathbb{R}^{l \times d}$, selected positions with kernel sizes $M \in \mathbb{N}^l$)
  2.     Initialize an output array with zeros $U \in \mathbb{R}^{k \times d}$.
  3.     Initialize a counter $cnt = 0$.
  4.     for each $i$ in $[1, 2, \ldots, l]$ do
  5.         if $M_i \neq 0$ then
  6.             $cnt \leftarrow cnt + 1$.
  7.             Extract the segment of $X$ centred at $i$ with size $M_i$.
  8.             $U_{cnt} \leftarrow$ Calculate the dot product of the segment with the convolution kernel of size $M_i$.
  9.         end if
  10.    end for
  11.    return $U$.
  12. end procedure
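A rough PyTorch sketch of the same per-unit expert embedding, written as a plain loop for readability; representing each expert as a (k, d, d) weight tensor is our assumption, not necessarily the authors' exact parameterisation.

```python
import torch

def sparse_expert_conv(x: torch.Tensor, mask: torch.Tensor, experts: dict) -> torch.Tensor:
    """Embed each selected basic unit with the expert matching its length.

    x:       (l, d) nucleotide embeddings.
    mask:    (l,)   mask[i] = length of the unit centred at i (0 means no unit).
    experts: maps unit length k -> weight tensor of shape (k, d, d).
    Returns: (num_units, d) basic-unit embeddings, in sequence order.
    """
    l, d = x.shape
    units = []
    for i in range(l):
        k = int(mask[i])
        if k == 0:
            continue
        start = max(0, min(i - k // 2, l - k))      # clip the window to the sequence
        segment = x[start:start + k]                # (k, d) nucleotides inside the unit
        units.append(torch.einsum('kd,kde->e', segment, experts[k]))
    return torch.stack(units) if units else x.new_zeros(0, d)
```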

A.2.3 Deformable Convolution

This pseudocode details how deformable convolution dynamically adjusts based on input features by modifying its parameters for each input segment.

The input consists of: input (embeddings of the selected basic units). The output is the embeddings of the final tokens.


Algorithm 3 Detailed Deformable Convolution

  1. procedure Deformable Convolution(input $U \in \mathbb{R}^{k \times d}$)
  2.     Initialize an output array with zeros $Y \in \mathbb{R}^{k \times d}$.
  3.     for each $i$ in $[1, 2, \ldots, k]$ do
  4.         Calculate offsets $\Delta P_i \in \mathbb{R}^f$ based on $U_i$.
  5.         Calculate modulation factors $\Delta M_i \in \mathbb{R}^f$ based on $U_i$.
  6.         Extract the deformed segment of $U$ centred at $i$ according to $\Delta P_i$.
  7.         Weight the segment by $\Delta M_i$.
  8.         $Y_i \leftarrow$ Calculate the dot product of the segment with the convolution kernel of size $f$.
  9.     end for
  10.    return $Y$.
  11. end procedure
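Similarly, a rough sketch of the 1D deformable step in Algorithm 3. The offset and modulation predictors are assumed here to be plain linear layers (e.g. `torch.nn.Linear(d, f)`), and bilinear interpolation reduces to a two-neighbour linear blend in one dimension; this is illustrative only, not the authors' implementation.

```python
import torch

def deformable_conv_1d(u, weight, offset_proj, modulation_proj):
    """Rough 1D deformable convolution over basic-unit embeddings.

    u:               (k, d) basic-unit embeddings.
    weight:          (f, d, d) convolution kernel of size f.
    offset_proj:     e.g. torch.nn.Linear(d, f), predicts fractional offsets per tap.
    modulation_proj: e.g. torch.nn.Linear(d, f), predicts modulation factors per tap.
    """
    k, d = u.shape
    f = weight.shape[0]
    taps = torch.arange(f, dtype=u.dtype) - (f - 1) / 2       # regular kernel grid around i
    offsets = offset_proj(u)                                  # (k, f) learned deformations
    modulation = torch.sigmoid(modulation_proj(u))            # (k, f) per-tap weights
    out = u.new_zeros(k, d)
    for i in range(k):
        pos = (i + taps + offsets[i]).clamp(0, k - 1)         # deformed sampling locations
        lo, hi = pos.floor().long(), pos.ceil().long()
        frac = (pos - lo.to(pos.dtype)).unsqueeze(-1)         # (f, 1)
        sampled = (1 - frac) * u[lo] + frac * u[hi]           # linear interpolation, (f, d)
        sampled = sampled * modulation[i].unsqueeze(-1)       # modulate each tap
        out[i] = torch.einsum('fd,fde->e', sampled, weight)
    return out
```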

Comment

The revised pseudocodes are much easier to read. For future reference, could you try to ensure that your LaTeX markup renders on OpenReview (using the "Preview" tab); reading through math and/or pseudocodes in markup is challenging at the best of times.

I also appreciate the effort made to clarify each of the steps in the tokenization strategy. Please include this in your camera-ready submission as well. In light of these changes, I have amended my score.

Comment

Thank you for your valuable suggestion on improving clarity, and for your positive feedback on our efforts!

We understand the importance of ensuring that LaTeX code renders correctly on all platforms. On our end, the LaTeX code renders successfully using both Edge and Chrome browsers on a Windows 11 system. However, we’ve noticed that rendering issues may occur on mobile devices and that it will not render successfully in some email systems. We recommend viewing the document on a PC or refreshing the page, which may resolve the issue.

We appreciate your support and are always open to further suggestions for improving the paper.

Author Response

We thank the reviewer for the insightful feedback.

W1: Clarity in Method Description:

  • We appreciate your feedback regarding the need for a more self-contained and more understandable manuscript. The modified method section with high-level motivation and description added is presented in the global rebuttal. We will modify the appendix in the final version for clarity and put the detailed implementation in the code as well.

W2: Tokenization Strategy and Jitter Noise:

  • Following Switch Transformer [r1], we only add stochasticity in training but not in inference, thus making the inference deterministic.
  • Furthermore, this stochastic element is minor with multiplicative noise sampled uniformly between 1-0.01 and 1+0.01, adding a little jitter to the probability distribution when determining the tokenization. It only adds slight variability during training and will not affect the tokenization results significantly.
  • The jitter noise here may serve as a kind of data augmentation as well. As reported in the Switch Transformer paper, the noise in the tokenization can be beneficial to the model.
  • During the training stage, as a sequence-based model, the model does not have direct access to context information; the jitter noise may make the model more robust when transferred to other contexts and keep it from overfitting to the sequence alone (a minimal sketch of this training-only jitter is given below).
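A minimal sketch of this training-only multiplicative jitter on the gating logits (function name and placement are our own; the noise range follows the 1 ± 0.01 figure quoted above):

```python
import torch

def gate_with_jitter(logits: torch.Tensor, training: bool, eps: float = 0.01) -> torch.Tensor:
    """Multiplicative jitter on gating logits, applied only during training.

    Noise is drawn uniformly from [1 - eps, 1 + eps]; inference stays deterministic.
    """
    if training:
        logits = logits * torch.empty_like(logits).uniform_(1.0 - eps, 1.0 + eps)
    return logits.softmax(dim=-1)
```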

W3: Computational Complexity Discussion:

  • The integration of a mixture of convolution experts and deformable convolution introduces an increased computational overhead initially due to the O(n log(n)) complexity of the learned tokenization mechanism (where n represents the number of nucleotides). This complexity is mitigated by the substantial reduction in sequence length after tokenization, which decreases the number of tokens processed by subsequent transformer layers with quadratic costs.
  • The detailed computational costs of the models are outlined as follows (averaged across 5 samples of sequence length 510):
| Model | FLOPs (G) | MACs (G) | Parameters (M) | Number of Tokens |
| --- | --- | --- | --- | --- |
| DNABERT2 [r2] | 24.80 | 12.39 | 117.07 | 104.2 |
| Nucleotide Transformer v2 100M [r3] | 16.63 | 8.31 | 97.89 | 86 |
| DNABERT [r4] | 99.48 | 49.70 | 89.20 | 507 |
| HyenaDNA tiny d256 [r5] | 1.67 | 0.83 | 21.64 | 511 |
| HyenaDNA tiny | 0.441 | 0.219 | 0.436 | 511 |
| MxDNA | 35.94 | 17.93 | 100.09 | 512 -> 101.6 |
| Learnt Tokenization Module | 0.914 | 0.446 | 11.69 | 512 -> 101.6 |
| Single Nucleotide Baseline | 94.85 | 47.38 | 92.95 | 512 |

Q1: Discontinuous and Overlapping Tokens:

  • The discontinuous property means that tokens may not be contiguous in the original nucleotide sequence, while they can still overlap with each other at the token level. This is exactly the idea that sequence tokens need not align with the genomic sequence. We state it this way because we want to keep the original statement in [r6]. We will refine our explanation to avoid confusion and add your phrasing in the final paper.

Q2: Statistical Significance of Experimental Runs:

  • It is important in machine learning to draw a conclusion based on the average results of multiple runs to reduce the effect of randomness. Also, we believe it is better to provide standard deviations based on multiple runs to give a sense of the variance in the results, though we acknowledge that three runs may not be sufficient to provide statistically meaningful values.
  • However, the computational cost of training these models multiple times is currently prohibitive. Indeed, there are some data points in the paper that fall within the 1-SD error bars of the top 2 results, but we believe that the overall trends of average results are still clear.

Q3: Low Variance in MxDNA's Performance:

  • We were initially surprised by the lower standard deviations observed in MxDNA's performance compared to other models, but we decided to report the results as they are. In fact, as mentioned in line 246, adding noise makes only a small difference to the tokenization results during training. Also, as reported in Table 11 of the Switch Transformer paper, noise in the tokenization can be beneficial to the model.
  • Furthermore, following Switch Transformer, we only add jitter noise during the training stage; during inference, the model is deterministic with no noise added. This kind of data augmentation may contribute to the lower standard deviations.

Q4: Nucleotide-Level Resolution:

  • The tasks are all classification tasks at sequence level, and are not enforced to require nucleotide-level resolution.
  • Indeed, different tasks may benefit from different tokenization strategies such as single nucleotide, k-mer or BPE. We believe this is a feature of different tokenization methods rather than an unfair bias. For example, although single nucleotide tokenization enjoys single-nucleotide resolution, it leads to much more computation compared to non-overlapping k-mer or BPE.
  • Consequently, it is rather an advantage of MxDNA that it can capture different levels of information (single-nucleotide resolution, token level, sequence level) in the same model without much additional computation. This adaptive feature helps MxDNA perform better on various downstream tasks.
Review
Rating: 6

DNA language models currently use standard tokenization schemes from NLP that might be unsuitable for modelling DNA sequences. This paper proposes a tokenization scheme called MxDNA that is specifically designed for DNA language modelling. MxDNA presents a learnable tokenization scheme that uses a mixture of convolutional experts and deformable convolution to learn task-aware tokens using gradient descent. This scheme can handle the variable and often ambiguous length of meaningful DNA elements which can also overlap each other. The authors demonstrate that a modified Nucleotide Transformer with the MxDNA tokenization outperforms existing models on most benchmarks from Genome Benchmarks and the ones from Nucleotide Transformer.

Strengths

  • Originality: The main novel contribution of this paper is the MxDNA tokenization method that enables the creation of learnable DNA tokens. Related works have been cited, including those on recent DNA language models.

  • Quality: The benchmarking results in the paper are generally thorough - the authors benchmark MxDNA against existing state-of-the-art DNA language models on established tasks and show that MxDNA generally outperforms existing models.

  • Clarity: The need for a learnable tokenization scheme is well-motivated but in general, I believe that the clarity of this paper needs to be significantly improved for acceptance. I have listed my concerns in the next section.

  • Significance: MxDNA makes a case for its usage in DNA language models based on its performance. Since DNA language modelling is becoming an increasingly popular research area in computational genomics, this work could be useful to the community if the clarity of this paper is improved.

Weaknesses

My main concerns with this paper are related to the clarity of the information presented. The lack of clarity in method descriptions did not allow me to fully comprehend it and I do not think one could reproduce the authors' results using these descriptions.

Parts of the paper that need to be improved are listed below:

  • Although works related to DNA language modelling are cited, there isn't a clear motivation for using the specific modules of MxDNA. From the experimental results, we see that these modules work well when combined but a reader would want to know how the authors arrived at this formulation.
  • The main description of the tokenization module in Section 3.2 is very dense. Although I am familiar with language models, genomics, and deep learning more generally, I was not able to understand how this core module works. The appendix contains implementation details but these too are very dense. I believe that the paper will greatly benefit from a clear description of the module that first motivates the techniques being employed before providing a high-level description of each of these techniques. Then, more details about how these modules are used in MxDNA can be presented.
  • Section 4.1 on how the pretraining is performed omits many essential details. For example, is the full human genome used for training? Are repeat regions removed? What is the length of sequences being modelled and how is the masking performed (this could be addressed by a clearer Section 3.2)? I could not find any details beyond hyperparameters in the appendix.
  • The "Sample Level" part of Section 4.4 (including Figure 3) is confusing. Why is it desirable for two forward passes with the same sequence and model to yield different results? What is the source of this stochasticity? Shouldn't we expect the forward pass to be deterministic?

Questions

  1. During pretraining, what was the length of the sequences being modelled? I could not find this in the main paper or the appendix.
  2. When you mask out 15% of the tokens for masked language modelling, is this masking performed at the nucleotide level or after the tokens have been generated using MxDNA? If it's after the tokens have been generated using MxDNA, how do you prevent the initial transformer layers (before MxDNA, that process nucleotide level sequences) from not attending to the masked tokens?
  3. In the appendix, it is mentioned that the tiny HyenaDNA model was used for benchmarking. One of the central contributions of HyenaDNA was to increase the sequence length during modelling, so how does MxDNA compare to the larger HyenaDNA models that were trained with longer sequences?
  4. In lines 244-246, it is mentioned that two forward passes with the same sequence and model can yield different results. Why is this happening? Shouldn't we expect the forward pass to be deterministic?

Limitations

The authors have identified the limitations of their approach and I do not foresee any negative societal impacts.

Comment
| Model | Nucleotide Transformer Benchmarks Avg. | Histone Markers Avg. | Regulatory Annotation Avg. | Splice Site Annotation Avg. |
| --- | --- | --- | --- | --- |
| hyenadna-tiny-1k-seqlen-d256 | 75.96 | 65.24 | 84.87 | 96.82 |
| hyenadna-large-1m-seqlen | 64.56 | 45.30 | 84.37 | 95.74 |

Q4: Forward Pass Determinism:

  • The stochasticity described in Section 4.4 and Figure 3 arises from the jitter noise added. Following Switch Transformer, we only add stochasticity in training but not in inference, thus making the inference deterministic.
  • Furthermore, this stochastic element is minor, with multiplicative noise sampled uniformly between 1-0.01 and 1+0.01, adding a little jitter to the probability distribution when determining the tokenization. It only adds slight variability during training and will not affect the tokenization results significantly.

[r1] Vu, Mai Ha, et al. "Linguistically inspired roadmap for building biologically reliable protein language models." Nature Machine Intelligence 5.5 (2023): 485-496.
[r2] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.
[r3] Ji, Yanrong, et al. "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome." Bioinformatics 37.15 (2021): 2112-2120.
[r4] Nguyen, Eric, et al. "Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution." Advances in neural information processing systems 36 (2024).
[r5] Dai, Jifeng, et al. "Deformable convolutional networks." Proceedings of the IEEE international conference on computer vision. 2017.
[r6] Zhu, Xizhou, et al. "Deformable convnets v2: More deformable, better results." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

Comment

Dear Reviewer,

Thank you once again for your valuable suggestions. Following your advice, we have revised the method section of the main paper to include a higher-level description, which we have presented in the global rebuttal. Additionally, we have enhanced the clarity of the pseudocode and included this in our response to Reviewer eT4M.

We hope that these revisions address your concerns as well. If there are any remaining issues with clarity, please do not hesitate to let us know. We would greatly appreciate your continued feedback.

Comment

Thank you for the detailed response! In light of the revised description and additional experiments, I am raising my score.

Comment

Thank you so much for your support and for raising your score to Weak Accept. We are pleased that the revised description and additional experiments addressed your concerns, and will include all these descriptions and experiments in the final version. If you have any further suggestions or issues, please don't hesitate to let us know. We are always open to feedback and committed to improving the paper.

Author Response

We thank the reviewer for the valuable feedback.

W1: Related Works and Motivation:

  • We appreciate your feedback emphasizing the need for clearer motivations behind the modules used in MxDNA. Our approach is fundamentally inspired by the desired properties for genomic tokenization—Meaningful, Discontinuous, Overlapping, and Ambiguous—as outlined in the literature [r1]. These characteristics guide our development of a learnable tokenization method tailored to meet these specific genomic needs.
  • Our initial step in addressing these requirements involves aggregating neighboring nucleotides into meaningful basic units. Initially, we considered using strided convolutions because they excel at capturing local features within small windows. However, the fixed kernel size and stride of standard strided convolutions limit their adaptability, which is crucial for learning a dynamic tokenization strategy. To overcome this, we explore using a variety of convolution kernel sizes applied adaptively across different sequence positions, drawing from the Mixture of Experts (MoE) framework. In our adaptation, we replace traditional MLP experts with convolution experts that have varying kernel sizes, allowing us to capture basic units of different lengths effectively.
  • To select the most significant tokens from these convolutions, we employ a one dimensional non-maximum suppression technique on the gating logits, which helps refine the selection of basic units.
  • Following the aggregation of basic units, we seek a method to handle more complex genomic patterns that go beyond simple segmentation. This leads us to integrate Deformable Convolution [r5, r6], known for its capability to model complex local geometric transformations. The one-dimensional adaptation of deformable convolution is particularly well-suited for genomic sequences, enabling the model to address the discontinuous property by predicting offsets that link distal basic units and to handle the overlapping property by reusing basic units across different tokens.
  • This comprehensive design allows MxDNA to effectively capture and represent the complex structural dynamics of genomic data.

W2: Tokenization Module Clarity

  • We acknowledge that the description of the tokenization module in Section 3.2 is dense and could be clearer. This module is crucial as it learns to tokenize the sequence end-to-end together with the rest of the model. The modified method section with high-level motivation and description added is presented in the global rebuttal. We will modify the appendix in the final version for clarity and put the detailed implementation in the code as well.

W3: Pretraining Details:

  • We largely follow the pre-training procedure of DNABERT. We use the full human genome for pretraining.
  • We removed all sequence gaps and unannotated regions and extracted 70 to 510-nt-long sequences as training data. We do not remove the repeat regions.
  • All masking happens at the initial input stage (single nucleotides, 6-mer tokens, or BPE tokens). For models using single-nucleotide tokenization, non-overlapping 6-mers, or BPE, masking is performed randomly and masks out 15% of all tokens except special tokens. For models using overlapping 6-mers, we follow the strategy used in DNABERT, where contiguous k-length spans of certain k-mers are masked, totalling ~15% of the tokens (an illustrative sketch of the nucleotide-level masking is given below).
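For illustration only, a minimal sketch of random nucleotide-level masking under the setup described above (our own code; BERT's 80/10/10 mask/random/keep split is omitted for brevity and may not match the authors' exact recipe):

```python
import torch

def mask_nucleotides(token_ids, mask_id, special_ids, ratio=0.15):
    """Randomly mask ~15% of nucleotide-level tokens, skipping special tokens.

    token_ids:   (batch, length) integer tensor of single-nucleotide ids.
    mask_id:     id of the [MASK] token.
    special_ids: iterable of ids (CLS, SEP, PAD, ...) that must never be masked.
    Returns the corrupted inputs and the MLM labels (-100 = not predicted).
    """
    maskable = torch.ones_like(token_ids, dtype=torch.bool)
    for sid in special_ids:
        maskable &= token_ids != sid
    chosen = (torch.rand_like(token_ids, dtype=torch.float) < ratio) & maskable
    labels = torch.where(chosen, token_ids, torch.full_like(token_ids, -100))
    inputs = torch.where(chosen, torch.full_like(token_ids, mask_id), token_ids)
    return inputs, labels
```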

W4: Stochasticity in Sample Level:

  • The stochasticity described in Section 4.4 and Figure 3 arises from the jitter noise added. Following Switch Transformer [r2], we only add stochasticity in training but not in inference, thus making the inference deterministic.
  • Furthermore, this stochastic element is minor, with multiplicative noise sampled uniformly between 1-0.01 and 1+0.01, adding a little jitter to the probability distribution when determining the tokenization. It only adds slight variability during training and will not affect the tokenization results significantly.

Q1: Sequence Length During Pretraining:

  • The lengths of sequences used in pretraining varied from 70 to 510 nucleotides, largely following the protocol used in DNABERT [r3] (5 to 510-nt-long in DNABERT). This range was chosen to adequately represent the diversity of genomic data while ensuring efficient processing.

Q2: Masking Strategy

  • Masking during the pretraining phase is implemented at the nucleotide level before any tokenization by MxDNA. This approach prevents potential information leakage by ensuring that the initial transformer layers do not have access to masked tokens.
  • Cross-attention mechanisms are used to align the reduced token count from MxDNA with the original sequence length, allowing the model to perform masked language modelling effectively using the learnt tokens.

Q3: Comparison with HyenaDNA:

  • We selected the tiny HyenaDNA [r4] model for benchmarking based on the recommendations of the HyenaDNA authors, who advocate for the use of tiny models and perform an extensive hyperparameter search on each downstream dataset.
  • Their research suggests that training with sequence lengths 2 to 4 times the length of sequences used in downstream tasks typically yields the best performance. Thus, the tiny models are the best choice for most of the downstream tasks in the Nucleotide Transformer Benchmarks and Genomic Benchmarks, since most tasks have sequence lengths of around a few hundred and the tiny models are pretrained with sequences of length 1000.
  • We provide the results of the largest HyenaDNA model on Nucleotide Transformer Benchmarks as follows (Without extensive hyperparameter search: Epochs: 100, Batch Size: 256, Learning Rate: 6e-4, Scheduler: Cosine Decay):
Review
Rating: 6

The paper introduces MxDNA, a novel framework for adaptive DNA sequence tokenization. Unlike traditional tokenization methods borrowed from NLP, MxDNA uses a sparse Mixture of Convolution Experts coupled with deformable convolution to autonomously learn an effective tokenization strategy through gradient descent. This method explicitly considers the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments. The paper demonstrates superior performance on Nucleotide Transformer Benchmarks and Genomic Benchmarks, highlighting MxDNA's effectiveness with less pretraining data and time.

Strengths

  • Originality: The approach of learning tokenization through gradient descent rather than relying on predefined rules is innovative and well-suited to the complexities of genomic data.
  • Clarity: The paper is mostly clear and well-organized, with a logical flow from motivation to method to results.

Weaknesses

  • Important baseline: VQDNA (Li et al., ICML 2024) seems to be highly related work to MxDNA, which should be discussed in the paper.

  • Theoretical Justifications: The theoretical underpinnings of why the learned tokenization method performs better are not fully explored. More rigorous proofs and explanations would strengthen the paper. It would be better with some biological cases.

Questions

  • Can the authors provide more intuitive explanations and visualizations for the sparse Mixture of Convolution Experts and the deformable convolution? Are there some real biological cases that match the learned patterns?

  • Could MxDNA be used as a tokenization method for other sequence models to enhance their performance?

Limitations

The authors have acknowledged the limitations, such as the lack of direct biological validation of the model's tokenization decisions and the challenges with long-range tasks due to the quadratic cost of self-attention. The paper would benefit from a more detailed discussion of these limitations and potential strategies to address them in future work.

Author Response

We thank the reviewer for the valuable feedback.

W1: Discussion of VQDNA:

  • We didn't include VQDNA[r1] in our initial submission because it was published very close to the NeurIPS deadline.
  • We share a similar motivation with VQDNA. Following VQ-VAE, VQDNA employs a convolutional encoder with a vector-quantized codebook to model tokenization, whereas our MxDNA uses a sparse mixture of convolution experts with deformable convolution to model tokenization. VQDNA is pretrained on multi-species data while our MxDNA is pretrained only on the Human Reference Genome. The settings used for benchmarking are also different, but both methods outperform the similar baseline DNABERT2 (BPE) on Histone Marker Prediction (Epigenetic Marks Prediction).
  • We will add a detailed comparison with VQDNA in the final version of our paper.

W2: Theoretical Justifications:

  • The motivation of MxDNA is based on the fact that humans do not understand the DNA language well, but the model may do a better job than humans. Providing concrete theoretical justifications for such an intuition-based approach is challenging.
  • We use t-SNE to cluster tokens based on genomic functions as in the PDF. We use four types of data in the Nucleotide Transformer Benchmarks as input ("H3" for Histone Marker, "enhancers" for Enhancer, "promoter_all" for Promoter and "splice_sites_all" for Splice Site), and analyse the output embeddings of different pretrained models at the token level. As shown in the figure, without any finetuning, the token embedding distribution of MxDNA differs across sequences with different functions: the tokens of Histone Marker, Promoter and Splice Site form unique clusters, while for all other foundation models, the tokens do not form clusters as clear as MxDNA's.
  • There is also a token length distribution visualization in our paper which shows different patterns on different downstream tasks. It is also very different from the token length distributions of k-mer and BPE.

Q1: Intuitive Explanations and Biological Cases:

  • Detection of Basic Units: The design of the basic unit recognition is similar to the process of object detection in computer vision, where the model learns to detect objects in an image. A detection model proposes several bounding boxes, eliminates duplicate detections and selects the most relevant bounding boxes by non-maximum suppression. The case of MxDNA is similar, with the data being 1D instead of 2D. Then, each bounding box is embedded through a convolution kernel of the corresponding kernel size, giving the embedding of the basic units.
  • Deformable Convolution for Adaptation: Following the initial detection, deformable convolution is utilized to address the dynamic and irregular patterns found in genomic data. Unlike standard convolutions with fixed geometries, deformable convolution adjusts its receptive fields dynamically. This flexibility is critical for accurately modeling discontinuous and overlapping genomic features.
  • Some biological analyses, including the token embedding distribution and the token length distribution, are discussed in W2.

Q2: Extension to Other Sequence Models:

  • Given the architectural similarities with existing genomics models, primarily BERT-like frameworks with minor modifications, MxDNA's tokenization approach is likely to enhance the performance of other genomic sequence models.
  • Additionally, we have extended our methodology to RNA sequences, utilizing the training procedure from the recently introduced Beacon [r2] framework. The superior performance on downstream datasets underscores MxDNA's potential to significantly improve performance across different types of biological sequences:
| Model | Isoform Accuracy (R^2 %) | Mean Ribosome Loading (R^2 %) |
| --- | --- | --- |
| Beacon-B512 | 72.00 | 72.35 |
| MxDNA | 81.30 | 81.21 |

L1: more detailed discussion of limitations and potential strategies:

  • Direct biological validation: Our method starts from the fact that humans do not understand the DNA language well, but the model may do a better job than humans. For the design of specific methods, we are inspired by the discontinuous, overlapping and ambiguous properties proposed by [r3]. However, we only validate our results on two benchmarks empirically, and direct biological validation is lacking. In the future, we may use the regulatory activity data in [r4] to perform clustering or correlation analysis on the learnt tokens and biological traits.
  • Long-range validation: As presented in the table below, our method itself reduces the sequence length and the total computation. It is the quadratic self-attention that prevents us from evaluating on more long-range tasks. In the future, we will consider strategies similar to those used by [r5]: hybrid architectures that are generally cheap in computation while maintaining strong long-range interaction ability.
| Model | FLOPs (G) | MACs (G) | Parameters (M) | Number of Tokens |
| --- | --- | --- | --- | --- |
| MxDNA | 35.94 | 17.93 | 100.09 | 512 -> 101.6 |
| Learnt Tokenization Module | 0.914 | 0.446 | 11.69 | 512 -> 101.6 |
| Single Nucleotide Baseline | 94.85 | 47.38 | 92.95 | 512 |
Comment

[r1] Li, Siyuan, et al. "VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling." Forty-first International Conference on Machine Learning.
[r2] Ren, Yuchen, et al. "BEACON: Benchmark for Comprehensive RNA Tasks and Language Models." arXiv preprint arXiv:2406.10391 (2024).
[r3] Vu, Mai Ha, et al. "Linguistically inspired roadmap for building biologically reliable protein language models." Nature Machine Intelligence 5.5 (2023): 485-496.
[r4] Chen, Kathleen M., et al. "A sequence-based global map of regulatory activity for deciphering human genetics." Nature genetics 54.7 (2022): 940-949.
[r5] Nguyen, Eric, et al. "Sequence modeling and design from molecular to genome scale with Evo." BioRxiv (2024): 2024-02.

Comment

Thanks for your rebuttal. Also, do you guys have an anonymous link to show the code or trained checkpoint? I'm relatively curious about the exact implementation.

Comment

Thank you once again for your valuable suggestions.

Regarding the exact implementation of the Learnt Tokenization Module (including the sparse Mixture of Convolution Experts and Deformable Convolution), we have provided an anonymous link to the code: https://anonymous.4open.science/r/Rebuttal-mxdna/. This repository contains the core implementation of MxDNA. The full implementation of MxDNA will be released upon acceptance of the paper.

We hope this helps address your concerns. If you have any further suggestions or issues, please don't hesitate to let us know.

Review
Rating: 7

The paper introduces MxDNA, a novel framework designed to autonomously learn effective DNA tokenization strategies through gradient descent. Unlike traditional methods borrowed from natural language processing, MxDNA employs a sparse Mixture of Convolution Experts and deformable convolution to address the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments. MxDNA demonstrates superior performance on Nucleotide Transformer Benchmarks and Genomic Benchmarks with less pretraining data and time compared to existing models.

Strengths

Originality:

  • Introduces MxDNA, a novel framework for adaptive DNA sequence tokenization, leveraging a mixture of convolution experts and deformable convolution, which is a significant departure from traditional NLP-based tokenization methods.

Quality:

  • Demonstrates superior performance on Nucleotide Transformer Benchmarks and Genomic Benchmarks with less pretraining data and time, highlighting the robustness and effectiveness of MxDNA.
  • Provides thorough empirical evaluation with comprehensive benchmarks, showcasing state-of-the-art performance.

Clarity:

  • The paper is well-organized, with clear explanations of the novel methods and their implementation.
  • Includes visual aids and diagrams to help elucidate the complex processes involved in the MxDNA framework.

Significance:

  • Offers a new perspective on DNA tokenization, potentially leading to broader applications in genomics and new biological insights.
  • The learned tokenization strategy distinct from existing methods could pave the way for future advancements in genomic sequence modeling.

Weaknesses

1. Evaluation on Long-Range Tasks

  • Propose Alternatives: Discuss integrating sub-quadratic attention mechanisms like Linformer or Longformer to reduce computational costs and allow for long-range genomic analysis.
  • Hybrid Models: Suggest hybrid architectures combining local and global attention to balance computational efficiency and capture of long-range dependencies.

2. Analysis of Tokenization Behavior

  • Detailed Token Analysis: Perform in-depth analysis by clustering tokens based on genomic functions and locations, using visualization tools like t-SNE.
  • Functional Correlation: Correlate tokens with known genomic features and discuss the implications of their alignment or misalignment. Visualization and Interpretability: Include visualizations of token distributions and introduce interpretable metrics to evaluate token significance.

3. Clarity in Methodological Innovations

  • Enhanced Diagrams and Pseudocode: Use detailed diagrams and pseudocode to clarify the operation of mixture of convolution experts and deformable convolution.
  • Algorithm Descriptions: Expand the appendix to include detailed, step-by-step algorithm descriptions.
  • Examples and Glossary: Provide concrete examples of the tokenization process and include a glossary of terms to aid understanding.

Questions

  1. What are the computational trade-offs of using a mixture of convolution experts and deformable convolution?

Limitations

The authors have addressed the technical limitations well but could enhance their discussion on broader societal impacts.

  • Acknowledgement: The authors have acknowledged the limitations of their work, such as the lack of biological validation and challenges in handling long-range dependencies due to the quadratic cost of self-attention.

  • Proposals for Future Work: They propose future research directions to address these limitations, such as integrating sub-quadratic attention mechanisms and hybrid models.

  • Discussion: The paper does not explicitly address potential negative societal impacts, such as data privacy concerns or ethical considerations in genomic research.

Comment
| Model | FLOPs (G) | MACs (G) | Parameters (M) | Number of Tokens |
| --- | --- | --- | --- | --- |
| MxDNA | 35.94 | 17.93 | 100.09 | 512 -> 101.6 |
| Learnt Tokenization Module | 0.914 | 0.446 | 11.69 | 512 -> 101.6 |
| Single Nucleotide Baseline | 94.85 | 47.38 | 92.95 | 512 |

[r1] Zhou, Jian, and Olga G. Troyanskaya. "Predicting effects of noncoding variants with deep learning–based sequence model." Nature methods 12.10 (2015): 931-934.
[r2] Nguyen, Eric, et al. "Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution." Advances in neural information processing systems 36 (2024).
[r3] Dalla-Torre, Hugo, et al. "The nucleotide transformer: Building and evaluating robust foundation models for human genomics." BioRxiv (2023): 2023-01.
[r4] Ji, Yanrong, et al. "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome." Bioinformatics 37.15 (2021): 2112-2120.
[r5] Zhou, Zhihan, et al. "DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes." The Twelfth International Conference on Learning Representations.

Comment

Thanks for the clarification. I'll keep my recommendation for acceptance.

Comment

Thank you for your support and for keeping your recommendation for acceptance. We truly appreciate your feedback throughout this process.

Author Response

We thank the reviewer for the insightful feedback. We have acknowledged the points raised in W1/W2/W3 and discussed them in the limitations section of our paper.

W1: Evaluation on Long-Range Tasks

  • We evaluate our model, with a pretrained sequence length of 510, on the long-range task proposed by DeepSEA [r1] with a sequence length of 1000 (which is used by HyenaDNA [r2] and Nucleotide Transformer [r3]). The results in AUROC are as follows:
| Model | TF | DHS | HM | Avg. | Pretraining Data |
| --- | --- | --- | --- | --- | --- |
| DeepSEA | 95.8 | 92.3 | 85.6 | 91.2 | NA |
| HyenaDNA | 96.4 | 93.0 | 86.3 | 91.9 | Human Reference Genome |
| DNABERT [r4] | 95.2 | 91.9 | 82.3 | 89.8 | Human Reference Genome |
| DNABERT2 [r5] | 96.2 | 92.6 | 86.3 | 91.7 | Multi-species |
| NTv2 100M | 96.4 | 92.7 | 86.6 | 91.9 | Multi-species |
| MxDNA | 96.5 | 92.9 | 86.3 | 91.9 | Human Reference Genome |

It achieves comparable performance to HyenaDNA and Nucleotide Transformer v2 100M, and outperforms DNABERT, DNABERT2 and DeepSEA. Notably, DNABERT2 and Nucleotide Transformer v2 100M are pre-trained on multi-species data, while MxDNA is pre-trained on human reference genome only. Also, HyenaDNA is aimed at long-range tasks, and it is expected to perform better on this task.

  • Our research mainly focuses on learnt tokenization. Since the pretraining of foundation models is costly and the rebuttal period is limited, we will explore the combination of sub-quadratic attention mechanisms and hybrid architectures in the future.

W2: Analysis of Tokenization Behavior:

  • We use t-SNE to cluster tokens based on genomic functions as in the PDF. We use four types of data in the Nucleotide Transformer Benchmarks as input ("H3" for Histone Marker, "enhancers" for Enhancer, "promoter_all" for Promoter and "splice_sites_all" for Splice Site), and analyse the output embeddings of different pretrained models at the token level. As shown in the figure, without any finetuning, the token embedding distribution of MxDNA differs across sequences with different functions: the tokens of Histone Marker, Promoter and Splice Site form unique clusters, while for all other foundation models, the tokens do not form clusters as clear as MxDNA's.
  • Though it would be good if the tokens correlated with known genomic features such as motifs, we followed the motif discovery pipeline of DNABERT and the discovered motifs do not match existing motifs. This mismatch may imply that the model has learnt its own way of interpreting genomic sequences, distinct from the way biological experiments interpret them.
  • For the token distribution analysis, there is a token length distribution visualization in our paper which shows different patterns on different downstream tasks. Additionally, we use t-SNE to visualize the token distribution in the embedding space and show that the clusters formed by MxDNA are clearer than those of other models (a rough sketch of such an analysis is given below). As for interpretable metrics, we may consider the Silhouette Coefficient or other metrics to evaluate the quality of the clusters in the future.
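For reference, a rough scikit-learn sketch of how such a token-level t-SNE projection can be produced from frozen model embeddings (our own illustration, not the authors' analysis script):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_token_embeddings(embeddings_by_task: dict):
    """Project token embeddings from several tasks into 2D for cluster inspection.

    embeddings_by_task: task name -> (num_tokens, hidden_dim) array of token-level
    embeddings from a frozen pretrained model.
    Returns the 2D coordinates and a parallel list of task labels for colouring.
    """
    labels = [task for task, emb in embeddings_by_task.items() for _ in range(len(emb))]
    stacked = np.concatenate(list(embeddings_by_task.values()), axis=0)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(stacked)
    return coords, labels
```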

W3: Clarity in Methodological Innovations:

  • We acknowledge the need for enhanced clarity in our methodological innovations. We will try to improve the clarity of our main paper and appendix. The modified method section with high-level motivation and description added is presented in the global rebuttal. A glossary of terms is given below:
| Term | Description |
| --- | --- |
| $l$ | Number of nucleotides |
| $d$ | Dimension of hidden states |
| $n$ | Number of experts |
| $k$ | Number of basic units |
| $f$ | Deformable convolution kernel size |
| $i$ | Indices of nucleotides or tokens |
| $j$ | Indices of experts |
| $\mathbf{X} \in \mathbb{R}^{l \times d}$ | Input nucleotide sequence |
| $\mathbf{S} \in \mathbb{R}^{l \times n}$ | Expert confidence scores |
| $\mathbf{L} \in \mathbb{N}^{n}$ | Expert kernel sizes |
| $\mathbf{M} \in \mathbb{N}^{l}$ | Basic units existence mask |
| $\mathbf{E} \in \{\text{Conv1D}\}^{n}$ | Expert convolution kernels |
| $\mathbf{U} \in \mathbb{R}^{k \times d}$ | Basic units |
| $\Delta \mathbf{P} \in \mathbb{R}^{k \times f}$ | Deformable convolution offsets |
| $\Delta \mathbf{M} \in \mathbb{R}^{k \times f}$ | Deformable convolution modulations |
| $\mathbf{T} \in \mathbb{R}^{k \times d}$ | Final tokens |

Q1: What are the computational trade-offs of using a mixture of convolution experts and deformable convolution?:

  • The integration of a mixture of convolution experts and deformable convolution introduces an increased computational overhead initially due to the O(n log(n)) complexity of the learned tokenization mechanism (where n represents the number of nucleotides). This complexity is mitigated by the substantial reduction in sequence length after tokenization, which decreases the number of tokens processed by subsequent transformer layers.
  • Additionally, our implementation in C++ using Pybind11 leverages batch-wise parallelism, which provides a significant speed advantage over typical Python implementations. This reduction in sequence length leads to diminished overall computational demands, as transformer computations generally scale quadratically with the number of tokens.
  • The detailed computational costs of the models are outlined as follows (Average Across 5 samples of sequence length of 510):
Author Response

General Description:

Thanks for the valuable feedback provided by all reviewers. We appreciate all the reviewers JPfb (R1), nT57 (R2), RCJz (R3) and eT4M (R4) for recognizing our contributions: (1) innovative method (R1, R2, R3, R4), (2) thorough experiments (R1, R3, R4). Besides, the concerns are mainly concentrated on (1) presentation clarity (R3, R4), (2) more experiments and biological analysis (R1, R2). Under the NeurIPS policy, we will follow reviewers' suggestions to refine the method section of the paper at our discretion.

Additional Experiments:

In the responses, we show additional experimental results and analysis including:

  1. Computational cost evaluation including flops, macs, parameters and tokens (R1: Q1, R4: W3) (Table r1)
  2. Biological analysis via t-SNE clustering of tokens based on genomic functions (R1: W2, R2: W2) (Figure r1)
  3. Long-range evaluation on chromatin activity prediction (R1: W1) (Table r2)
  4. Evaluation of RNA sequence modeling ability (R2: Q2) (Table r3)
  5. Evaluation of larger HyenaDNA model (R3: Q3) (Table r4)
  6. Evaluation of different values of K in K-mer tokenization (R4: S2) (Table r5)

The results of t-SNE clustering are in the PDF file and the additional results tables are in both the PDF file and in the response.

Method Section Revision:

For the clarity problem of the method section, we modified the two subsections of the method section of the main paper, adding a higher-level description as follows:

Basic Units Recognition

Basic Units Scoring Initially, MxDNA identifies the basic units as building blocks of our tokens. Analogous to bounding boxes proposal in object detection, MxDNA estimates the probability of the existence of various sized basic units at each nucleotide position. This is achieved by a linear gating mechanism commonly employed in Mixture of Experts models. Following this, one-dimensional non-maximum suppression is applied to eliminate redundant proposals and select the most significant basic units.

Specifically, given the input nucleotide sequence $X \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length and $d$ is the hidden dimension, we first linearly score $X$ to produce $S \in \mathbb{R}^{l \times n}$, where $n$ represents the number of experts. This scoring incorporates multiplicative jitter noise, introducing the Ambiguous property. We then apply a modified non-maximum suppression to $S$, where $S_{ij}$ indicates the presence score of a basic unit of length $L_j$ centered at position $i$, and $L \in \mathbb{N}^n$ is a predefined set of lengths. The results are tracked using an expert mask $M \in \mathbb{N}^l$, where each $M_i$ is a natural number indicating the presence of a basic unit's center of length $M_i$ at position $i$.

Basic Units Embedding Subsequently, after estimating the existence of the basic units, we aggregate the nucleotides within each unit to form their embeddings. Given their advantage in capturing local features, convolution kernels of corresponding sizes are applied at the center of each basic unit. The initial scoring, followed by gating to specific convolution experts, is similar to the Mixture of Experts paradigm, though each expert here is a convolutional expert focusing on a specific segment rather than a single nucleotide.

Specifically, a basic unit at position $i$ of length $L_j = M_i$ is processed by the convolution expert $E_j$ with kernel size $L_j$, and weighted by $\text{softmax}(S_i)_j$, thus aggregating the nucleotides within the unit. This procedure transforms the original input $X \in \mathbb{R}^{l \times d}$ into an array of basic units $U \in \mathbb{R}^{l \times d}$:

Equation 1 in original paper.

To achieve sparse activation of convolution experts efficiently, we extract the nucleotides within basic units of the same length, apply the convolution operation with stride equal to kernel size, and place the output, with reduced length, back at the corresponding center positions. Then, the unwanted entries $\{ i \mid M_i = 0 \}$ are removed to keep only the basic units $U \in \mathbb{R}^{k \times d}$, where $k$ is the number of basic units.
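A rough PyTorch reconstruction of this grouped, stride-equals-kernel-size trick (our own sketch from the description; representing the experts as `nn.Conv1d` layers is an assumption):

```python
import torch
import torch.nn as nn

def sparse_expert_activation(x: torch.Tensor, mask: torch.Tensor, experts: dict) -> torch.Tensor:
    """Vectorised sparse activation of the convolution experts.

    x:       (l, d) nucleotide embeddings.
    mask:    (l,)   mask[i] = length of the unit centred at i (0 = no unit).
    experts: maps length k -> nn.Conv1d(d, d, kernel_size=k, stride=k).
    Units of the same length are gathered, convolved with one strided call,
    then scattered back to their centre positions; empty rows are dropped.
    """
    l, d = x.shape
    out = x.new_zeros(l, d)
    for k, conv in experts.items():
        centres = (mask == k).nonzero(as_tuple=True)[0]
        if centres.numel() == 0:
            continue
        starts = (centres - k // 2).clamp(0, l - k)
        windows = torch.stack([x[int(s):int(s) + k] for s in starts])   # (num, k, d)
        strip = windows.permute(2, 0, 1).reshape(1, d, -1)              # (1, d, num * k)
        out[centres] = conv(strip).squeeze(0).t()                       # (num, d) unit embeddings
    return out[mask > 0]                                                # keep the basic units only
```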

Basic Units Assembly

Distal Relation Estimation Building upon the identified basic units, we address the more complex genomic patterns that extend beyond simple segmentation through the application of one-dimensional deformable convolution. This technique uniquely accommodates the modeling of complex local geometric transformations, adaptively adjusting to the input sequence. The linkages between the distal basic units are initially modeled by the offsets and modulation factors of each basic unit.

Specifically, following Deformable Convolution, we compute offsets $\Delta P \in \mathbb{R}^{k \times f}$ and modulation factors $\Delta M \in \mathbb{R}^{k \times f}$ based on the basic units $U$ to model the distal relationships among them. This strategy ensures that the combination of distal basic units addresses the Discontinuous property and the reuse of basic units across different tokens meets the Overlapping property.

Final Tokens Embedding Utilizing the calculated offsets and modulations, we apply deformable convolution to embed the basic units into final tokens accordingly. The embedding process for each position incorporates deformations of the convolution kernel specified by the offsets, with the result modulated by the modulation factors.

Specifically, we apply a one-dimensional deformable convolution with kernel size $f$ to embed these basic units into the final learnt tokens $T \in \mathbb{R}^{k \times d}$. The token embedding for each position $i$ is formulated as:

Equation 2 in original paper.

For a fractional location $p' = i + p + \Delta p$, bilinear interpolation is applied as follows:

Equation 3 in original paper.
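In one dimension, the bilinear interpolation of Deformable ConvNets reduces to a two-neighbour linear blend; a minimal sketch of that sampling step (our own, not the authors' Equation 3):

```python
import torch

def sample_fractional(u: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Sample rows of u at fractional positions p by linear interpolation.

    u: (k, d) basic-unit embeddings; p: (m,) fractional indices into u.
    """
    p = p.clamp(0, u.shape[0] - 1)
    lo, hi = p.floor().long(), p.ceil().long()
    w = (p - lo.to(p.dtype)).unsqueeze(-1)      # distance to the lower neighbour
    return (1 - w) * u[lo] + w * u[hi]
```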

Final Decision

The work introduces a novel framework for learning DNA sequence tokenization. There is a clear consensus around the novelty of the work and its potential impact to the field of DNA-language model. The authors have also adequately addressed concerns related to the presentation and the inclusion of additional benchmarks, meeting the reviewers' expectations.