Gumbel-Softmax Discretization Constraint, Differentiable IDS Channel, and an IDS-Correcting Code for DNA Storage
Abstract
Reviews and Discussion
This paper investigates the use of Gumbel-Softmax as a discretization constraint within an autoencoder framework for designing insertion, deletion, and substitution (IDS) error-correcting codes, specifically targeting DNA data storage. Additionally, the paper presents a differentiable IDS channel that is designed to model the error patterns encountered in DNA storage, thus allowing the IDS-correcting code to be trained and tested under conditions mimicking IDS channels.
Strengths
- The paper explores Gumbel-Softmax as a discretization mechanism in a non-generative model, which is interesting.
- The differentiable IDS channel is interesting as it allows learning IDS codes tailored to specific IDS channels.
Weaknesses
While the paper proposes several contributions, they are somewhat loosely connected, giving the impression that they are brought together primarily to justify a full conference paper submission. The title of the paper, ``Gumbel-Softmax discretization constraint, differentiable IDS channel, and an IDS-correcting code for DNA storage,'' reflects this bundling.
The first contribution of using Gumbel-Softmax for discretization has some merit in itself but is not sufficiently substantial for a standalone publication (furthermore, I do not think that this contribution is entirely new, see Comment 3 below).
The remaining contributions focus on designing IDS-correcting codes for DNA storage. And in this sense, the paper is weak, showcasing notable limitations that impact its relevance and applicability in DNA storage.
Hence, in my opinion, the paper should be rejected. I motivate my decision in the following.
- Lack of realistic DNA storage channel modeling: As the authors are aware, the literature on error-correcting coding for DNA storage is extensive, and the paper rightly highlights the limitations of most works in this area (i.e., most works consider simplistic, unrealistic channel models). In this sense, designing error-correcting codes for a realistic DNA storage channel is both timely and relevant. However, this paper itself falls short of achieving this goal, i.e., it falls into the same trap (unrealistic channel models) that the authors criticize.
Specifically, the authors overlook several recent works on error correcting codes for realistic DNA channel models, see, e.g.,
Maarouf et al., ``Concatenated codes for multiple reads of a DNA sequence,'' IEEE Trans. Inf. Theory, 2023.
B. Hamoum, et al., ``Channel model with memory for DNA data storage with nanopore sequencing,'' ISTC 2021.
Srinivasavaradhan et al, ``Trellis BMA: Coded trace reconstruction on IDS channels for DNA storage,'' ISIT 2021.
and
Welter et al, "An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads," arXiv, June 2024.
Notably, the last two references include results for actual DNA synthesis and sequencing, and, to the best of my knowledge, the last paper presents the most realistic DNA storage channel to date.
In this sense, a key limitation is that the paper does not address a realistic DNA storage channel, which should include multiple reads, non-i.i.d. errors, and strand-dependent errors. Without these, the paper fails to demonstrate that the proposed IDS-correcting codes are applicable to DNA storage.
If the paper intends to target DNA storage applications, it is imperative to consider a realistic channel model. A realistic channel model (for Oxford nanopore sequencing, which is state-of-the-art) is described in:
Welter et al, "An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads," arXiv, June 2024.
- Absence of comparison with state-of-the-art codes: The paper does not compare the performance of the THEA-code against existing state-of-the-art IDS-correcting codes specifically designed for DNA storage. Without this comparison, it is difficult to assess whether the proposed codes achieve competitive performance (particularly for real-world DNA storage applications).
In particular, the authors should consider comparing the performance of their codes with the coding scheme in
Welter et al, "An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads," arXiv, June 2024.
and other existing schemes (see Figure 5 in the paper for references).
- The analysis of the discretization effect of applying Gumbel-Softmax in a non-generative model is interesting. However, while I am not an expert in the topic, I am not sure that this is entirely new. In particular, the authors claim "it is found that applying Gumbel-Softmax in a non-generative model induces the categorical distribution to resemble a one-hot vector". It appears to me that this novelty is overstated.
Indeed, I believe that this property was already known, see Jang et al. ``Categorical reparameterization with Gumbel-Softmax'' (see Figure 1, for example).
While the application of Gumbel-Softmax for IDS coding is new, the fundamental behavior of Gumbel-Softmax producing one-hot-like distributions in non-generative settings is not new.
Questions
- The authors should reword their experimental results section and compare the performance of their THEA-code to state-of-the-art ECCs for DNA storage. Without such a comparison, the relevance of this work is unclear.
- In a realistic DNA storage channel, IDS errors are non-i.i.d. and data-dependent. Can the differentiable IDS channel be extended to this setting?
We would like to thank the reviewer for his/her valuable comments, which will undoubtedly help improve the quality of our work. Below, we provide detailed responses to each of the raised concerns. We hope the reviewer will consider revising the rating to a more positive score based on these clarifications and improvements.
Q1: Reword the experimental results and have more comparative experiments.
We apologize for the confusion caused by the presentation of the experimental results. To better clarify the results and highlight the advancements made by the proposed method, we will revise the experimental section and include additional comparative experiments.
The reviewer may refer to the separate rebuttal Section Comparative Experiments below, as well as the revised PDF, which will be provided later, for further details.
Q2: Can the differentiable IDS channel be extended to realistic DNA storage channel?
This is an insightful question. It is widely believed that the biochemical procedures involved in DNA storage introduce errors that follow specific patterns or distributions, shaped by factors such as energy constraints or the avoidance of long homopolymer runs. However, there are few detailed, standardized descriptions of the exact principles governing these errors, likely because the underlying processes are complex, equipment-dependent, and vary across different experimental setups.
The answer to this question is yes: the differentiable IDS can be extended to any IDS channel. It only requires a customized generation method for the error profile corresponding to the specific IDS channel. The proposed differentiable IDS method accurately introduces errors into the sequence based on the predefined error profile. Once the error patterns of a given IDS channel are well defined, the profile can be generated according to these rules, and the differentiable IDS will faithfully replicate the behavior of that channel.
However, accurately defining the error rules is challenging. Can we apply the proposed differentiable IDS to undefined realistic channels? As briefly mentioned in the manuscript, generative deep learning models have been used to simulate how DNA sequences are transformed during DNA storage, such as in
- S. Kang et al., Generative adversarial networks for DNA storage channel simulator, 2023.
By incorporating the differentiable IDS as a module within such generative models, the generative model may learn to generate error profiles, rather than directly generating retrieved DNA sequences. This may alleviate the burden of the model. While this is a straightforward application of differentiable IDS, it falls outside the scope of this current work.
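To make this concrete, the sketch below (in NumPy) shows one way an error profile for a customized i.i.d. IDS channel could be sampled and then applied to a sequence by a reference, non-differentiable IDS operation of the kind the differentiable IDS is trained to mimic. The function names and the profile encoding here are illustrative assumptions rather than the exact implementation in the manuscript; a channel with structured (non-i.i.d.) error statistics would only need a different profile sampler.

```python
import numpy as np

BASES = ["A", "T", "G", "C"]

def sample_profile(length, p_ins=0.01, p_del=0.01, p_sub=0.01, rng=None):
    # One pass over the input positions; an insertion does not advance the
    # pointer, so the profile can be longer than the input sequence.
    rng = rng or np.random.default_rng()
    profile, i = [], 0
    while i < length:
        r = rng.random()
        if r < p_ins:
            profile.append(("ins", BASES[rng.integers(4)]))
        elif r < p_ins + p_del:
            profile.append(("del", None)); i += 1
        elif r < p_ins + p_del + p_sub:
            profile.append(("sub", BASES[rng.integers(4)])); i += 1
        else:
            profile.append(("keep", None)); i += 1
    return profile

def apply_profile(seq, profile):
    # Reference (non-differentiable) IDS operation that the differentiable
    # IDS channel is trained to reproduce for any given profile.
    out, i = [], 0
    for op, base in profile:
        if op == "ins":
            out.append(base)              # emit an extra base, pointer stays
        elif op == "del":
            i += 1                        # drop the current base
        elif op == "sub":
            out.append(base); i += 1      # replace the current base
        else:
            out.append(seq[i]); i += 1    # keep the current base
    return "".join(out)

rng = np.random.default_rng(0)
codeword = "".join(BASES[i] for i in rng.integers(0, 4, size=150))
received = apply_profile(codeword, sample_profile(len(codeword), rng=rng))
print(len(codeword), len(received))
```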
Comment 1. The proposed contributions are somewhat loosely connected.
We agree with the reviewer that, from the perspective of bioinformatics or DNA storage, the contributions may seem loosely connected. However, we believe that, from the standpoint of machine learning and representation learning, these contributions are essential and closely aligned with overcoming the challenges of developing an autoencoder-based IDS-correcting code.
The proposed Gumbel-Softmax Discretization Constraint induces a hard representation from the deep learning features. We believe that this approach helps bridge the gap between continuous deep models and discrete applications, and it could be extended to other tasks that face similar discretization challenges.
The proposed Differentiable IDS channel introduces a model that mimics IDS operations while preserving the ability for gradient back-propagation. First, it demonstrates that the Transformer can efficiently mimic IDS operations. Second, it serves as a general module for neural networks, applicable to various tasks where discrete IDS operations play a role.
We kindly remind the reviewer that, since this is a submission to a machine learning conference, the novelties should perhaps be re-evaluated from the perspectives of both the machine learning community and the DNA storage community.
Comment 2. Lack of realistic DNA storage channel modeling.
There is no doubt that the recommended papers highlight key trends in DNA storage. We apologize for not including them in our manuscript. We will ensure to properly cite them in the revised version.
As a next-generation information storage scheme, DNA storage is a complex, multi-step process. The key stages of its data processing pipeline are:
- Step 1: Digital raw signals from Oxford Nanopore sequencing;
- Step 2: A bucket of retrieved DNA sequences after basecalling from the raw signal;
- Step 3: Clustered DNA sequences from the retrieved DNA sequences;
- Step 4: Reconstructed DNA references estimated from clusters;
- Step 5: Decoded information derived from the reconstructed DNA references.
Each of these steps, or combinations of them, can serve as a research topic. For example,
- The deep learning-based basecaller, which tackles Steps 1-2,
- [1] Dorado: A neural network model for nanopore sequencing.
- The deep learning-based trace reconstruction based on raw signals, which tackles Steps 1-3,
- [2] Anonymous. Beyond the Alphabet: Deep Signal Embedding for Enhanced DNA Clustering, ICLR 2025 submission.
- The recommended works directly decode information from the sequence cluster without requiring an estimation of the reference sequence, which tackles Steps 3-5.
- [3] Maarouf, Issam, et al. Concatenated codes for multiple reads of a DNA sequence. IEEE TIT (2022).
- [4] Welter et al, An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads, arXiv, June 2024.
There is no doubt that the novel methods in [2-4], which propose end-to-end approaches for the combined procedures in Steps 1-5, are valuable. However, this should not imply that works addressing the individual steps are meaningless.
Our work focuses on proposing an IDS-correcting code that can be easily customized for specific IDS channel settings, addressing Step 5 mentioned above.
We acknowledge that the channels tested in this work are diverse but not fully realistic, and we have not considered the potential benefits of decoding multiple retrieved sequences.
However, we would like to clarify that the proposed method is not an end-to-end DNA storage work intended for bioinformatics journals. Rather, it is a machine learning study that focuses on the techniques proposed for developing an autoencoder-based IDS code. Moreover, conducting two rounds of wet-lab experiments—one for investigating channel characteristics and another for validating code performance—is beyond our current resources.
We kindly ask the reviewer to reconsider the novelty of this work from the perspectives of both the machine learning community and the DNA storage community, as this aligns more closely with the topics of ICLR.
Comment 3. Absence of comparison for assessing the proposed work.
We acknowledge our oversight in not providing sufficient comparison analysis in the original manuscript. To address this, we have added a section on comparison experiments to evaluate our work. Please refer to the rebuttal section Comparison Experiment below or the revised PDF, which will be provided later, for further details.
After a thorough search, we found that only a few works can be aligned with the same setting and offer publicly accessible implementations. Regarding the novel work suggested by the reviewer for comparison,
- Welter et al, An End-to-End Coding Scheme for DNA-Based Data Storage with Nanopore-Sequenced Reads, arXiv, June 2024.
this recommended method primarily focuses on decoding multiple retrieved sequences, which can be seen as an alternative to the conventional DNA trace reconstruction + IDS-correcting procedure. While related, it is not an identical task to ours, which specifically addresses IDS-correcting codes. Furthermore, the source code for this recommended work has not been released, and it would be too rushed to implement this method ourselves within the rebuttal period.
Moreover, according to the "AC Guidelines ICLR 2025", works published within the last four months before the submission deadline are not expected to be discussed in the submitted paper.
Q: Are authors expected to cite and discuss very recent work?
A: We consider papers contemporaneous if they are published within the last four months.
The recommended paper was submitted to arXiv in June 2024, within four months of the ICLR submission deadline of 1 October 2024.
We will still cite this related work properly.
I would like to thank the authors for their efforts in addressing my concerns and for providing additional details in the rebuttal. Unfortunately, my original assessment remains.
I offer some further feedback below.
Regarding your comment, "We kindly ask the reviewer to reconsider the novelty of this work from the perspectives of both the machine learning community and the DNA storage community, as this aligns more closely with the topics of ICLR":
Please rest assured that I evaluated the novelty of this work with both perspectives in mind. While I recognize that the paper makes some contributions and presents a creative approach, I believe these contributions are relatively modest in scope and, therefore, are better suited for a second-tier conference rather than a venue as prestigious as ICLR. To be clear, this is not a critique of the validity or execution of the paper--there is nothing fundamentally wrong with the work itself. Rather, it is a matter of whether the contribution is significant enough to merit publication at this level.
The contribution appears to be strongly tied to DNA storage, as emphasized in the title and narrative of the paper. In this context, I find the relevance of the work to be questionable, particularly when considering the extensive body of existing research on coding for DNA storage that addresses much more realistic channel models. My initial concerns remain unaddressed: although I appreciate the additional comparisons presented in the rebuttal, the work does not convincingly demonstrate its impact relative to the prior research on realistic DNA storage channel models.
I would like to highlight that testing your approach on a more realistic DNA storage channel does not require wet-lab experiments. It is feasible to simulate such channels computationally, making the absence of such evaluations in your current work a missed opportunity. This makes the comment in your rebuttal about the challenges of realistic channel assessment seem somewhat misplaced.
Finally, note that I also provided references to papers that were published well over four months before the submission deadline (by the same and other authors).
To summarize: while your approach has potential, it does not, in its current form, meet the threshold of contribution and impact necessary for ICLR. I encourage you to continue developing this line of work and perhaps submit it to a venue that is a better fit for the scope and relevance of your contributions.
Thank you for your comments. We will consider incorporating simulated channels in the next version.
Comment 4. Novelty of applying Gumbel-Softmax Discretization Constraint.
We apologize that our manuscript led to a misunderstanding by the reviewer.
The Gumbel-Softmax was originally introduced to enable sampling from categorical distributions in VAEs, thereby enhancing their generative capabilities. In such contexts, researchers typically encourage the distribution to maintain high entropy (i.e., avoid a one-hot style distribution) to prevent model degeneration.
In our case, we found that applying the Gumbel-Softmax effectively constrains its input vector to approximate a one-hot vector during training, and we provided a brief mathematical analysis to support this observation. This use diverges from the original sampling task, with the only similarity being the shared formula. To the best of our knowledge, no existing research has explored this specific approach.
It is true that Figure 1 in the original paper on applying Gumbel-Softmax in deep learning, "Categorical reparameterization with Gumbel-Softmax", appears to illustrate quantization. However, that figure is actually focused on how the temperature hyperparameter $\tau$ influences the sampling process. Specifically, if $y$ represents a sample drawn from the Gumbel-Softmax distribution, Figure 1 demonstrates how the temperature controls the 'hardness' of the sample $y$: from a hard, one-hot-like sample as $\tau \to 0$ to a soft, near-uniform sample as $\tau \to \infty$.
However, in our work, the Gumbel-Softmax is applied to induce its input to resemble a one-hot vector. When applied as a constraint in a non-generative network, the Gumbel-Softmax drives the input to a harder form. For instance, by applying the constraint, the input is more like the form (0, 1, 0, 0) rather than (0.2, 0.3, 0.1, 0.4), where the former corresponds directly to the nucleotide base T in ATGC, while the latter is difficult to interpret as a single base.
In summary, the role of Gumbel-Softmax in this work differs significantly from its original purpose. In fact, framing Gumbel-Softmax as a sampling method does not align with the proposed Gumbel-Softmax Discretization Constraint. We also intuitively suspect that introducing other sources of noise, aside from Gumbel-Softmax, would produce a similar effect in inducing hard features.
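As an illustration of this discretization effect (not the manuscript's exact training code; the temperature and the toy distributions below are illustrative assumptions), the following PyTorch sketch shows why routing the encoder's output logits through `torch.nn.functional.gumbel_softmax` during training pressures the underlying distribution toward a one-hot form: a soft distribution is frequently flipped to a different symbol by the Gumbel noise, inflating the downstream reconstruction loss, whereas a near-one-hot distribution is almost never flipped.

```python
import torch
import torch.nn.functional as F

def discretization_constraint(encoder_logits, tau=1.0):
    # encoder_logits: (batch, length, 4) scores over the DNA alphabet ATGC.
    # Training through this resampling step penalizes high-entropy positions.
    return F.gumbel_softmax(encoder_logits, tau=tau, hard=False, dim=-1)

# Toy demonstration: how often does the Gumbel noise flip the argmax symbol?
torch.manual_seed(0)
soft = torch.log(torch.tensor([[0.2, 0.3, 0.1, 0.4]]))      # ambiguous position
hard = torch.log(torch.tensor([[0.01, 0.97, 0.01, 0.01]]))  # near-one-hot position
for name, logits in [("soft", soft), ("hard", hard)]:
    samples = F.gumbel_softmax(logits.expand(10000, 4), tau=1.0, hard=True)
    flip_rate = (samples.argmax(-1) != logits.argmax(-1)).float().mean().item()
    print(f"{name}: flipped in {flip_rate:.1%} of samples")
```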
Comparison Experiment
We acknowledge our oversight in not providing sufficient comparative analysis in the original manuscript.
We would like to once again emphasize the challenges in conducting comparative experiments. Many related works are difficult to align with the same settings, and few have made their source code publicly available, often due to commercial reasons.
To evaluate the effectiveness of the proposed methods, we conducted comparison experiments against three prior works:
- A combinatorial code that can correct single IDS errors over a 4-ary alphabet:
- K. Cai et.al, Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE TIT
- A well-known, efficient heuristic method, HEDGES, in DNA storage:
- W.H. Press, et. al, HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. PNAS
- A segmented method for correcting multiple IDS errors, called DNA-LM:
- Z.H. Yan, et.al, A segmented-edit error-correcting code with resynchronization function for DNA-based storage systems. IEEE TETC
These methods typically offer only a few discrete, fixed configurations. We made efforts to align their settings as closely as possible. For DNA-LM, we maintained the codeword length around 150, adjusting the number of segments to match the code rates. For Cai's combinatorial code, the code rates are fixed by the code lengths. In our experiments, only the code rates are matched, with the lengths at each code rate set as follows: 21@0.33, 32@0.50, 48@0.67, 102@0.83. It is important to note that code length plays a critical role in these experiments, as longer codewords are more likely to encounter multiple errors that cannot be corrected. Thus, Cai's performance here serves only as a baseline statistic of multi-error behavior with respect to length, and its performance may degrade with increased length. For HEDGES, only the binary library is publicly available, and it supports fixed code rates of [0.75, 0.6, 0.5, 1/3, 0.25, 1/6]. HEDGES' inner code was tested independently for comparison.
The results are illustrated below and will be presented as a figure in the revised manuscript, as the code rates are not fully aligned. These numbers represent the decoding error rate at a 1% IDS channel error probability for these methods, with respect to the different code rates. In addition to the standard 1% error channel, we also conducted experiments on the ASC and DES channels.
| code rate | 0.33 | 0.50 | 0.6 | 0.67 | 0.75 | 0.83 |
|---|---|---|---|---|---|---|
| Cai | 0.44 | 1.00 | | 2.53 | | 8.65 |
| DNA-LM | 0.70 | 1.32 | | 3.11 | | 9.10 |
| HEDGES | 0.28 | 0.25 | 0.65 | | 3.43 | |
| THEA | 0.09 | 0.46 | | 1.06 | | 2.81 |
| Cai ASC | 0.11 | 0.41 | | 1.29 | | 5.72 |
| DNA-LM ASC | 0.93 | 1.71 | | 3.69 | | 10.20 |
| HEDGES ASC | 0.25 | 0.27 | 0.61 | | 2.89 | |
| THEA ASC | 0.05 | 0.30 | | 0.79 | | 1.83 |
| Cai DES | 0.48 | 1.49 | | 3.36 | | 11.10 |
| DNA-LM DES | 0.90 | 1.77 | | 4.03 | | 10.89 |
| HEDGES DES | 0.00 | 0.01 | 0.25 | | 2.00 | |
| THEA DES | 0.06 | 0.32 | | 0.94 | | 2.12 |
The results for Cai’s method indicate that directly applying classical combinatorial codes to a 1% IDS error probability channel with a codeword length of 150 is impractical. The observed error rates are high, even though these values were obtained with shorter code lengths than 150. The segmented method with sync markers in DNA-LM supports a codeword length of 150 and can correct multiple errors across different segments. However, it also exhibits a high error rate, indicating a nonnegligible likelihood of multi-errors occurring within the same segment. For HEDGES, while the results are commendable, the code rate is restricted to a limited set of fixed values. The results of THEA demonstrate the effectiveness of the proposed method. At lower code rates, THEA achieves a comparable error rate to HEDGES. At higher code rates, the proposed method outperforms HEDGES, achieving a lower error rate at a higher code rate, specifically 2.81%@0.83 for THEA vs 3.43%@0.75 for HEDGES.
Please see my answer to Rebuttal 2
Thank you for providing additional clarification regarding your use of the Gumbel-Softmax Discretization constraint. I appreciate the effort to explain its novelty and divergence from its original purpose. However, I believe this remains a relatively minor contribution.
This study explores a learning-based code design for the IDS channel. Given that the IDS channel is discrete, the authors first propose a differentiable IDS channel, using transformer-based models equipped with logical reasoning capabilities. Additionally, a Gumbel-Softmax discretization constraint is applied to enhance the discreteness of the codewords.
Strengths
The motivation for this work is clear, and its potential impact on the coding community could be substantial, as fully automated code/decoder design for discrete channels has not been extensively studied. The methods employed leverage recent advances, including transformer and Gumbel-softmax. Considering that practical DNA channels cannot be modeled with a rigorous mathematical framework, the ability to generate codes and decoders for arbitrary channels is highly meaningful and beneficial.
Weaknesses
Despite the creativity and necessity of this approach, evaluating the effectiveness of this work remains challenging. While I understand the difficulty of comparing results with other studies in DNA coding, this work lacks any comparative analysis with prior research. The authors discuss an ablation study and provide simulation results, but there is no assessment of the proposed methods' effectiveness compared to other works. In the abstract, the performance of the proposed codes is described vaguely as "commendable." Although the automation ability is undoubtedly advantageous, a discussion on relative performance and complexity is essential, given that this research addresses a channel coding problem.
Questions
Below are several questions I have (most of them may be due to my own misunderstandings, and I would be grateful for clarification to aid readers who may struggle to understand as I have):
- In Figure 1, the probability vector C is shown to have a length of 5. Why, then, does the error profile have a length of 6?
- The authors mention that the DIDS channel is implemented with a transformer-based model. It would be helpful to include the transformer-based model architecture for the DIDS channel in Figure 1, as there seems to be sufficient space to add it. Additionally, is the use of the transformer model essential, or merely effective? Would it be possible to employ other neural network architectures?
- In Section 5.2, I am unclear about the reason for introducing auxiliary reconstruction. Although Section 6.3 demonstrates the necessity of this method through simulations, the motivation behind it remains unclear.
- In Section 5.3, the authors mention that the encoder and decoder are also based on a transformer model. For the channel itself, using a transformer model might be acceptable, as the channel conditions are given in real-world applications, and thus the higher complexity of a transformer (relative to a typical neural network) is not problematic. However, for the encoder and decoder, using a transformer-based network could impose a significant complexity burden compared to the feed-forward neural networks commonly used in autoencoder research. Therefore, it would be beneficial to discuss whether a transformer model is strictly necessary for this purpose, and, if so, to provide an analysis of the associated complexity.
- In Section 5.5, it appears that the codeword c is represented as a one-hot vector during testing, which could result in an excessively low code rate. However, based on Section 6, where the lengths of the source and codeword are given as 100 and 150, respectively, it seems that the codeword is not a one-hot vector. It would be helpful if the authors could clarify this. In Line 404, the codeword is described as "behaving like a one-hot vector," which led me to believe it was one-hot encoded.
- What is the meaning of the color gradient in Figure 4? Does a darker color indicate a higher gradient? I am also struggling to interpret this figure. For example, does the "del(+3)" entry correspond to deletion at idx 0 (as mentioned in the caption)? If so, why does this imply a substitution at idx+3?
- Table 2 presents experimental results for the Asc and Des channels. Why does the Asc channel exhibit better performance? Since Asc and Des channels are symmetric, shouldn't they achieve similar performance?
We thank the reviewers for their valuable comments. Below, we provide detailed responses to address each concern, and we hope the reviewer will consider raising the ratings to a positive score.
Q1. Error profile has length larger than the codeword.
This is because a sync error, particularly an insertion, occurs without moving the index pointer to the next position. For example, inserting CCC at position 1 in the length-4 sequence ‘ATGC’ to obtain ‘A CCC TGC’ requires the length-7 profile (0, insert C, insert C, insert C, 0, 0, 0).
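A tiny sketch of this convention (with an illustrative profile encoding, not the exact one used in the implementation) reproduces the example above:

```python
def apply_error_profile(seq, profile):
    # 0 means "copy the current base and advance"; ("ins", X) emits X while the
    # pointer stays in place, which is why the profile can outgrow the input.
    out, i = [], 0
    for op in profile:
        if op == 0:
            out.append(seq[i])
            i += 1
        else:
            out.append(op[1])
    return "".join(out)

profile = [0, ("ins", "C"), ("ins", "C"), ("ins", "C"), 0, 0, 0]  # length 7
print(apply_error_profile("ATGC", profile))   # -> ACCCTGC (length 7)
```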
Q2. Why using transformer for DIDS?
We did not explore diverse architectures for DIDS because the transformer-based model is currently the most suitable choice for our data structure. As mentioned in our response to Q1, the error profile often differs in length from the codeword, creating asynchrony that many neural network architectures, such as CNNs, struggle to handle. In contrast, the transformer model naturally captures global relationships within the input and is better suited for handling multi-input scenarios, such as the codeword and profile, through individual embeddings. Furthermore, as the reviewer noted, this DIDS model is effective. Therefore, we opted to use this reasonable and effective model without exploring alternative architectures, as doing so would deviate from the primary focus of this work.
We will revise the manuscript to provide a clearer explanation of how the transformer model works.
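As a rough illustration of what we mean (the dimensions, operation vocabulary, and output convention below are our illustrative assumptions, not the exact architecture in the manuscript), a transformer-based DIDS can embed the codeword and the error profile separately, let self-attention relate the two, and read out one output symbol per profile slot, with a padding class so that deletions can shorten the realized output:

```python
import torch
import torch.nn as nn

class DIDSSketch(nn.Module):
    """Illustrative skeleton of a differentiable IDS channel.

    codeword: (B, L, 4) probability / one-hot vectors over ATGC
    profile:  (B, P) integer ids encoding keep / del / sub-X / ins-X operations
    output:   (B, P, 5) probabilities over ATGC + PAD, one per profile slot
    (positional encodings are omitted here for brevity).
    """
    def __init__(self, n_ops=10, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.base_proj = nn.Linear(4, d_model)        # soft codeword embedding
        self.op_embed = nn.Embedding(n_ops, d_model)  # error-profile embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 5)             # ATGC + PAD

    def forward(self, codeword, profile):
        x = torch.cat([self.base_proj(codeword), self.op_embed(profile)], dim=1)
        h = self.encoder(x)                    # attention links bases and ops
        h = h[:, codeword.size(1):, :]         # read out at the profile slots
        return self.head(h).softmax(dim=-1)

# Shape check with random inputs.
dids = DIDSSketch()
cw = torch.softmax(torch.randn(2, 150, 4), dim=-1)
pf = torch.randint(0, 10, (2, 155))
print(dids(cw, pf).shape)   # torch.Size([2, 155, 5])
```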
Q3. Why using the auxiliary reconstruction?
As mentioned in Line 287 and Section 5.2, the autoencoder (AE) does not naturally align with error-correcting codes (ECC). While conventional AEs focus on feature compression, ECCs are designed to generate parity checks. As demonstrated in the ablation study, directly optimizing Equation (15) does not enable the AE to behave as an ECC in our experiments.
Inspired by systematic codes, which embed the input message within the codeword, we introduced an auxiliary reconstruction task to enable the encoder to replicate the input message. As for whether the resulting code is systematic or not, it is determined by the model during training and is challenging to confirm. Variations in base mappings, permuted messages, and positional choices make it impractical to verify all possible systematic cases exhaustively.
Q4. Why transformer-based encoder/decoder. & Time Complexity.
The reasoning behind using a transformer-based encoder/decoder is similar to and partially addressed in our response to Q2. First, the model needs to handle inputs of variable lengths, as sync errors can alter the codeword length. Second, we aim to leverage the transformer’s logical capabilities, which may be suited to mimic some of the combinatorial rules underlying established codes, such as the VT code.
It is well known that the transformer has $O(n^2)$ complexity in the sequence length $n$. We acknowledge the limitation of not exploring the numerous efficient transformer variants. However, we wish to emphasize the novelties of our contributions, which make AE-based IDS-correcting codes possible, as they are: the Gumbel-Softmax Discretization Constraint and the Differentiable IDS channel. As this is an emerging topic, there are many potential directions to explore, and it may not be feasible to investigate all of them within a single work.
Given the parallelization capabilities of GPUs and the codeword length limited by current biotechnology, the transformer's complexity results in acceptable time consumption. We decoded 1,280,000 codewords on an RTX 3090, with the results listed below (the encoder, which shares the same structure, exhibits similar performance):
| message length | 50 | 75 | 100 | 125 |
|---|---|---|---|---|
| time (s) | 521.94 | 573.87 | 623.92 | 687.76 |
This table will be included in the revised PDF.
Q5. The one-hot vector sequence and the letter sequence.
We apologize for the misunderstanding and will revise the manuscript for clarity.
The output of the encoder and the input of the decoder are represented as probability vectors or one-hot vectors. This setup allows gradient backpropagation, enabling joint optimization of the encoder and decoder. If letter representations were used between the encoder and decoder in the training phase, the letter embedding in the decoder would block the gradient.
During testing, the encoder output is discretized by an argmax over each probability vector before being input into the decoder, yielding the codeword in the familiar letter-sequence form. In Line 404, the Gumbel-Softmax constraint is applied to force the encoder's output from a probability vector to a more one-hot-like vector. This constraint eliminates the discrepancy between the continuous encoder output and its discretized form during testing (if the output is already one-hot, the argmax leaves it unchanged; if it is not, the argmax alters it and creates a training/testing mismatch). Thus, this constraint is a key contribution of our work.
Q6. Gradient with regard to DIDS.
We apologize for the misunderstanding and will revise the manuscript for clarity.
Yes, in Figure 4, darker shading indicates a higher absolute value of the gradient. In the del(+3) case, the deletion occurs at idx 0, and the gradient is applied to the output of the DIDS at idx+3. In this subfigure, the back-propagated gradient is highlighted at position idx+4 with respect to the input, which is reasonable: because the DIDS deleted the letter at idx 0, all remaining letters shift one index to the left in the DIDS output.
Q7. Different results between ASC and DES.
Although the transformer provides a global view, the sequence itself has distinct endpoints when processed by a computer. It must be acknowledged that both DIDS and the transformer’s positional embedding—whether learnable or not—impose a directional bias once the data is input starting from index 0. Meanwhile, sync errors accumulate asynchrony along the sequence. As a result, DES is more likely to exhibit errors at the beginning of the sequence, making it a more challenging channel than ASC under the proposed setting.
Comparative Experiment
We acknowledge our oversight in not providing sufficient comparative analysis in the original manuscript.
We would like to once again emphasize the challenges in conducting comparative experiments. Many related works are difficult to align with the same settings, and few have made their source code publicly available, often due to commercial reasons.
To evaluate the effectiveness of the proposed methods, we conducted comparative experiments against three prior works:
- A combinatorial code that can correct single IDS errors over a 4-ary alphabet:
- K. Cai et.al, Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE TIT
- A well-known, efficient heuristic method, HEDGES, in DNA storage:
- W.H. Press, et. al, HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. PNAS
- A segmented method for correcting multiple IDS errors, called DNA-LM:
- Z.H. Yan, et.al, A segmented-edit error-correcting code with resynchronization function for DNA-based storage systems. IEEE TETC
These methods typically offer only a few discrete, fixed configurations. We made efforts to align their settings as closely as possible. For DNA-LM, we maintained the codeword length around 150, adjusting the number of segments to match the code rates. For Cai's combinatorial code, the code rates are fixed by the code lengths. In our experiments, only the code rates are matched, with the lengths at each code rate set as follows: 21@0.33, 32@0.50, 48@0.67, 102@0.83. It is important to note that code length plays a critical role in these experiments, as longer codewords are more likely to encounter multiple errors that cannot be corrected. Thus, Cai's performance here serves only as a baseline statistic of multi-error behavior with respect to length, and its performance may degrade with increased length. For HEDGES, only the binary library is publicly available, and it supports fixed code rates of [0.75, 0.6, 0.5, 1/3, 0.25, 1/6]. HEDGES' inner code was tested independently for comparison.
The results are illustrated below and will be presented as a figure in the revised manuscript, as the code rates are not fully aligned. These numbers represent the decoding error rate at a 1% IDS channel error probability for these methods, with respect to the different code rates. In addition to the standard 1% error channel, we also conducted experiments on the ASC and DES channels.
| code rate | 0.33 | 0.50 | 0.6 | 0.67 | 0.75 | 0.83 |
|---|---|---|---|---|---|---|
| Cai | 0.44 | 1.00 | | 2.53 | | 8.65 |
| DNA-LM | 0.70 | 1.32 | | 3.11 | | 9.10 |
| HEDGES | 0.28 | 0.25 | 0.65 | | 3.43 | |
| THEA | 0.09 | 0.46 | | 1.06 | | 2.81 |
| Cai ASC | 0.11 | 0.41 | | 1.29 | | 5.72 |
| DNA-LM ASC | 0.93 | 1.71 | | 3.69 | | 10.20 |
| HEDGES ASC | 0.25 | 0.27 | 0.61 | | 2.89 | |
| THEA ASC | 0.05 | 0.30 | | 0.79 | | 1.83 |
| Cai DES | 0.48 | 1.49 | | 3.36 | | 11.10 |
| DNA-LM DES | 0.90 | 1.77 | | 4.03 | | 10.89 |
| HEDGES DES | 0.00 | 0.01 | 0.25 | | 2.00 | |
| THEA DES | 0.06 | 0.32 | | 0.94 | | 2.12 |
The results for Cai’s method indicate that directly applying classical combinatorial codes to a 1% IDS error probability channel with a codeword length of 150 is impractical. The observed error rates are high, even though these values were obtained with shorter code lengths than 150. The segmented method with sync markers in DNA-LM supports a codeword length of 150 and can correct multiple errors across different segments. However, it also exhibits a high error rate, indicating a nonnegligible likelihood of multi-errors occurring within the same segment. For HEDGES, while the results are commendable, the code rate is restricted to a limited set of fixed values. The results of THEA demonstrate the effectiveness of the proposed method. At lower code rates, THEA achieves a comparable error rate to HEDGES. At higher code rates, the proposed method outperforms HEDGES, achieving a lower error rate at a higher code rate, specifically 2.81%@0.83 for THEA vs 3.43%@0.75 for HEDGES.
Thank you for providing your response. Although the authors offer adequate answers, I still lack confidence that this work is suitable for presentation at a top-tier venue like ICLR. The use of the transformer architecture feels unnatural and seems disconnected from prior works rather than sequentially building upon them. I believe the paper would have been stronger if it started with a fundamental auto-encoder architecture and progressively demonstrated how their proposed methods design a code suitable for the DNA channel, eventually applying the transformer architecture later in the narrative. Additionally, I still have doubts about whether the comparisons with existing methods are fair. For these reasons, I will maintain my original score.
We would like to thank the reviewer for the thoughtful feedback in this new round. Below is our response.
Comment 1. The use of the transformer architecture feels unnatural and seems disconnected from prior works rather than sequentially building upon them.
Transformer-based approaches have been applied to error correction codes in many works, demonstrating strong performance. For example, ECCT in NeurIPS 2022 and subsequent works in ICLR 2023, ICLR 2024, AAAI 2024, and ICML 2024 have shown the effectiveness of such architectures. Meanwhile, given the logical capabilities of Transformers, we believe it is a straightforward choice to apply Transformer architectures to our IDS-correcting code.
As highlighted in the manuscript, there are very few works addressing multiple IDS error correction using deep learning methods. Most existing approaches are heuristic. It is challenging to identify prior work that our method sequentially builds upon. However, we argue that even in the absence of a direct predecessor, our proposed method demonstrates outstanding performance. Notably, the proposed approach can generate channel-specific codes, a capability that is beyond most existing codes.
Comment 2. I still have doubts about whether the comparisons with existing methods are fair.
It is true that the comparison to the combinatorial single error-correcting code is unfair. The combinatorial code has a shorter sequence length, which reduces the likelihood of encountering multiple errors and can artificially inflate its performance in such comparisons. However, this unfair comparison actually highlights the strength of the proposed method. Even under these conditions, where the single error-correcting code has an advantage, the proposed approach still outperforms the combinatorial codes.
In fact, single IDS-correcting codes fail when encountering more than one error. The NER of such combinatorial codes can be analytically derived without requiring experiments. For example, consider 1% deletion on a 150-length sequence. The probability that more than one error occurs is approximately $1 - 0.99^{150} - 150 \cdot 0.01 \cdot 0.99^{149} \approx 44\%$. Treating the received sequence as the decoded sequence keeps the deletion shift, which introduces nearly half-length bits as errors on average. The final NER will be on the order of 10%.
This paper deals with coding for DNA storage. This problem is very popular in the research community as DNA-based storage has a huge capacity. The sequencing operation is error-prone, and the errors we observe are not the usual substitution errors but insertions and deletions. So, in order to deal with such errors, new coding methods are required. Actually, such errors were already considered in the literature by Varshamov, Tenengolts, Levenshtein and others; these errors are a particular case of so-called asymmetric errors. But the problem is solved only for insertion or deletion of 1 symbol. The authors address the problem and propose new autoencoder-based codes that can deal with insertions, deletions and substitutions (IDS).
Strengths
The main strong point are as follows:
- the authors propose to use end-to-end autoencoder-based solution as an as an IDS-correcting code
- new differentiable channel model, called DIDS channel
- the use of Gumbel-Softmax for error-correcting coding task
- the final result is the code that can deal with IDS - to the best of my knowledge this is the first such code in the literature
Weaknesses
I list the main weaknesses below:
- I would suggest extending the section related to experiments and comparisons. The authors only present results for their code at an error probability of 1%. In this situation it is extremely difficult to understand how good the obtained codes are. What will change for larger error probabilities?
- «combinatorial codes (Cai et al., 2021; Gabrys et al., 2023) typically correct a single error rather than errors that can occur at multiple positions». But this is standard in the error-correcting literature. E.g., BCH codes are designed to correct t errors but are applied in channels with random errors. If the number of errors is greater than t, you just declare a failure and take the received sequence as output.
- I would ask the authors to make a comparison to classical combinatorial codes. E.g., there are VT codes that can deal with 1 deletion. Could you please compare your codes and VT codes in a channel with random deletions? I believe they will work fine for 1% deletion probability and length 150.
- Please provide some analysis of the THEA-Code, e.g., the distance (Levenshtein distance). Also, it would be beneficial to compare its cardinality with fundamental bounds from the literature (the capacity of the deletion channel is not known, but bounds do exist).
- The usual problem for DNN-based decoding is the curse of dimensionality. In other words you cannot show all the codewords to the encoder/decoder. How did you solve the problem and how your decoder works on the codewords that were not shown to it during the training process?
- What is the purpose to introduce NER*?
Questions
See weaknesses
We are thankful for the reviewer’s valuable comments. Below, we provide detailed responses to address each concern. We hope the reviewer will consider raising the score to a positive rating.
Q1: Comparison experiments. & What will change for larger error probabilities?
We agree that providing results for a broader range of experimental settings is essential. As suggested by the reviewer, we have extended the experiments in two aspects.
First, we conducted comparison experiments with the combinatorial code from Cai, the segmented code DNA-LM, and the well-known heuristic code HEDGES, despite the challenges in aligning these methods under the same settings. We have a separate rebuttal section on this; the reviewer may refer to the Comparison Experiments section below.
Second, we extended our experiments to include channels with error probabilities in (0.5%, 1%, 2%, 4%, 8%, 16%) at a typical code rate of 0.67 (100/150). The NERs are presented in the following table, where each cell represents the NER of a model trained on the channel specified by the row header and tested on the channel specified by the column header.
The results demonstrate that the proposed method adapts across different IDS error probabilities. As expected, models trained on channels with higher error probabilities remain compatible with channels with lower error probabilities. In most cases, models trained and tested on similar channels achieve better performance. We will revise the manuscript to add these results.
| trained on \ tested on | 0.5% | 1% | 2% | 4% | 8% | 16% |
|---|---|---|---|---|---|---|
| 0.5% | 0.68 | 1.59 | 4.26 | 11.67 | 26.87 | 45.61 |
| 1% | 0.52 | 1.15 | 2.90 | 8.12 | 21.19 | 41.03 |
| 2% | 0.67 | 1.43 | 3.16 | 7.79 | 18.70 | 36.89 |
| 4% | 1.25 | 1.76 | 2.88 | 5.53 | 12.39 | 28.31 |
| 8% | 2.74 | 3.24 | 4.30 | 6.62 | 12.20 | 25.41 |
| 16% | 11.57 | 11.93 | 12.61 | 14.40 | 17.22 | 25.51 |
Q2&3: Is applying the combinatorial single IDS correcting code directly good enough?
We apologize for not providing a clear comparison in the original manuscript.
When applying a single IDS-correcting code and treating the received sequence as the decoded sequence upon failed decoding, the performance is directly related to the probability of multiple errors occurring. For example, consider 1% deletion on a 150-length sequence. The probability that more than one error occurs is approximately $1 - 0.99^{150} - 150 \cdot 0.01 \cdot 0.99^{149} \approx 44\%$. Moreover, treating the received sequence as the decoded sequence keeps the deletion shift, which introduces nearly half-length bits as errors on average. The final NER will be on the order of 10%.
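As a quick sanity check of this estimate, the probability of more than one deletion at a 1% per-base rate over 150 positions can be computed directly (plain Python, standard library only):

```python
n, p = 150, 0.01                       # codeword length, per-base deletion rate
p0 = (1 - p) ** n                      # probability of no deletion
p1 = n * p * (1 - p) ** (n - 1)        # probability of exactly one deletion
print(f"P(more than one deletion) = {1 - p0 - p1:.3f}")   # ~0.443
```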
We conducted experiments on both directly applying a combinatorial code and a segmented code. The results are presented in the following rebuttal Section Comparison Experiment. The numbers suggest that the performance of directly applying a combinatorial code is suboptimal.
Regarding the conventional VT code, we anticipate results similar to those observed with Cai’s code. This expectation arises because Cai’s code is an extension of the VT code, designed to correct single IDS errors over a 4-ary alphabet. Cai’s code and the VT code behave the same in that they fail under multiple IDS errors and multiple indel errors, respectively.
Q4: Analysis of the THEA code, and fundamental bounds.
We acknowledge that fundamental bounds and analyses, such as the minimum distance of the code, are important and a primary concern for researchers in coding theory.
The proposed code does not have the theoretical guarantees in correcting a single IDS error like the combinatorial codes. In fact, we observed instances where two codewords exhibit a Levenshtein distance of 2, which prevents the code from reliably correcting a single IDS error.
However, we emphasize that our goal is not to design theoretically guaranteed codes using deep learning. Instead, we aim to develop a code that achieves low NER when decoding messages through multiple-error channels. It is also challenging to pursue a theoretically guaranteed code via deep learning, as deep learning is known to operate as a black-box mechanism and is inherently difficult to explain.
Moreover, obtaining statistics on the distances between codewords is challenging due to the curse of dimensionality. In our typical setting, the length of the codeword is 150, and the cardinality of the codebook is an astronomical $4^{100} \approx 1.6 \times 10^{60}$. This immense cardinality makes it extremely unlikely to encounter similar codewords, further complicating efforts to analyze their distances comprehensively.
However, we still conducted experiments to gain some insights into the THEA code. In order to collect similar codewords, we generated a bucket of messages applying k-(substitution, insertion, or deletion) operations to the same original message. We then measured the Levenshtein distances between the corresponding codewords of these perturbed messages. The mean and standard deviation of these distances are summarized below.
| diff messages | no relation | 1-sub | 2-sub | 1-ins | 2-ins | 1-del | 2-del | 1-IDS | 2-IDS |
|---|---|---|---|---|---|---|---|---|---|
| diff codewords (in L-distance) | 82.56±2.91 | 1.50±0.64 | 2.95±0.87 | 55.30±25.07 | 66.61±17.15 | 50.79±24.05 | 66.73±16.61 | 33.32±28.63 | 44.54±28.91 |
It can be observed that two unrelated codewords typically have a Levenshtein distance of approximately 80. When messages have a smaller Hamming distance, their corresponding codewords also exhibit a lower Levenshtein distance. However, when messages are perturbed by insertions or deletions, their codewords are significantly farther apart in Levenshtein distance. This phenomenon may be explained by the fact that the model is trained by optimizing the reconstruction loss based on cross-entropy. Synchronization errors (indels) in the input message can cause a substantial increase in reconstruction loss, prompting the model to assign more distinct codewords to handle such cases effectively.
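A standard dynamic-programming Levenshtein routine, such as the minimal sketch below (illustrative, not the exact script used for the table), suffices to reproduce this kind of measurement:

```python
def levenshtein(a, b):
    # Edit distance with unit-cost insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("ATGC", "ACCCTGC"))   # 3: the three inserted bases
```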
Q5: How does decoder work on codewords never saw in training?
This is an insightful question. If the decoder performs well, there are two possible hypotheses.
- Memory Hypothesis: The model might simply "remember" all the codewords and decode the message by leveraging this memory. In this case, the decoder's complexity would scale directly with the size of the codebook.
- Logic Hypothesis: The model may have learned an intrinsic logic or pattern that governs the codewords, allowing it to decode messages based on this abstract understanding.
How can we determine which hypothesis our model follows? This is a challenging task, as deep learning models are notoriously regarded as black-box mechanisms.
However, in our case, we affirm that the THEA code relies on abstract logic knowledge. This conclusion is based on the astronomical cardinality of the codewords.
Taking message length 100 as an example, there are an astronomical $4^{100} \approx 1.6 \times 10^{60}$ codewords in the THEA code, a quantity surpassing the estimated number of atoms in the entire solar system. It is impossible to feed all the different samples to the model during training. In practice, the training and testing samples are generated randomly. By using different random seeds, these samples hardly overlap in this astronomical space, which means nearly all testing samples are unseen by the model during training.
The performance of testing samples is nearly identical to the performance of training samples (actually, even the training samples rarely overlap themselves since we always generate new samples). These observations strongly indicate that the model does not rely on the hypothesis of "remembering" codewords. This suggests that the model develops logic knowledge during training, enabling it to generalize effectively to new inputs.
Q6: What is the purpose to introduce NER*?
It is well known that deep learning models exhibit inherent randomness, with different training attempts often yielding models with varying performance. In Tables 1 & 2, the NER results are reported as the mean ± standard deviation across 5 runs to demonstrate the model’s stability across different trials.
However, as an ECC, once the code is established, its performance is fixed and can be applied without further tuning. Therefore, we believe that the best performance reported in NER* is also valuable, as it provides a lower bound for the model’s potential best performance. This is somewhat similar to the choice of the parameter $a$ in the VT code, where different choices of $a$ result in slightly different code rates.
Comparison Experiments
We acknowledge our oversight in not providing sufficient comparative analysis in the original manuscript.
We would like to once again emphasize the challenges in conducting comparative experiments. Many related works are difficult to align with the same settings, and few have made their source code publicly available, often due to commercial reasons.
To evaluate the effectiveness of the proposed methods, we conducted comparative experiments against three prior works:
- A combinatorial code that can correct single IDS errors over a 4-ary alphabet:
- K. Cai et.al, Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE TIT
- A well-known, efficient heuristic method, HEDGES, in DNA storage:
- W.H. Press, et. al, HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. PNAS
- A segmented method for correcting multiple IDS errors, called DNA-LM:
- Z.H. Yan, et.al, A segmented-edit error-correcting code with resynchronization function for DNA-based storage systems. IEEE TETC
These methods typically offer only a few discrete, fixed configurations. We made efforts to align their settings as closely as possible. For DNA-LM, we maintained the codeword length around 150, adjusting the number of segments to match the code rates. For Cai's combinatorial code, the code rates are fixed by the code lengths. In our experiments, only the code rates are matched, with the lengths at each code rate set as follows: 21@0.33, 32@0.50, 48@0.67, 102@0.83. It is important to note that code length plays a critical role in these experiments, as longer codewords are more likely to encounter multiple errors that cannot be corrected. Thus, Cai's performance here serves only as a baseline statistic of multi-error behavior with respect to length, and its performance may degrade with increased length. For HEDGES, only the binary library is publicly available, and it supports fixed code rates of [0.75, 0.6, 0.5, 1/3, 0.25, 1/6]. HEDGES' inner code was tested independently for comparison.
The results are illustrated below and will be presented as a figure in the revised manuscript, as the code rates are not fully aligned. These numbers represent the decoding error rate at a 1% IDS channel error probability for these methods, with respect to the different code rates. In addition to the standard 1% error channel, we also conducted experiments on the ASC and DES channels.
| code rate | 0.33 | 0.50 | 0.6 | 0.67 | 0.75 | 0.83 |
|---|---|---|---|---|---|---|
| Cai | 0.44 | 1.00 | | 2.53 | | 8.65 |
| DNA-LM | 0.70 | 1.32 | | 3.11 | | 9.10 |
| HEDGES | 0.28 | 0.25 | 0.65 | | 3.43 | |
| THEA | 0.09 | 0.46 | | 1.06 | | 2.81 |
| Cai ASC | 0.11 | 0.41 | | 1.29 | | 5.72 |
| DNA-LM ASC | 0.93 | 1.71 | | 3.69 | | 10.20 |
| HEDGES ASC | 0.25 | 0.27 | 0.61 | | 2.89 | |
| THEA ASC | 0.05 | 0.30 | | 0.79 | | 1.83 |
| Cai DES | 0.48 | 1.49 | | 3.36 | | 11.10 |
| DNA-LM DES | 0.90 | 1.77 | | 4.03 | | 10.89 |
| HEDGES DES | 0.00 | 0.01 | 0.25 | | 2.00 | |
| THEA DES | 0.06 | 0.32 | | 0.94 | | 2.12 |
The results for Cai’s method indicate that directly applying classical combinatorial codes to a 1% IDS error probability channel with a codeword length of 150 is impractical. The observed error rates are high, even though these values were obtained with shorter code lengths than 150. The segmented method with sync markers in DNA-LM supports a codeword length of 150 and can correct multiple errors across different segments. However, it also exhibits a high error rate, indicating a nonnegligible likelihood of multi-errors occurring within the same segment. For HEDGES, while the results are commendable, the code rate is restricted to a limited set of fixed values. The results of THEA demonstrate the effectiveness of the proposed method. At lower code rates, THEA achieves a comparable error rate to HEDGES. At higher code rates, the proposed method outperforms HEDGES, achieving a lower error rate at a higher code rate, specifically 2.81%@0.83 for THEA vs 3.43%@0.75 for HEDGES.
I would like to thank the authors for providing the extended numerical results and comparisons! The results are indeed interesting.
At the same time, the main question (generalization to the case of unseen codewords) is still not answered. Let me explain the concern. Exactly the same problem arose for coding for the AWGN channel, and the answer was as follows: the neural network cannot generalize, i.e., it works very badly on unseen codewords. Could you please investigate this question in more detail? What is different in your setup?
I will keep the current rating until this question is addressed.
We sincerely thank the reviewer for the response.
We believe the main question regarding generalization is addressed in Rebuttal 3: Q5. The total number of possible messages or codewords in our setup is an astronomical $4^{100} \approx 1.6 \times 10^{60}$. To generate the training and testing samples, we used a pseudo-random algorithm provided by NumPy, with different random seeds for each set. This ensures that the training and testing samples are separated within the vast sample space, meaning that the testing samples consist almost entirely of unseen codewords. Our experimental results demonstrate that the proposed method performs well on these unseen codewords, without any decline in performance during testing.
To address the concern regarding potential overlap between the training and testing samples within this vast space, we conducted a new round of training and testing with an explicit separation of the two sets after receiving this comment. Specifically, we converted each sample from a NumPy array to a string, computed a checksum of the string, and assigned samples according to the checksum value modulo a fixed divisor: samples in one residue class were designated for training, and those in the complementary classes for testing. This setup guarantees a strict separation between the training and testing datasets, i.e., the model is evaluated on entirely unseen samples.
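A sketch of this checksum-based split is shown below; since the exact conversion and checksum functions are not reproduced here, zlib.crc32 over the string form and a modulo-5 rule are used as illustrative stand-ins.

```python
import zlib
import numpy as np

def split_by_checksum(msg: np.ndarray, divisor: int = 5) -> str:
    """Route a message to 'train' or 'test' based on a checksum of its string form.

    Illustrative stand-in: the checksum function and residue rule used in the
    actual experiments may differ, but any fixed deterministic rule yields a
    strict, non-overlapping partition of the sample space.
    """
    key = "".join(map(str, msg.tolist()))        # NumPy array -> string
    checksum = zlib.crc32(key.encode("utf-8"))
    return "train" if checksum % divisor != 0 else "test"

rng = np.random.default_rng(seed=42)
msg = rng.integers(0, 4, size=100)
print(split_by_checksum(msg))                    # 'train' or 'test'
```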
We conducted 3 rounds of training and testing with this configuration. The results, listed below, show no statistically significant difference from Table 1 in the manuscript.
| Round | #1 | #2 | #3 |
|---|---|---|---|
| NER(%) | 1.07 | 1.53 | 1.11 |
Dear Reviewer tENY,
As the discussion period is ending soon, if this concern remains unresolved, we welcome any additional feedback on the question of generalization to unseen codewords or on any other issues.
We sincerely thank you for your valuable comments, which have greatly helped improve our manuscript.
The Authors
The paper introduces THEA-Code, an autoencoder-based method for designing error-correcting codes for DNA storage systems facing insertion, deletion, and substitution (IDS) errors. To handle the difficulties posed by IDS channels with complex error patterns, the authors propose two key innovations: a Gumbel-Softmax discretization constraint that discretizes the features of the autoencoder, and a differentiable transformer-based model that approximates the IDS channel. These techniques make it possible to generate tailored IDS-correcting codes that adapt to changing channel conditions, providing better performance for DNA storage applications.
Strengths
Novelty: The paper proposes a new method by introducing a Gumbel-Softmax discretization constraint into an autoencoder for IDS-correcting code generation.
Quality: The proposed differentiable IDS channel is a new concept that enables gradient-based optimization in the presence of non-differentiable IDS operations.
Significance: Tailoring error-correcting codes to individual IDS channels could have a substantial impact on DNA storage technology, which must cope with complicated error patterns.
Clarity: The paper is clear about the proposed methods and experimental setup, which makes the contributions and results easy to follow.
Weaknesses
Evaluation Depth: The experimental results could be more thorough. Comparisons with existing methods are mentioned but not detailed, so it is hard to fully evaluate the relative performance.
Generalization: The manuscript only considers specific settings for the IDS channels. It would be more convincing to show that THEA-Code performs well over a broader range of realistic DNA storage scenarios.
Ablation Studies: Some ablation studies are presented, but a deeper analysis of how different hyperparameters affect the model would give further insight into its behavior.
Questions
Can the authors provide more detailed comparisons with existing IDS-correcting codes, both combinatorial and heuristic, under the same experimental conditions?
How does THEA-Code perform on real-world DNA sequencing data, which may have error patterns more diverse than those simulated?
Could this approach scale to the longer sequences more commonly used in DNA storage, and what might the challenges be?
We sincerely thank the reviewer for their valuable comments. Below, we provide detailed responses to address each concern. We hope these clarifications and additional insights will encourage the reviewer to consider raising the rating score.
Q1. Comparisons with existing methods.
We acknowledge our oversight in not including sufficient comparison analysis in the original manuscript. Although aligning experimental settings for established codes was challenging, we conducted comparative experiments involving Cai’s combinatorial code, the segmented code DNA-LM, and the well-known heuristic code HEDGES. The results demonstrate the commendable performance of the proposed method, which we believe further validates its effectiveness.
The reviewer may refer to the separately provided rebuttal section, Comparison Experiment, or to the revised PDF, which will be submitted subsequently.
Q2. Generalization to broader IDS channel settings. & Real-world DNA sequencing data.
As demonstrated in the manuscript, the proposed method is designed to benefit from customizing codes for specific IDS channel settings. While it is true that a single trained model may not be optimal for all IDS channel settings—indeed, no code can achieve that—the proposed approach offers a method for the rapid development of customized codes tailored to different settings.
Prompted by the reviewer's concern, we realized that limiting the experiments to the 1% error channel and the HOM, ASC, and DES channels was insufficient to show the model's generalization to diverse channel settings. To address this, we have extended our experiments to include channels with error probabilities of 0.5%, 1%, 2%, 4%, 8%, and 16% at a typical code rate of 0.67 (100/150). The NERs are presented in the following table, where each cell represents the NER of a model trained on the channel specified by the row header and tested on the channel specified by the column header.
The results demonstrate that the proposed method is compatible across different IDS error probabilities. As expected, models trained on channels with higher error probabilities remain compatible with channels with lower error probabilities. In most cases, models trained and tested on similar channels achieve better performance. We will revise the manuscript to include these results.
| trained on \ tested on | 0.5% | 1% | 2% | 4% | 8% | 16% |
|---|---|---|---|---|---|---|
| 0.5% | 0.68 | 1.59 | 4.26 | 11.67 | 26.87 | 45.61 |
| 1% | 0.52 | 1.15 | 2.90 | 8.12 | 21.19 | 41.03 |
| 2% | 0.67 | 1.43 | 3.16 | 7.79 | 18.70 | 36.89 |
| 4% | 1.25 | 1.76 | 2.88 | 5.53 | 12.39 | 28.31 |
| 8% | 2.74 | 3.24 | 4.30 | 6.62 | 12.20 | 25.41 |
| 16% | 11.57 | 11.93 | 12.61 | 14.40 | 17.22 | 25.51 |
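For reference, the following is a minimal sketch of the kind of i.i.d. IDS channel assumed in these experiments, with the total per-symbol error probability split equally among insertions, deletions, and substitutions; the exact simulator used for training and evaluation may differ in its details.

```python
import numpy as np

def ids_channel(codeword, p_total, rng):
    """Pass a quaternary codeword through a simple i.i.d. IDS channel.

    Each position independently suffers an insertion, a deletion, or a
    substitution with probability p_total / 3 each; otherwise the symbol
    is transmitted unchanged. Sketch for illustration only.
    """
    received = []
    for sym in codeword:
        r = rng.random()
        if r < p_total / 3:                        # insertion before this symbol
            received.append(rng.integers(0, 4))
            received.append(sym)
        elif r < 2 * p_total / 3:                  # deletion of this symbol
            continue
        elif r < p_total:                          # substitution with a different symbol
            received.append((sym + rng.integers(1, 4)) % 4)
        else:                                      # no error
            received.append(sym)
    return np.array(received)

rng = np.random.default_rng(0)
codeword = rng.integers(0, 4, size=150)
received = ids_channel(codeword, p_total=0.01, rng=rng)
```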
The proposed customized codes require a well-defined channel setting. For real-world DNA sequencing data, a clear and unified characterization of such IDS channels remains an open research topic. Moreover, realistic IDS channels depend heavily on the wet-lab experimental conditions, the specific biochemical equipment employed, and other experimental factors. Unfortunately, conducting two rounds of wet-lab experiments, one to characterize the channel and another to validate the code's performance, is beyond our current resources. We therefore hope the reviewer will consider our work as aligned primarily with machine learning and coding theory.
Q3. Ablation Studies and hyperparameter optimization.
We apologize for not including comprehensive ablation studies and hyperparameter optimization in the original manuscript.
This work involves two main hyperparameters: the weight of the auxiliary reconstruction loss and the temperature in the Gumbel-Softmax discretization constraint. The ablation study of the auxiliary loss, its weight, and the Gumbel-Softmax constraint was included in the original manuscript; however, the hyperparameter optimization for the temperature was not provided.
To address this gap, we conducted experiments with the temperature set to each value in {0.125, 0.25, 0.5, 1, 2, 4, 8}, while keeping all other settings at their default values. The results are listed below.
| temperature | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|---|---|
| NER(%) | 5.37 | 3.41 | 2.70 | 1.15 | 7.63 | 21.57 | 69.82 |
The table shows that the lowest NER is obtained at the default temperature of 1. When the temperature is increased or decreased from this default, the NER worsens.
We will include a subsection or appendix in the revised manuscript displaying the training curves for different temperature values, as this will provide more informative insights than the table alone.
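For readers less familiar with the mechanism, the sketch below shows a generic Gumbel-Softmax sampler and the effect of the temperature: lower temperatures push the relaxed sample toward a discrete one-hot vector, while higher temperatures flatten it toward the plain softmax. This is the standard relaxation and not necessarily the exact constraint formulation used in the manuscript.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample from categorical logits (standard Gumbel-Softmax)."""
    # Gumbel(0, 1) noise; small constants guard against log(0).
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)          # numerical stability
    expy = np.exp(y)
    return expy / expy.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])           # scores for a 4-ary (A/C/G/T) symbol
for tau in (0.125, 0.5, 1.0, 4.0):
    print(tau, np.round(gumbel_softmax(logits, tau, rng), 3))
```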
Q4. Scale to larger sequences used in DNA storage. & Potential challenges.
As our focus is on developing an IDS-correcting code, we kindly ask to be excused from discussing the biochemical challenges associated with scaling up DNA storage.
To scale up DNA storage, two straightforward approaches may be considered: first, increasing the length of individual synthesized DNA strands; and second, increasing the number of synthesized DNA sequences.
The experiments in this work were conducted with a codeword length of 150 nt, which represents the typical sweet-spot length achievable with current biochemical techniques. For longer lengths of the same order of magnitude, based on our understanding of Transformer models, we believe the proposed method would remain effective, making it adaptable to the advances in biochemical techniques anticipated in the coming years.
Regarding enlarging the number of synthesized DNA sequences, the proposed code offers ample capacity to meet this requirement. At the typical code rate of 100/150, the number of potential codewords available for encoding messages is astronomical, exceeding the estimated count of atoms in the solar system.
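As a back-of-the-envelope check, assuming 100-nt quaternary messages and the commonly cited estimate of roughly 10^57 atoms in the solar system:

```python
# Assumption: one codeword per 100-symbol quaternary message, i.e. 4**100 codewords.
num_codewords = 4 ** 100
atoms_estimate = 10 ** 57               # commonly cited order-of-magnitude estimate

print(float(num_codewords))             # ~1.6e60
print(num_codewords > atoms_estimate)   # True
```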
The main challenge in increasing the number of DNA sequences lies in computational complexity. The plain Transformer has a computational complexity of O(n^2), where n is the length of the codeword. Although n remains constant regardless of the number of codewords, this complexity becomes less ideal when a vast number of codewords must be decoded. Fortunately, numerous studies have proposed Transformer variants with reduced complexity, and leveraging such variants may help mitigate this issue.
Moreover, given the parallelization capabilities of GPUs and the codeword length limited by current biotechnology, the time consumption of applying the Transformer is acceptable. We decoded 1,280,000 codewords on an RTX 3090, with the results listed below (the encoder, which shares the same structure, exhibits similar performance):
| message length | 50 | 75 | 100 | 125 |
|---|---|---|---|---|
| time (s) | 521.94 | 573.87 | 623.92 | 687.76 |
Comparison Experiment
We must acknowledge our fault in not providing sufficient comparison analysis in the original manuscript.
We would like to emphasize once again the challenges in conducting comparison experiments: many related works are difficult to align under the same settings, and few have made their source code publicly available, often for commercial reasons.
To evaluate the effectiveness of the proposed methods, we conducted comparison experiments against three prior works:
- A combinatorial code that can correct single IDS errors over a 4-ary alphabet:
- K. Cai et al., Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE Trans. Inf. Theory
- A well-known, efficient heuristic method, HEDGES, in DNA storage:
- W. H. Press et al., HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. PNAS
- A segmented method for correcting multiple IDS errors, called DNA-LM:
- Z. H. Yan et al., A segmented-edit error-correcting code with resynchronization function for DNA-based storage systems. IEEE Trans. Emerg. Topics Comput.
These methods typically offer only a few discrete, fixed configurations, and we made efforts to align their settings with ours as closely as possible. For DNA-LM, we kept the codeword length around 150 and adjusted the number of segments to match the code rates. For Cai’s combinatorial code, the code rates are fixed by the code lengths; in our experiments only the code rates are matched, with the length at each code rate set as follows: 21@0.33, 32@0.50, 48@0.67, 102@0.83. It is important to note that code length plays a critical role in these experiments, as longer codewords are more likely to encounter multiple errors that cannot be corrected. Cai’s performance here is therefore only a baseline statistic of multiple-error occurrences with respect to length, and it may degrade further at greater lengths. For HEDGES, only a binary library is publicly available, and it supports fixed code rates in {0.75, 0.6, 0.5, 1/3, 0.25, 1/6}; HEDGES’ inner code was tested independently for comparison.
The results are listed below (decoding error rates, in %, at a 1% IDS channel error probability) and will be presented as a figure in the revised manuscript, since the code rates of the methods are not fully aligned. In addition to the standard 1% error channel, we also conducted experiments on the ASC and DES channels.
| code rate | 0.33 | 0.50 | 0.6 | 0.67 | 0.75 | 0.83 |
|---|---|---|---|---|---|---|
| Cai | 0.44 | 1.00 | | 2.53 | | 8.65 |
| DNA-LM | 0.70 | 1.32 | | 3.11 | | 9.10 |
| HEDGES | 0.28 | 0.25 | 0.65 | | 3.43 | |
| THEA | 0.09 | 0.46 | | 1.06 | | 2.81 |
| Cai ASC | 0.11 | 0.41 | | 1.29 | | 5.72 |
| DNA-LM ASC | 0.93 | 1.71 | | 3.69 | | 10.20 |
| HEDGES ASC | 0.25 | 0.27 | 0.61 | | 2.89 | |
| THEA ASC | 0.05 | 0.30 | | 0.79 | | 1.83 |
| Cai DES | 0.48 | 1.49 | | 3.36 | | 11.10 |
| DNA-LM DES | 0.90 | 1.77 | | 4.03 | | 10.89 |
| HEDGES DES | 0.00 | 0.01 | 0.25 | | 2.00 | |
| THEA DES | 0.06 | 0.32 | | 0.94 | | 2.12 |
The results for Cai’s method indicate that directly applying classical combinatorial codes to a 1% IDS error probability channel at a codeword length of 150 is impractical: the observed error rates are high even though these values were obtained with code lengths shorter than 150. The segmented method with sync markers in DNA-LM supports a codeword length of 150 and can correct multiple errors across different segments; however, it also exhibits a high error rate, indicating a non-negligible likelihood of multiple errors occurring within the same segment. For HEDGES, while the results are commendable, the code rate is restricted to a limited set of fixed values. The results for THEA demonstrate the effectiveness of the proposed method: at lower code rates, THEA achieves error rates comparable to HEDGES, and at higher code rates it outperforms HEDGES, achieving a lower error rate at a higher code rate (2.81%@0.83 for THEA vs. 3.43%@0.75 for HEDGES).
Thanks for your response! My rating remains the same. Please make sure to include your answers above in the final version if accepted.
We thank all the reviewers for their detailed feedback. We have carefully addressed the comments and provided a thorough response in our rebuttal and the revised PDF. If any points require further clarification, we would greatly appreciate the opportunity to engage in a discussion.
We deeply respect that the reviewers' time is both limited and valuable. However, with the discussion phase concluding soon, we would greatly appreciate it if the reviewers could spare a moment to respond to the rebuttal. Your feedback would undoubtedly be of great benefit to our work.
This paper considers using the Gumbel-Softmax as a discretization constraint within an autoencoder framework designed for IDS error-correcting codes, specifically targeting DNA storage. The paper also presents a differentiable IDS channel that is designed to model the error patterns encountered in DNA storage, hence allowing the IDS code to be trained and tested under conditions mimicking actual IDS channels.
All reviewers provided excellent and balanced reviews mentioning the strengths and weaknesses of the paper. However, most of them agreed that the scope of the contributions is rather limited, for several reasons: (i) lack of realistic DNA storage channel modelling; (ii) absence of a fair comparison to the state of the art; (iii) the analysis of the discretization effect of applying Gumbel-Softmax in a non-generative model is not particularly novel (given its extensive use in VAEs); and (iv) the use of the transformer architecture seems unnatural (one should start with VAEs, demonstrate how the code can be used for a DNA channel, and then introduce transformers).
I agree with these concerns, and hence assess that the paper does not cross the bar for publication at ICLR.
Additional Comments on the Reviewer Discussion
There was a robust but respectful exchange between the authors and the reviewers. The reviewers appreciated the authors' efforts to clarify the novelty of the work and the additional experimental results. The reviewers also judged this paper as a machine learning paper in the context of ICLR, which is fair. However, at the end of the day, they remained unconvinced about the scope of the contributions, for the reasons given above. The reviewers made many suggestions to strengthen the paper for a future submission.
Reject