BAnG: Bidirectional Anchored Generation for Conditional RNA Design
BAnG is a Bidirectional Anchored Generation method specifically tailored for conditioned RNA design
Abstract
Reviews and Discussion
This paper proposes a deep learning-based model, RNA-BAnG, for designing RNA sequences that interact with specific proteins without requiring extensive experimental data or structural knowledge. Its core innovation, Bidirectional Anchored Generation, exploits the presence of functional binding motifs within broader sequence contexts. The model is validated on synthetic tasks with localized motifs and biological sequences to demonstrate its effectiveness for conditional RNA sequence design.
Questions For Authors
N/A
Claims And Evidence
- The authors provided experimental evidence to support their major claims.
- The method is indeed novel. Specifically, the factorization of the joint distribution in Eqn. 1 and the way the attention mask has been designed to execute the training steps in one go are quite interesting. The authors' claim on this is properly justified in Section 2.1.
- My main concern is regarding the synthetic task for evaluation and comparison with other methods. The synthetic task designed here is interesting; however, it is not clear whether doing better on this task will directly transfer to more complex real-world tasks.
- In the synthetic task, only sequences of length 50 are considered, which is quite short. Sequence generative models often perform much worse with longer sequences. I do not see any benefit of using ROPE in this case since all the sequences are quite short.
- It seems like all the sequences are 50 residues long (as mentioned in Section 3: "The synthetic data consists of nucleotide sequences, each 50 residues long"). If that is the case, then this is also a point of concern. Real-world sequences vary widely in length, which affects the evaluation of the model's generalizability.
Methods And Evaluation Criteria
The method has novelty and the evaluation criteria were chosen properly for the application at hand.
Theoretical Claims
The authors introduce a factorization of the joint distribution in Eqn. 1 and a way to execute it with a particular attention masking strategy that enables parallelization during training, similar to autoregression. This is presented correctly and discussed properly.
Experimental Designs Or Analyses
The experimental design and analyses in this paper are mostly valid. However, the evaluation on synthetic data has some limitations as discussed above.
Supplementary Material
In the appendix, the authors provided more details about the architectures, generated synthetic sequences, the algorithm for geometric attention, data processing steps, and some additional results and parameters of the other tools used. These give better context about the paper and are properly explained/depicted.
Relation To Broader Scientific Literature
The method introduced in this paper is related to the RNA sequence design literature as well as machine learning on RNA molecular data.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
I appreciate that the authors provided anonymized code for their method.
Other Comments Or Suggestions
N/A
We sincerely appreciate the reviewer’s thoughtful comments and the ideas provided for additional experiments.
We understand that the synthetic task serves as a controlled environment to test the model’s performance, and we agree that the simplifications involved may not fully reflect the complexity of real-world scenarios. However, our intention was to provide a baseline evaluation to better understand the model’s behavior in a controlled setting before applying it to more complex, real-world tasks. We will clarify this in the revised manuscript to ensure that the purpose of the synthetic task is better understood.
The length of 50 was selected because our main application goal is the design of aptamers, which are typically short (20-100 bases). While experiments with longer sequences will likely degrade the performance of other methods, this will not affect BAnG. ROPE (or an alternative positional encoding) is required to incorporate nucleotide order and relative in-chain distance information. Furthermore, while variable sequence length may negatively impact the performance of other methods, it should have no influence on BAnG. These points are supported by the additional experiments we have conducted; the results will be provided in the revised manuscript and can be found here.
We are grateful for the reviewer’s suggestions, which have helped guide us in exploring these additional experiments. We hope the clarifications and new results will address the concerns raised and the reviewer will consider increasing their score. If there are any additional specific changes or analyses you would like us to address, we are more than happy to make further revisions.
The paper proposes a generative model which takes as input a protein structure and generates a potential RNA binder to that protein. The main methods contribution is a modification to usual left to right autoregressive generation which better fits the given task.
Post-rebuttal, based on authors' last comment
FID scores in image generation, refoldability metrics in protein design
FID is based on Inception (top model on ImageNet at the time) and re-foldability is based on AlphaFold2 (Nobel Prize). In my opinion, individually (per test set sample) trained DeepCLIP models are not similar/analogous to these models used in other ML domains. I'm not convinced that this manner of evaluation is appropriate or meaningful.
using RNA structure: we are confused by what the reviewer finds “not sufficiently convincing”
I'm just not convinced by claims and statements in the main paper PDF which say that the method does not use structural data.
I wrote previously: "Introduction claims that the proposed method does not rely on RNA structural data. Yes, that's technically true once the model is trained, but all the training data requires using RNA structures from the PDB based on my understanding, so this claim is not supported."
And the authors wrote back confirming my understanding to be correct - they do need the PDB and 3D structures in order to prepare their training datasets, which I think is one of the most important parts of applied ML models. So the claims are misleading in my opinion.
While we are happy to elaborate on data processing and other procedural details, the emphasis appears to have been placed on peripheral rather than essential elements.
I think evaluation metrics and training datasets are super important details of applied ML papers for scientific domains.
Additionally, some comments seem to focus more on wording than on technical substance. Although we provided detailed and direct responses to all raised points, these were not reflected in the final evaluation, as the original score was maintained.
I think one of our major jobs as reviewers is to carefully scrutinize the major claims that the authors make, which includes details/nuances of how they are worded. That's what I am doing in my review.
Based on the points outlined above, we believe the current score is not well supported by the substance of the review and does not accurately reflect the core contributions of the paper, particularly in the context of machine learning methods for sequence generation. We hope the reviewer might reconsider their evaluation in light of the clarifications and arguments provided.
I believe the score reflects my assessment of the paper at present. I am sorry if the authors are disappointed by this.
Questions For Authors
Stated in the rest of the review.
Claims And Evidence
I felt like I disagreed or did not find myself being sufficiently convinced by several claims here:
Introduction claims that the proposed method does not rely on RNA structural data. Yes, that's technically true once the model is trained, but all the training data requires using RNA structures from the PDB based on my understanding, so this claim is not supported.
There is the related claim (repeated throughout) that all methods other than this paper's require structures as inputs. The conclusion states "eliminating the need for extensive structural or interaction data...innovation significantly broadens the applicability of our method" -- but yet again, the model is trained on PDB structural data of interacting proteins and RNAs, so I find this statement confusing.
Next, the claim (repeated throughout) that aptamer binding motifs are the only thing important for protein-RNA interactions and the rest of the RNA sequence does not matter/is less important -- this claim is presented as if universally true -- but firstly this should be supported by some sort of citation from basic science, and if this is not the case universally, some caveats should be presented here.
Introduction ends with a claim that experimental RNA-protein interaction data is used for evaluation. I am not sure that's what happens, as all the evaluation uses another ML model (DeepCLIP), trained on experimental data, to compute in-silico evaluation metrics. So the phrasing of this claim feels vague and imprecise. On a related note, all performance-related claims in the abstract/introduction feel purposely not quantitative (eg. simply saying "shows promising results", "outperform previous methods").
Methods And Evaluation Criteria
I don't think that the data is prepared in a sufficiently rigorous manner/I have questions I'd like clarified:
Data is split based on protein sequence homology. However, there can be a few known issues with such a split when working on biomolecule interactions. Firstly, while overall homology may be low, there may be cases where the homology of the interacting residues/positions is still high - and it seems like best practice to prioritise interacting positions instead when preparing splits. Additionally, structural homology may also be highly relevant here as you define interactions from a structural perspective/rely on the PDB. There have been several papers now pointing to such issues in biomolecule interaction data splits, and this one is perhaps the most recent resource: https://www.pinder.sh/
The appendix states the following about the data: "... protein mean length of 155 residues with a standard deviation of 90; RNA lengths average 1,834 nucleotides (± 1,564); DNA lengths average 76 nucleotides (± 78)."
Can you provide further information as to why the proteins have shorter lengths than RNAs in your dataset? Are many of these interactions from the ribosome? Are these super long RNAs from the ribosome?
Next, it's stated: "To prevent potential computational resource issues and to focus the model on the binding motifs, we truncated nucleotide sequences exceeding 300 residues during training and validation."
Why do you think this is a biologically sensible choice/is there a scientific justification to doing this, or is it a purely ML-driven decision?
Theoretical Claims
Not applicable.
Experimental Designs Or Analyses
I feel very skeptical of the evaluation metric used here -- another machine learning model which is trained to predict protein-RNA binding affinity -- my intuition from reading the literature is that such models almost never generalize to new datapoints beyond what's close to their training set, as they don't really learn the physics of protein-RNA interaction but are rather based on co-occurrence statistics. Additionally, these models' performance often looks good when presenting aggregate numbers (avg. across a test set), but in reality they often become very good at ranking/predicting for poor binders/negative fitness, while being almost random for the very, very few positive binders/positive fitness that are present in the train as well as test data. However, practically, we mostly care about their performance at identifying the positives, which they don't really tend to do in my experience.
I would really like the authors to provide significantly more justification of this most important experimental design choice rather than just referencing another recent paper.
Supplementary Material
I briefly looked at the code but did not attempt to run it. I read the appendix.
Relation To Broader Scientific Literature
I think the paper is contributing to an area with growing recent interest. Generally designing biomolecular interactions and specifically RNA aptamers are a growing topic. But I think the paper and broader community need to do a far better job of dataset and evaluations being rigorous and biologically relevant.
Essential References Not Discussed
Perhaps discussing recent advances in 3D and pseudoknotted RNA inverse design, eg. gRNAde (ICLR'25), RhoDesign (Nature Computational Science). I think a lot of interactions with binding partners are mediated via complex RNA tertiary structural elements like pseudoknots - and it may be worth connecting to how our ability to design RNA from a tertiary perspective is also improving (and is perhaps complementary to this work which takes a more implicit sequence-only approach).
Other Strengths And Weaknesses
The synthetic experiments were well designed to highlight why the methodological contribution is interesting. I liked how they eased the reader into the much more complex experiments afterwards and how they connected with the methods section.
The paper should really be discussing the limitations of the method in more depth. The conclusion reads a bit too positive and makes no attempt to provide caveats (this is a matter of taste/opinion, but I think the study has some limitations esp. around how the data has been prepared and what evaluation metrics are being used).
Other Comments Or Suggestions
Generally, we may want to target specific epitopes/binding sites on a protein. It may be worth discussing how the current method is limited in that respect.
Thank you for your feedback and for bringing attention to key aspects of our study. Below, we address your concerns and clarify key points regarding our methodology and evaluation.
RNA structural information from PDB was used only during data preprocessing to identify interacting nucleotides; however, it was not incorporated into the training process. Our model does not require RNA structure information for inference, nor does it internally predict or rely on it in any way. In contrast, some other methods depend on structural data for training, inference, or both, making our approach more broadly applicable, especially in cases where such information is unavailable. We acknowledge that certain explanations may have been challenging to interpret and will take this into account in our text revisions.
For training, we filtered proteins to include only those shorter than 500 residues, while RNA and DNA sequences were not subject to length restrictions (Appendix C.1). Consequently, proteins in our dataset tend to be shorter. Nucleotide sequences were cropped during training, leading to an effective training length of 205±114 residues, primarily to reduce computational memory requirements. Long RNA sequences are typically derived from the ribosome. Given that our ultimate goal is to generate aptamers—short RNA sequences—the cropping is appropriate, as our model doesn't need to learn long-range dependencies. We acknowledge that a sequence-based split is not ideal; however, it was applied solely during training to enhance data diversity and facilitate optimal weight selection, rather than to support any specific claims (Appendix C.1). Our test set contains proteins with less than 25% sequence similarity to the training data (Appendix F). Given that proteins with such low similarity are estimated to have less than a 10% chance of sharing a similar fold (doi.org/10.1093/protein/12.2.85), this suggests that our model demonstrates a degree of generalization across both structural and sequential spaces.
Evaluating generative models is inherently challenging, as direct comparison with ground truth is not possible. Additionally, to the best of our knowledge, there are no established computational protocols for assessing RNA affinity to proteins. Given these challenges, we opted to use a deep learning method for scoring, similar to practices in other domains like computer vision. Among available methods, we selected DeepCLIP due to several key advantages: it has undergone experimental validation, is retrainable for individual proteins, has demonstrated strong performance in benchmark studies (doi:10.1093/bib/bbad307), and does not rely on RNA structural information. We took extensive measures to verify its suitability (detailed in Appendix E.1), including rigorous training and testing of DeepCLIP models for each evaluation sample and selecting only those with exceptionally high performance metrics (AUROC > 0.95). While the reviewer suggests that such models may have a tendency for false positives and negatives, this would only imply that RNA-BAnG performs even better than our current estimates indicate. Nonetheless, we ensured that our DeepCLIP training and testing datasets contained an equal number of positive and negative samples to mitigate any potential biases (Appendix E.1).
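The filtering step described above (keep only per-protein scoring models whose held-out AUROC exceeds 0.95) can be illustrated with a minimal sketch. This is not the authors' code or the DeepCLIP API: the function names `auroc` and `keep_reliable` and the input layout (held-out scores for positive and negative sequences per protein) are assumptions made for illustration only.

```python
def auroc(pos_scores, neg_scores):
    """Rank-based AUROC: probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def keep_reliable(heldout, cutoff=0.95):
    """Return the proteins whose scoring model exceeds the AUROC cutoff.

    `heldout` maps a protein name to (positive_scores, negative_scores)
    produced by that protein's scoring model on its own test split.
    The layout is hypothetical; 0.95 is the cutoff quoted in the rebuttal.
    """
    return [name for name, (pos, neg) in heldout.items()
            if auroc(pos, neg) > cutoff]


if __name__ == "__main__":
    heldout = {
        "protein_A": ([0.9, 0.8], [0.1, 0.2]),
        "protein_B": ([0.9, 0.8], [0.1, 0.85]),
    }
    print(keep_reliable(heldout))
```

Note that a high held-out AUROC only measures ranking on sequences drawn from the same distribution as the training data; it says nothing by itself about de novo generated sequences.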
The claim that aptamer binding motifs are the primary determinants of protein-RNA interactions, while the rest of the RNA sequence plays a lesser role, is supported by the findings of a prior study (doi.org/10.1038/nature12311). This paper, which provided the experimental data used in our evaluation, also states that RNA structure is not a critical factor in protein interactions for most of the analyzed samples.
Targeting specific protein binding sites with our model is currently possible through protein truncation. We did not incorporate a more direct method, since the primary protein-aptamer interaction experiment, SELEX, is not binding-site specific either. It has not been a priority for the current study, but it could be explored in our future work.
We appreciate the reviewer’s insights, which have helped us refine our explanations. We will also ensure that our model's limitations are properly discussed in the conclusion. Given the additional clarifications provided, we hope the reviewer will consider reassessing their score. If there are additional aspects you would like us to address or specific analyses you would like to see, we are open to further suggestions.
I acknowledge the rebuttal. I will be retaining my score and assessment of the work.
PDB was used only during data preprocessing to identify interacting nucleotides; however, it was not incorporated into the training process.
I would consider data preparation to be an important part of developing a model, so still found the claims made in the paper to not be sufficiently convincing.
I understand that the proposed model does not need structural information during inference.
Long RNA sequences are typically derived from the ribosome.
Why would this be justified, since the ribosome is a pretty special protein-RNA complex held together by interactions, and one wouldn't really consider those interactions as aptamer sites?
we opted to use a deep learning method for scoring ... rigorous training and testing of DeepCLIP models for each evaluation sample and selecting only those with exceptionally high performance metrics
I understand that you are evaluating on samples where the DeepCLIP method does exceedingly well on. However, why would we expect this to hold for de novo generated samples from a generative model?
imply that RNA-BAnG performs even better than our current estimates indicate
I don't understand why that would be the case - can you elaborate?
Thank you for the reviewer’s feedback; below, we address the key points raised and provide clarifications.
Clarifications on Reviewer’s Concerns
Using RNA structure: we are confused by what the reviewer finds “not sufficiently convincing” regarding the role of RNA structure. To further clarify the difference between relying on RNA 3D structure and not: using RNA 3D coordinates from the PDB would mean dealing with unresolved residues, leading to fragmented sequences. Our model avoids this by using the full RNA sequences, as listed in the headers of the PDB files' CIF data, including all residues, not just the resolved ones.
Ribosome: Our training has benefited significantly from the inclusion of ribosomal and DNA data, as it greatly increased protein diversity. While, as the reviewer pointed out, protein-RNA interactions in the ribosome differ from those involving protein-aptamer complexes, we believe that the transfer of knowledge is feasible due to the similarity in the fundamental physical principles underlying both.
DeepCLIP evaluation: Our primary goal is to assess whether the generated RNA sequences share motifs with those found in experimental data. Since DeepCLIP is designed to identify nucleotide motifs already present in its training data using CNNs and BLSTM layers, it is well-suited to this task. Concretely, we trained individual DeepCLIP models for each sample of the test set. We do not expect DeepCLIP to detect binding motifs not present in the set of samples’ positive sequences. Thus, our reported metrics may even be an under-estimate of our method's success. This possible underestimation is a known consequence of our procedural design, which we accept as it enforces a stricter evaluation of RNA-BAnG’s performance.
We would like to highlight that our evaluation approaches, such as using a critic model to assess generated samples, are quite standard in the ML community (e.g., FID scores in image generation, refoldability metrics in protein design (arxiv:2312.00080)). We hope that our promising results will inspire further biological validation in experimental settings.
For the last point: we had a typo in our initial response regarding false positives and false negatives. What we meant to say is that if, as the reviewer points out, the DeepCLIP model is very good at detecting negative binders, but random on positive ones, then the corresponding evaluation score would only improve if we had a better model which also does well on positive binders. In this sense, if your intuition about DeepCLIP is true, then the current numbers would under-estimate the true performance of RNA-BAnG.
On reviewer’s assessment
We appreciate the reviewer’s engagement and feedback, though we feel that many of their comments focus on aspects not central to the scope or contributions of the paper. In particular, the discussion has shifted away from the core methodology, model, and results on synthetic tasks, which represent the main substance of this work. While we are happy to elaborate on data processing and other procedural details, the emphasis appears to have been placed on peripheral rather than essential elements. Additionally, some comments seem to focus more on wording than on technical substance. Although we provided detailed and direct responses to all raised points, these were not reflected in the final evaluation, as the original score was maintained.
Based on the points outlined above, we believe the current score is not well supported by the substance of the review and does not accurately reflect the core contributions of the paper, particularly in the context of machine learning methods for sequence generation. We hope the reviewer might reconsider their evaluation in light of the clarifications and arguments provided.
This manuscript presents RNA-BAnG, a deep learning model for generating RNA sequences that bind to specific proteins. The method involves a novel Bidirectional Anchored Generation (BAnG) technique, which generates RNA sequences by conditioning on protein sequence and structure. RNA-BAnG utilizes geometric attention to incorporate protein structural information, enabling effective RNA sequence design without requiring experimental data for the target protein. The model is validated on synthetic tasks and experimental RNA-protein interaction data, demonstrating superior performance compared to existing methods.
Questions For Authors
Same as Other Strengths And Weaknesses
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes
Theoretical Claims
Yes
Experimental Designs Or Analyses
Yes
Supplementary Material
Yes
Relation To Broader Scientific Literature
All good
Essential References Not Discussed
No
Other Strengths And Weaknesses
This manuscript has several notable strengths:
1. Introducing RNA-BAnG, a novel deep learning-based model that generates RNA sequences for protein interactions without requiring experimental data or RNA structural information, making it widely applicable in various biological contexts.
2. Innovative BAnG method, which allows RNA sequence generation by conditioning on protein sequence and structure, addressing the challenge of protein-RNA interaction design more efficiently than existing approaches.
3. Demonstrating strong performance through comprehensive validation on synthetic tasks and experimental data, showing that RNA-BAnG outperforms other sequence generation methods in terms of both novelty and binding affinity.
Weaknesses:
1. Although the manuscript demonstrates the superiority of RNA-BAnG by comparing its performance with different methods, it does not directly evaluate the impact of the model's different modules or mechanisms through ablation experiments. It would be beneficial to include ablation studies to assess the specific impact of each component on the model's generation performance.
2. While the sequences generated by the model show good performance in terms of DeepCLIP scores, the actual biological functions and mechanisms of these binding motifs have not been thoroughly validated. It is recommended that the authors include a detailed analysis of the binding motifs in the manuscript to enhance the biological interpretability of the model.
3. In the section “2.1. Description of the BAnG generative approach”, it is suggested to change the reference format for the equations to "Eq. (1)".
Other Comments Or Suggestions
Same as Other Strengths And Weaknesses
Ethical Review Concerns
No.
Thank you for your helpful feedback. We appreciate your suggestions and have made adjustments accordingly. Below, we address the points you raised regarding our methods and analysis. Additionally, we are exploring ways to provide a more detailed analysis of the binding motifs, as suggested, to further improve the clarity and depth of our evaluation.
In response to your suggestion, we have conducted ablation studies to evaluate the impact of non-traditional architecture choices, such as positional encodings in Cross Attention and Geometrical Attention. These studies have demonstrated the significant contribution of these components to the model’s performance, and the results will be included in the Appendix of the revised manuscript for further clarity.
We hope that these changes address the concerns raised and improve the overall quality of our manuscript. Thank you again for your insightful comments and suggestions. If there are any additional specific changes or analyses you would like us to address, we are more than happy to make further revisions.
This paper proposes RNA-BAnG, a deep learning framework for conditional RNA design, focusing on generating RNA sequences that can bind to specific proteins. The core contribution is the introduction of the Bidirectional Anchored Generation (BAnG) method, which generates RNA sequences bidirectionally starting from anchor tokens placed within functional binding motifs, rather than from sequence ends as in standard autoregressive approaches.
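To make the contrast with left-to-right decoding concrete, the following sketch lists the order in which positions would be filled when generation starts from an anchor and grows outward. It assumes a single anchor and strict right/left alternation, neither of which is specified in this thread; the paper's actual factorization (its Eqn. 1) may differ, and the function name `anchored_order` is hypothetical.

```python
def anchored_order(length, anchor):
    """Positions in the order an anchored, alternating decoder would visit them.

    Assumption: one anchor index, then right neighbour, left neighbour,
    next right, next left, and so on until both ends are reached.
    """
    if not 0 <= anchor < length:
        raise ValueError("anchor must be a valid index into the sequence")
    order = [anchor]
    left, right = anchor - 1, anchor + 1
    while left >= 0 or right < length:
        if right < length:
            order.append(right)
            right += 1
        if left >= 0:
            order.append(left)
            left -= 1
    return order


if __name__ == "__main__":
    # Standard autoregressive order is the special case anchor == 0.
    print(anchored_order(5, 0))
    print(anchored_order(5, 2))
```

Under this assumed ordering, tokens near the anchor (inside the motif) are always generated first, so their distribution does not depend on how far the motif sits from either sequence end.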
update after rebuttal
Thanks for your additional experiments. I would be happy to raise my score, and I suggest that the authors include these results in the camera-ready version.
Questions For Authors
- The bidirectional decoding approach is similar to ProteinMPNN. Can this method be extended to decode in any order, like ProteinMPNN?
- BAnG should be effective for a single motif, but what about multiple motifs? There can be several important regions.
Claims And Evidence
Claim about outperforming RNAFLOW and GenerRNA broadly: While RNA-BAnG shows better performance in the provided case studies, RNAFLOW and GenerRNA were evaluated on a relatively limited test set. The RNAFLOW evaluation is constrained to proteins with no sequence similarity to the training data and without manually truncating proteins to binding regions, which might have impacted its performance.
Methods And Evaluation Criteria
The methodology and evaluation framework are well-aligned with the stated goals of conditional RNA design and provide a strong foundation for assessing model performance.
Theoretical Claims
The paper does not present formal theoretical claims or proofs.
Experimental Designs Or Analyses
The experimental design is largely sound and well-aligned with the goals of conditional RNA generation. The analyses are thorough, but the biological evaluation would be further strengthened with experimental validations or a broader baseline comparison scope.
Supplementary Material
The supplementary material is comprehensive, particularly for clarifying model implementation, training regimes, and evaluation pipelines. It strengthens the transparency and reproducibility of the experiments. RNAFLOW baseline limitations could be discussed in more detail.
Relation To Broader Scientific Literature
By training the model to design RNA starting from the binding area, BAnG outperforms other models. The findings may also be useful for other bio-molecule designs.
Essential References Not Discussed
No
Other Strengths And Weaknesses
Strengths:
- The proposed BAnG outperforms previous works significantly.
- The idea of generating the important area of RNA first makes sense.
Weaknesses:
- The proposed BAnG is only compared with RNAFLOW on several samples. A statistical metric comparison is required, e.g. binding affinity distribution.
- The effect of anchor point position selection is not discussed.
Other Comments Or Suggestions
- "Importantly, the method’s design makes it applicable beyond RNA-protein interactions, extending to any scenario where the focus is on optimizing functional subsequences within a larger sequence." I suggest the authors conduct more experiments on protein-protein or other generation tasks to demonstrate this applicability.
- The authors use DeepCLIP to assess the RNA-protein binding affinity. Can any other models or traditional computational approaches be used? I suggest the authors use more methods to assess the binding affinity, since there is no commonly used model for this purpose.
- I will raise my rating if more evaluations and comparisons are conducted.
We appreciate the reviewer’s insightful comments. Let us now address the specific questions and concerns raised in the review regarding our evaluation choices and methodological decisions.
Expanding the comparison to additional baseline models is challenging because most existing models (cited in the paper) rely on large training sets of RNA sequences known to interact with a specific protein to generate new interacting sequences. However, such datasets are often difficult to obtain or may not exist at all, making our model—which operates without this requirement—more versatile and practical. This same constraint also prevents us from expanding the GenerRNA comparison set. Regarding RNAFLOW, its assessment on proteins with no sequence similarity to our training data was intended to make the comparison more rigorous for our model and should not have negatively impacted RNAFLOW’s performance. The decision to restrict the comparison set in this way was due to RNAFLOW’s long generation time and its clear limitations in the given settings, particularly its inability to operate on non-truncated proteins. Since truncation is not technically possible—binding sites are unknown, just as they typically are in real-world applications—our evaluation setup naturally reflects this constraint.
Both comparisons with baseline models are quantified using statistical measures. Each sample consists of 1,000 generated RNA sequences, and we evaluate performance by calculating the proportion of sequences that achieve high affinity scores (section 4.4). This ensures a robust statistical basis for our comparisons and allows for a meaningful assessment of model performance. To provide a more quantitative comparison, we computed the proportion of sequences above a threshold, similarly to Figure 6, and calculated areas under threshold-dependent performance curves for both RNAFLOW and GenerRNA. The results further strengthen our claim and can be found here.
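As a rough illustration of the summary statistic described above, the sketch below computes the proportion of generated sequences scoring above a threshold and integrates that proportion over a threshold grid with the trapezoidal rule. The function names, the strict `>` comparison, and the threshold grid are assumptions for illustration; the thread does not give the exact protocol.

```python
def fraction_above(scores, threshold):
    """Proportion of generated sequences whose score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def area_under_threshold_curve(scores, thresholds):
    """Trapezoidal area under the (threshold, fraction_above) curve.

    `thresholds` must be sorted in increasing order. A larger area means
    more sequences stay above the cutoff as the cutoff is raised.
    """
    ys = [fraction_above(scores, t) for t in thresholds]
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (ys[i] + ys[i - 1]) * width
    return area


if __name__ == "__main__":
    # Toy scores standing in for affinity scores of one sample's sequences.
    scores = [0.2, 0.6, 0.9, 0.95]
    print(area_under_threshold_curve(scores, [0.0, 0.5, 1.0]))
```

Summarizing over a range of thresholds avoids having the comparison between methods hinge on a single, arbitrarily chosen cutoff.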
Evaluating generative models is inherently challenging, as direct comparison with ground truth is not possible. Additionally, to the best of our knowledge, there are no established computational protocols for assessing RNA affinity to proteins. DeepCLIP is experimentally validated, does not rely on RNA structure (which is unavailable in our case), and has demonstrated strong performance in benchmark studies (doi:10.1093/bib/bbad307). To address evaluation challenges, we have added a more traditional approach by calculating sequence similarity using BLASTN. Results for this evaluation can be found here.
Regarding BAnG’s broader applicability, its extension to alternative biological tasks, such as protein-peptide interactions, is planned for future work as it requires substantial additional effort, particularly in data mining and preparation, making it beyond the scope of this paper.
While the effectiveness of BAnG for multiple mutually exclusive motifs was verified with the DoubleBind task (Table 2), its performance when multiple motifs appear simultaneously in a sequence depends on the uncertainty of their relative positioning. If motif placement is completely random, BAnG will face similar challenges to autoregressive models. However, if relative positions are fixed, BAnG’s performance remains stable. These points are supported by additional experiments we have conducted; the results can be found here. Ultimately, BAnG is designed to reduce uncertainty in the target probability distribution, and increased noise (uncertainty) in the data makes learning more difficult for any model.
Anchor point selection plays a key role in minimizing uncertainty in the resulting factorization distribution. In synthetic tasks, any residue with a fixed distance from the synthetic motif could serve as an anchor point. However, in real-world scenarios, binding motifs vary in size and content, making random selection among interacting residues a practical approach to reduce uncertainty and diversify training data. In general, anchor point selection is driven by the goal of uncertainty reduction, but the optimal choice remains case-dependent.
Finally, regarding decoding strategies, ProteinMPNN’s sampling method is equivalent to the Iterative Max Logit strategy tested in our manuscript. However, extending BAnG to allow decoding in any order contradicts its core probability factorization and is therefore not feasible.
We are grateful for the reviewer’s thoughtful feedback, which has helped clarify key aspects of our approach and provided valuable ideas for additional experiments and evaluation metrics. We hope our responses address the reviewer's concerns and that the reviewer will consider increasing their score. If there are any additional specific changes or analyses the reviewer would like us to address, we are more than happy to make further revisions.
The paper introduces RNA-BAnG, a model for designing RNA sequences that bind to specific proteins, which does not need to be trained on extensive experimental data of RNA sequences known to interact with target proteins or on detailed RNA structural information. The model combines a novel generative method, BAnG, with a transformer-based architecture enhanced by geometric attention mechanisms that incorporate protein structural information. The authors base their approach on the observation that RNA sequences binding to proteins often contain functional binding motifs, making it more effective to anchor sequence generation around the motif (i.e. initiating from binding sites) rather than at the sequence ends.
The authors validate their approach first on synthetic tasks and show that the proposed method preserves motifs more effectively than conventional autoregressive and masked iterative generation approaches. They then evaluate the model on real biological RNA-protein interaction datasets, where again RNA-BAnG outperforms existing models in generating high affinity sequences and is shown to generate sequences of high diversity and novelty.
Questions for Authors
- Are there certain protein structural features or binding patterns that affect generation in a certain way e.g. make it more challenging?
- BAnG seems potentially applicable to other biomolecular design problems. Have other applications/domains been considered e.g. PPI or small molecule design?
Claims and Evidence
The authors make several specific claims in the paper that warrant careful examination.
First, they claim that their method is particularly suited for RNA sequence generation compared to other generation methods, based on, as mentioned above, the observation that RNA sequences binding to proteins contain localized functional binding motifs. This is strongly supported by synthetic experiments such as SingleBind and DoubleBind, where RNA-BAnG achieves significantly higher motif preservation (97% in DoubleBind) compared to autoregressive (53%) and iterative generation methods (4-5%). This provides robust empirical validation of the claim in controlled settings.
Second, they claim that their method can generate RNA sequences that bind to specific proteins without requiring extensive experimental data of RNA sequences known to interact with target proteins or detailed RNA structural information. This is also strongly supported, since the model is trained using only the protein sequence and its structure as predicted by AlphaFold, without experimentally validated RNA binding sequences. Furthermore, they provide some analysis suggesting that the model generalizes beyond the training proteins and has learnt something more fundamental. While the analysis provided is not in depth, this is compelling evidence.
Third, the authors claim that the geometric attention mechanism, which integrates protein structure information, is essential for their model to converge. However, they do not provide comparative ablation experiments. This claim would benefit from stronger evidence.
Finally, the authors claim that their method produces diverse and novel RNA sequences with high predicted binding affinity to target proteins. They provide quantitative metrics: high sequence diversity (0.93 ± 0.13) and novelty (0.99 ± 0.01), the latter calculated as the proportion of generated sequences that do not appear in the training set.
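The novelty figure quoted above is described as the proportion of generated sequences absent from the training set. A minimal sketch of that calculation follows; it assumes exact string matching and counts duplicate generated sequences individually, neither of which is confirmed by the paper, and the function name is hypothetical.

```python
def novelty(generated, training):
    """Fraction of generated sequences that do not appear in the training set.

    Assumes exact string matching; duplicates in `generated` are counted individually.
    """
    if not generated:
        raise ValueError("generated must be non-empty")
    training_set = set(training)  # set lookup keeps the check linear in len(generated)
    return sum(seq not in training_set for seq in generated) / len(generated)
```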
The results show that the model consistently generates sequences with higher predicted affinity scores than GenerRNA, and achieves superior performance to RNAFLOW across most test proteins (although the authors acknowledge possible methodological differences here).
A notable limitation in the evidence is the reliance on DeepCLIP scores as a proxy for binding affinity rather than experimental validation. The authors acknowledge this limitation but demonstrate DeepCLIP's reliability by showing its clear discrimination between experimentally derived positive and negative binding sequences (area values of 0.88 vs. 0.11). This suggests that it could be a reasonable proxy for this evaluation, though, as the authors acknowledge in the conclusion, experimental validation would strengthen the practical applicability of the generated sequences.
Methods and Evaluation Criteria
The paper introduces the BAnG generative approach and the RNA-BAnG model architecture. BAnG introduces a novel factorization of the joint distribution over an RNA sequence that enables bidirectional sequence generation from a central anchor point (rather than sequentially from the sequence ends), which, as explained before, exploits the biological reality that RNA sequences binding to proteins contain embedded functional motifs. Specifically, it defines two special anchor tokens that represent the left and right boundaries of the binding region, and generates sequence tokens alternately in both directions, extending outward from the anchors, using a custom bidirectional attention mask to correctly model dependencies.
The RNA-BAnG architecture itself has two modules reminiscent of a traditional transformer encoder-decoder, with a few key changes. First, there is a protein module that encodes protein sequence and structure, using standard self-attention and a geometric attention based on the invariant point attention mechanism from AF2 to process the structure information. Then there is a nucleotide module that generates RNA sequences conditioned on protein representations. It first encodes RNA sequences with self-attention (with special handling for anchor tokens), and then incorporates the protein representations from the protein module with a cross-attention block.
The training procedure consists of two phases: (1) pretraining on non-coding RNA sequences to learn general RNA properties, and (2) finetuning on protein-RNA interaction datasets to learn protein-conditioned sequence generation.
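The alternating, outward-from-anchor generation order summarized above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: the two anchor tokens are placed next to each other at the anchor point, each side stops when a hypothetical end token is produced, and `toy_next_token` is a random stand-in for the trained model. All names here are invented for illustration.

```python
import random

# Hypothetical special tokens; the paper's actual vocabulary may differ.
LEFT_ANCHOR, RIGHT_ANCHOR, END = "<L>", "<R>", "<end>"


def toy_next_token(context, direction, rng):
    """Random stand-in for a trained model: returns a nucleotide or the end token."""
    return rng.choice(["A", "C", "G", "U", END])


def generate_bidirectional(next_token, max_len=20, seed=0):
    """Grow a sequence outward from the anchors, alternating right and left steps."""
    rng = random.Random(seed)
    left, right = [], []  # `left` is stored nearest-to-anchor first
    done = {"left": False, "right": False}
    direction = "right"
    while not (done["left"] and done["right"]) and len(left) + len(right) < max_len:
        if done[direction]:  # a finished side is skipped
            direction = "left" if direction == "right" else "right"
        # Context is the partial sequence in reading order, with anchors in the middle.
        context = left[::-1] + [LEFT_ANCHOR, RIGHT_ANCHOR] + right
        token = next_token(context, direction, rng)
        if token == END:
            done[direction] = True
        elif direction == "right":
            right.append(token)
        else:
            left.append(token)
        direction = "left" if direction == "right" else "right"
    return "".join(left[::-1] + right)
```

Calling `generate_bidirectional(toy_next_token, max_len=10)` returns a string over A, C, G, U of at most 10 characters. The point of the sketch is the ordering: every new token is conditioned on a context that already contains the region around the anchor, rather than only on a prefix starting at a sequence end.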
Evaluation: the model is evaluated across different tasks - synthetic motif reconstruction tasks (SingleBind and DoubleBind) and biological evaluation based on DeepCLIP.
Overall, there are no real comments to be made in this section. The methodology seems sound and well argued. The evaluation criteria also seem well suited to this problem, with the obvious caveat that DeepCLIP is used as a proxy for binding affinity. The authors support this choice by showing that DeepCLIP reliably distinguishes positive and negative experimental binding sequences. The additional diversity and novelty metrics are critical, since in RNA design it is useful to generate a broad range of candidates instead of just optimizing affinity. The benchmark datasets are also appropriate for the task. Indeed, the only thing that requires some attention is the lack of experimental validation, which the authors list as further work in the conclusion.
Theoretical Claims
not applicable
Experimental Designs or Analyses
As explained previously, the experimental design is generally sound, with well-defined synthetic and biological tasks, and comprehensive baseline comparison. The synthetic experiments effectively isolate the impact of bidirectional generation on motif preservation, and comparisons with autoregressive and iterative generative models are appropriate. The biological experiments leverage real interaction datasets and AlphaFold2-derived protein structures to test RNA sequence generation. However, while the model avoids direct training on experimentally derived RNA-protein interaction data, it still depends on AlphaFold2 predictions, which may introduce biases.
Some things to highlight:
- For DeepCLIP, the performance gap between positive and negative experimental sets (area values of 0.88 vs. 0.11) validates DeepCLIP as a reasonable proxy.
- For the comparison with GenerRNA, the authors match the test conditions by using the same proteins and filtering sequences to match length constraints.
- For RNAFLOW comparisons, they selected proteins with zero sequence similarity to the train set to ensure a fair assessment of generalization.
Thus, considerable efforts are made to set up fair comparisons in their experiments.
Supplementary Material
Yes. The supplementary material is comprehensive and very valuable if additional detail is sought. The illustrations clarify some points of the main manuscript and there is lots of detail on data processing, model architecture and experimental conditions and hyperparameters, which helps reproducibility. Finally, there are extended performance comparisons and examples of generated sequences with some further illustrations.
Relation to Broader Scientific Literature
The work connects a few different areas of the literature and extends recent advances in biomolecular design. The BAnG method itself represents an innovation in sequence generation that differs from standard autoregressive models like those commonly used in NLP, and from iterative generation methods like those employed in ESM3. The approach is particularly novel in its focus on bidirectional generation from a central anchoring point. The work relates to the broader literature on RNA-protein interactions, particularly studies focusing on the identification and characterization of RNA binding motifs, the learnings from which are directly incorporated into the model and training design. The geometric attention builds directly on invariant point attention from AF2, and the paper demonstrates its effectiveness in this domain.
Essential References Not Discussed
The paper briefly mentions RNA structure prediction tools but could reference specific methods like RNAfold or SPOT-RNA, or more recent deep learning approaches for RNA structure prediction, to provide context for why existing structural information might be insufficient. Also, the work on protein LMs, which has recently become increasingly relevant in the literature, can give additional references for how to incorporate structure information into these attention-based models. Finally, some advances in multimodal approaches for biomolecular interactions could potentially be discussed as relevant literature.
Other Strengths and Weaknesses
The paper is well written and the illustrations are very clear and helpful. There is lots of content for the reader. The code and model weights are open sourced.
Some points to note that would make the paper stronger:
- The explanation of the BAnG method could benefit from a more intuitive description or an illustration before diving into the mathematical formulation, to help readers unfamiliar with sequence generation methods.
- The authors mention that RNA sequences often contain functional binding motifs embedded within larger sequence contexts but don't provide much detail on how these motifs typically manifest in RNA-protein interactions. A brief explanation of the typical characteristics of these motifs would strengthen the biological motivation for the approach.
Other Comments or Suggestions
not applicable
We appreciate the reviewer’s thoughtful feedback and suggestions. We will add the necessary citations and work on further clarifying our methods and biological explanations to ensure the content is more accessible.
In response to the reviewer’s request, we have conducted ablation studies to assess the impact of non-traditional architecture choices, such as positional encodings in Cross Attention and Geometrical Attention. These studies have demonstrated the valuable contribution of these components to the model’s performance, and the results will be included in the Appendix for further clarity.
We agree that an explanation of RNA-binding motifs would strengthen the biological motivation for our approach. Aptamer RNA binding motifs are short, conserved sequence patterns that serve as recognition sites for RNA-binding proteins (RBPs). Many RBPs specifically recognize distinct nucleotide sequences, such as AU-rich or GU-rich elements, while others rely on secondary structures like hairpins and stem-loops for recognition, though these cases are relatively rare. We will incorporate a brief discussion of these motifs in the revised manuscript, referencing doi.org/10.1038/nature12311, which also provides the evaluation data used in our study.
We have additionally explored the effects of protein data preprocessing on model performance. Our observations indicate that segmenting the protein structure into domains or removing intrinsically disordered regions (IDRs) can improve generation results in certain cases. However, such preprocessing is only feasible if meta-information about protein-RNA interactions is available, as RNAs may interact with multiple domains or IDRs simultaneously. We will add this information into the revision.
Lastly, regarding BAnG’s broader applicability, we plan to extend its use to other biological tasks, such as protein-peptide interactions. However, this would require substantial additional effort and is beyond the scope of the current paper.
We greatly appreciate the reviewer’s valuable comments, which have helped us improve the clarity of our explanations. We hope the added details and planned additions will address the concerns raised by the reviewer and enhance the overall quality of our paper. As the requested revisions were focused on improving clarity, we hope the reviewer will consider increasing their score. If there are any additional specific changes or analyses you would like us to address, we are more than happy to make further revisions.
This paper has a wide range of reviews, from five reviewers; one "reject", three "weak accept", one "accept". Two reviewers raised their scores following the rebuttal period, during which the authors provided additional ablation experiments that strengthen the paper.
The reviewer arguing to reject had some concerns, which I do think are quite valid and shouldn't be dismissed. One of these could largely be addressed by writing; RNA structure data is not an input to the model, but is required as part of preparation of the training data and the paper should be honest about this.
The second concern is regarding evaluation using DeepCLIP. I think this is a fair criticism, and it was shared by some of the other reviewers as well: unlike some of the other settings the authors cite, where e.g. AlphaFold has been well shown to provide good performance and generalisation in a wide variety of contexts, the DeepCLIP models are trained on only a handful of binding partners and as such we do not know or necessarily expect them to be accurate in predicting binding for the (possibly o.o.d.) samples from the proposed model.
As such, I think it is fair to say that the DeepCLIP evaluation metric is unreliable here, or at least unproven. However, I would still recommend acceptance, given the novelty of the method and its appropriateness to the problem at hand, and the fact that the majority of the reviewers are happy to accept the paper despite these issues. Additionally, the reviewers did not suggest a clear alternative computational approach that would be appropriate.