PaperHub
Average rating: 5.5/10 · Rejected · 4 reviewers
Individual ratings: 3, 8, 5, 6 (min 3, max 8, std dev 1.8)
Average confidence: 4.3
Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

Learning the Language of Protein Structure

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-02-05
TL;DR

We propose a new model to learn protein structures in a discrete fashion enabling concise representation and generation.

Abstract

Keywords
Structural Biology; Quantized Representation; Generative Modeling

Reviews and Discussion

Review (Rating: 3)

This paper, "Learning the Language of Protein Structure," proposes a new approach to protein structure modeling by representing the continuous 3D forms of proteins in a discrete, sequence-like format. This work tries to bridge protein sequence and structure modeling using techniques drawn from natural language. The authors introduce a vector-quantized autoencoder to translate 3D protein structures into discrete tokens. This tokenization leverages a codebook of 4096–64000 tokens, achieving effective compression while retaining high fidelity in structure reconstruction with root mean square deviations (RMSD) between 1–5 Å. The learned discrete representations are applied to generate novel and diverse protein structures using a simple GPT model, which demonstrated competitive performance in designing structurally viable proteins against established diffusion-based models.

Strengths

  1. The paper is generally well-structured and clearly written.
  2. The use of vector-quantized autoencoders combined with transformer-based NLP models is an interesting application of NLP methodologies to the domain of protein structure.
  3. The application of finite scalar quantization (FSQ) to manage codebook stability issues shows creativity in overcoming known challenges in quantized representation learning.
  4. The authors explore their model’s capacity with both qualitative and quantitative metrics, including ablation studies over codebook sizes and downsampling ratios.

Weaknesses

  1. Tokenization for protein structures is not entirely novel; works like FoldSeek and several others [1-10] have experimented with tokenized representations. Although the paper claims novelty in its discrete, sequence-based approach, it could better establish its unique contributions by clearly differentiating its tokenization strategy from these previous methods. To strengthen the paper’s contribution, the authors should clarify and empirically demonstrate the advantages of their approach over existing methods.
  2. The primary application explored in the paper is protein structure generation using a GPT model. While this is promising, the paper could be more impactful if it included additional downstream applications to show the broader utility of its tokenized representation.
  3. The technical novelty is quite limited. Almost all the components of this approach are built on existing techniques, like MPNN, FSQ quantization, AlphaFold's structure module, and FAPE loss.

[1] van Kempen, Michel, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. "Foldseek: fast and accurate protein structure search." Biorxiv (2022): 2022-02.

[2] Trinquier, Jeanne, Samantha Petti, Shihao Feng, Johannes Söding, Martin Steinegger, and Sergey Ovchinnikov. "SWAMPNN: End-to-end protein structures alignment." In Machine Learning for Structural Biology Workshop, NeurIPS. 2022.

[3] Lin, Xiaohan, Zhenyu Chen, Yanheng Li, Zicheng Ma, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, and Jun Zhang. "Tokenizing Foldable Protein Structures with Machine-Learned Artificial Amino-Acid Vocabulary." bioRxiv (2023): 2023-11.

[4] Gao, Zhangyang, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, and Stan Z. Li. "Foldtoken: Learning protein language via vector quantization and beyond." arXiv preprint arXiv:2403.09673 (2024).

[5] Li, Mingchen, Yang Tan, Xinzhu Ma, Bozitao Zhong, Huiqun Yu, Ziyi Zhou, Wanli Ouyang, Bingxin Zhou, Liang Hong, and Pan Tan. "ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention." bioRxiv (2024): 2024-04.

[6] Gaujac, Benoit, Jérémie Donà, Liviu Copoiu, Timothy Atkinson, Thomas Pierrot, and Thomas D. Barrett. "Learning the Language of Protein Structure." arXiv preprint arXiv:2405.15840 (2024).

[7] Hayes, Tomas, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil et al. "Simulating 500 million years of evolution with a language model." bioRxiv (2024): 2024-07.

[8] Lu, Amy X., Wilson Yan, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, and Nathan Frey. "Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure." bioRxiv (2024): 2024-08.

[9] Wang, Xinyou, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. "DPLM-2: A Multimodal Diffusion Protein Language Model." arXiv preprint arXiv:2410.13782 (2024).

[10] Lu, Jiarui, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. "Structure Language Models for Protein Conformation Generation." arXiv preprint arXiv:2410.18403 (2024).

Questions

I'm wondering about the key advantages of this work compared to the listed works above. What's your unique contribution?

Comment

We thank the reviewer for their interest in our work and are grateful that they appreciate its narrative and experimental parts.

We address the reviewer's remarks point by point and hope to engage in further discussion.

Tokenization for protein structures is not entirely novel; works like FoldSeek and several others [1-10] have experimented with tokenized representations. Although the paper claims novelty in its discrete, sequence-based approach, it could better establish its unique contributions by clearly differentiating its tokenization strategy from these previous methods. To strengthen the paper’s contribution, the authors should clarify and empirically demonstrate the advantages of their approach over existing methods.

Indeed, structure tokenization is not entirely novel, and some of the methods listed in your references [1-10], such as Foldseek, are already discussed in our work; we explain why our approach is complementary to [1], which only treats protein sub-structures.

Moreover, while we concur with the reviewer that we missed [2], which indeed tokenizes structures, we highlight that none of the subsequent works [4-10] had been preprinted at the time of writing of the manuscript.

Nonetheless, we agree with the reviewer that providing the reader with a fair overview of all methods is essential; we have therefore updated the manuscript to include all related works on protein structure tokenization.

The technical novelty is quite limited. Almost all the components of this approach are built on existing techniques, like MPNN, FSQ quantization, AlphaFold's structure module, and FAPE loss.

Our primary objective in this study is to propose an efficient structure representation and decoding scheme. To do so, we indeed leverage existing techniques, combining them in a novel way that addresses the specific challenges of protein structure modeling.

By carefully integrating these components, we were able to propose a model that achieves significant advances in tokenizing, reconstructing, and generating protein structures with high fidelity and efficiency.

Our carefully designed experiments show the relevance of our work. We therefore firmly believe that the use of efficient, principled, and well-tested approaches is, in that sense, valuable to the community.

Comment

I appreciate your response, but I must admit I’m not fully satisfied with it. To clarify, I’m not asking you to compare these approaches in your experiment section; my request specifically pertains to the related work section. It’s important that you highlight the unique advantages of your method in relation to prior work.

Regarding your statement that “all subsequent works [4-10] were not pre-printed at the time of writing of the manuscript,” I believe referencing the time of submission would be more appropriate and precise than referencing the time of writing.

To be clear, I’m fine with the manuscript being accepted or rejected, but my main concern is that the authors clearly and transparently present their unique contributions. It’s important to acknowledge that your work is not, in fact, the first in this area. The published FoldSeek and the unpublished FoldToken both precede your work. If they came earlier than yours, what particular contributions do you make in this field? Ultimately, the completeness and depth of a contribution may hold more value than its chronological timing.

I believe that explicitly situating your contribution in this context while highlighting its distinctiveness would strengthen the manuscript significantly.

Comment

We appreciate the reviewer’s thoughtful feedback and agree that providing extended context to situate our work is both important and valuable. We also acknowledge the significance of clearly highlighting our contributions and differentiating our work from prior approaches, and we updated the manuscript accordingly with a more comprehensive related works section.

While FoldSeek [1] is one of the earliest works adopting quantization for protein structures, it only focuses on learning a local structural alphabet to describe residue-level features. Despite being highly effective for tasks such as dataset lookups and structure alignments, its inability to provide global representations of protein structures limits its applicability in tasks requiring global information, such as structure generation or binding prediction.

We also acknowledge the concurrent FoldToken [2] work, which we initially omitted. While FoldToken shares some conceptual similarities with our approach, it differs in several important ways that we believe underscore the unique contributions of our work. FoldToken employs joint quantization of both the sequence and the structure, whereas our method focuses exclusively on structural information. The decoupled strategy that we adopt allows for mode-specific pretraining, as done in subsequent works [3, 4].

Additionally, our method introduces a novel ability to compress structural representations through a downsampling ratio r via a cross-attention mechanism, as detailed in the manuscript. This capability, along with the exhaustive investigations we conduct, represents, in our view, a significant contribution to structural representation learning. From a methodological perspective, FoldToken introduces a series of improvements to existing VQ methods aimed at enhancing reconstruction accuracy. By contrast, we adopt the FSQ framework, which reduces the straight-through gradient estimation gap inherent to VQ methods while improving codebook utilization. These algorithmic choices not only distinguish our approach but also contribute to the robustness of our framework.

We agree that explicitly situating our contributions in this context strengthens the manuscript. We revised our manuscript to reflect these distinctions and provide a transparent discussion of FoldToken as an important concurrent work and will try to make it even clearer in the camera ready version. Concretely, whilst FoldSeek is already explicitly discussed in our Related Work section, we have summarised the above into an extended discussion of FoldToken, which we present here for completeness.

FoldToken is a concurrent approach that shares conceptual similarities with our work but differs in key methodological and application-oriented aspects. FoldToken employs joint quantization of sequence and structure, enabling integration across modalities, whereas our method focuses exclusively on structural information. This decoupling allows for mode-specific pretraining, aligning with strategies from subsequent works [3, 4]. Methodologically, FoldToken introduces a series of improvements to existing VQ methods aimed at enhancing reconstruction accuracy; whereas we adopt the FSQ framework, which reduces the straight-through gradient estimation gap inherent to VQ methods while improving codebook utilization. Furthermore, while FoldToken primarily emphasizes backbone inpainting and antibody design, our work considers de novo generation of complete structures.

Thank you again for your valuable feedback, which we believe has significantly improved the clarity and rigor of our manuscript.

[1] van Kempen, Michel, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. "Foldseek: fast and accurate protein structure search." Biorxiv (2022).

[2] Gao, Zhangyang, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, and Stan Z. Li. "Foldtoken: Learning protein language via vector quantization and beyond." arXiv (2024)

[3] Hayes, Tomas, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil et al. "Simulating 500 million years of evolution with a language model." bioRxiv (2024).

[4] Lu, Jiarui, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. "Structure Language Models for Protein Conformation Generation." arXiv (2024).

Comment

We would like to check whether the reviewer has had the opportunity to review the additional elements and analyses provided, notably regarding their latest comments, which further highlight the merits and distinctiveness of our approach.

In this rebuttal we have addressed many of the reviewer's concerns, notably regarding related work, including a thorough discussion, both in the manuscript and in the rebuttal, detailing the specificity of our work.

We would appreciate hearing your feedback.

Review (Rating: 8)

This paper proposes to discretize protein backbone structures so that we can use the well-developed machinery for modeling sequences of discrete tokens to analyze and generate protein structure. The authors use a ProteinMPNN-style embedder, a cross-attention-based discretizer, and an AlphaFold folding-module-based decoder to parametrize finite scalar quantization of protein backbones, showing that it is possible to achieve accurate reconstructions. Finally, they train a simple autoregressive transformer to generate protein backbone structures in this discrete latent space.

Strengths

The paper makes a clear and strong case for discretizing protein backbone structures in order to use discrete sequence models to analyze and generate protein backbones. While the individual parts (discretizing protein structure, generating protein structures, using sequence modeling architectures on discretized protein structures) are present in previous work, this work is unique in its focus, thoroughness, and attention to detail on the quality of the discretization and reconstruction. The main claims are well-supported by the experiments, which do a good job of following the current standards in the field. The paper generally flows very well, and the main points and contributions are clear.

Weaknesses

The two biggest areas for improvement are in explaining the paper's significance in light of the (missing) related work in this area and on the completeness of the ablations and metrics reported. I think this could be a really strong paper if most of these weaknesses are addressed.

Related work

The authors should mention and describe how their work is different from ProTokens, which also aims to discretize protein structures with high reconstruction accuracy, and FoldToken2, which also discretizes protein structures and trains (with very sparse details) flow matching and autoregressive generative models on those tokens.

There are also other works that train sequence models on tokenized structures. SaProt is a masked language model trained on both sequence and structure tokens, while ProstT5 uses tokenized structures to unify structure prediction and inverse folding into a single model. ESM3 trains their own structure tokenizer and claims to be able to co-generate sequence and structure. The paper should mention these related works and explain how it is distinct and/or builds on this prior work.

To be clear, I believe that this paper does make a significant contribution even in the light of these related works. However, not all readers will know the related literature quite as closely, and the paper would be greatly improved by properly situating itself in this context.

Experiments

  1. While Table 1 nicely summarizes reconstruction quality, we need to know more than the mean RMSD and TM. At a minimum, we need to see distributions of reconstruction quality. It would be even nicer to have an analysis of what kinds of structures are reconstructed well and poorly.
  2. Relatedly, the authors use a random split for training and evaluating the quantization and reconstruction. This means that there are likely test structures that are very similar to training structures. The authors should evaluate how well the reconstruction generalizes for structures that are more/less similar to those seen in training.
  3. While it's true that backbone diffusion models generally list the proportion designable, then the proportion of the designable backbones that are novel, then the proportion that are diverse, this isn't what we really care about for unconditional generation. What we actually want is the number of designable backbones that are distinct from each other and from natural backbones, in other words, the number (or proportion) of total backbones generated that are designable and novel and diverse. Given that the models are trained on the PDB, it would be better to do the novelty assessment against the PDB than against CATH.
  4. Another metric for reconstruction quality especially relevant for protein design is whether the reconstructed backbones are predicted to be realized by the original sequence. This could be evaluated, for example, by obtaining the perplexity of the original sequence when designing the reconstructed backbone using ProteinMPNN.
  5. The paper claims that we want to enforce locality with a special attention mechanism during tokenization. This should be ablated.
  6. It would be interesting to see how using a different-sized codebook affects the quality of the downstream generative model. Do we always just want the biggest codebook we can afford?

Other weaknesses

  1. Algorithm 2 needs to be expanded so that the reader has some reasonable chance of being able to implement it. I do not see how the current description facilitates down- or up-sampling.
  2. The paper describes the training hardware for the quantization model but not for the generative model. The training time is not described for either model.
  3. While the paper makes it clear later, the initial section describing the token decoder should explicitly state that it uses the AlphaFold2 structure module architecture.

Questions

In addition to the suggestions in the Weaknesses section:

  1. How much compute was utilized to train the full model (tokenizer + generator)? For reference, in the United States, the White House recently issued an executive order saying that models trained on mostly biological sequences for more than 1e23 FLOPs need to be reported to the federal government. It's unclear whether protein structures count as biological sequences. By my estimate, 128 TPU v4-8s give a max of 140800 teraflops, so assuming a utilization of 50%, you'd get to 1e23 FLOPs after about 396 hours (a back-of-the-envelope check is sketched after this list).
  2. It'd be interesting to know whether the generated length distribution matches that of the training set. It'd also be interesting to see how the designability/novelty/diversity metrics look if the baseline models were sampled from at exactly the same empirical distribution as the samples from the tokenized generator.
  3. I'm curious why the length vs. scRMSD relationship isn't monotonic for this model, as it is for diffusion and flow matching backbone models.
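As a quick check of the arithmetic in question 1 above, a minimal sketch; all figures are taken from that question, not from the paper.

```python
# Back-of-the-envelope check of the FLOP estimate quoted in question 1
# (140,800 peak TFLOP/s, 50% assumed utilization, 1e23 FLOP threshold).
peak_flops_per_s = 140_800 * 1e12      # 140,800 TFLOP/s expressed in FLOP/s
utilization = 0.5                      # assumed sustained utilization
threshold_flops = 1e23                 # reporting threshold from the executive order

seconds = threshold_flops / (peak_flops_per_s * utilization)
hours = seconds / 3600
print(f"~{hours:.0f} hours")           # ~395 hours, consistent with the ~396 h quoted
```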
Comment

We would first like to thank the reviewer for their thorough and insightful review. We address their remarks point by point and hope to engage in a fruitful discussion.

On the related work

Thank you for the constructive feedback on the related work. We agree that situating our work within the latest developments is essential for reader clarity, and we appreciate the reviewer’s acknowledgment of our contribution. We also address this remark in our comment to all reviewers. In response, we have expanded our related work section to include ProstT5 as an example of using FoldSeek local structure tokens to incorporate structural information in language modeling. We also clarify that SaProt is already discussed in our current submission. The rapid pace of progress in this field makes it challenging to propose a comprehensive overview: at the time of writing, newer models such as FoldToken2, ProTokens, and ESM-3 had only recently emerged and are still unreviewed preprints. Nevertheless, we commit to incorporating these recent works to provide readers with a fuller context, while underscoring that our contributions remain significant and complementary in light of these advances.

Relatedly, the authors use a random split for training and evaluating the quantization and reconstruction. This means that there are likely test structures that are very similar to training structures. The authors should evaluate how well the reconstruction generalizes for structures that are more/less similar to those seen in training.

We concur with the reviewer on this point. To address this concern, we retrained all our models on new splits, leaving out entire clusters for validation and test, so as to better assess the generalization capacity of the proposed model. The detailed results are available in the response to all reviewers.

Note that, to further assess the capacity of our model, we also now include reconstruction results on the CASP15 dataset. This dataset, recognized by the community for its complex proteins, contains structures that are not part of the training data.

The detailed results are now available in the main paper and in the answer to all reviewers. Put simply, the results obtained on CASP15 are in line with the results obtained on the test set: for instance, for all trainings with a downsampling factor below two, we report a median RMSD below two Angstroms.

At a minimum, we need to see distributions of reconstruction quality. It would be even nicer to have an analysis of what kinds of structures are reconstructed well and poorly.

We concur with the reviewer. We have updated the manuscript with the distribution of errors on both CASP15 and the newly built dataset split, and we discuss them both in the main body of the paper and in the appendices.

While it's true that backbone diffusion models generally list the proportion designable, then the proportion of the designable backbones that are novel, then the proportion that are diverse, this isn't what we really care about for unconditional generation. What we actually want is the number of designable backbones that are distinct from each other and from natural backbones, in other words, the number (or proportion) of total backbones generated that are designable and novel and diverse. Given that the models are trained on the PDB, it would be better to do the novelty assessment against the PDB than against CATH.

The proportion of backbones that are both novel and designable is indeed a crucial metric for de novo generation. However, we want to highlight that the purpose of our generative experiments is to showcase that our learned tokenizer provides an informative description of protein structures, not to build a new state-of-the-art generative model for protein structures. Our goal is to show that a generative model trained only with simple pretraining methods on our learned representation is already competitive with existing benchmarks. While we concur that novelty would be better estimated against the full PDB rather than against CATH alone, running a one-to-one comparison with such a large database raises a significant computational challenge.

Comment

Another metric for reconstruction quality especially relevant for protein design is whether the reconstructed backbones are predicted to be realized by the original sequence. This could be evaluated, for example, by obtaining the perplexity of the original sequence when designing the reconstructed backbone using ProteinMPNN.

We agree that evaluating whether the proposed backbone can be realized by the original sequence is an important aspect of assessing generative models in protein design. However, we believe that our self-consistency metric already addresses this concern. The self-consistency score compares our predicted backbone with the structure refolded from the sequence generated by ProteinMPNN, effectively testing whether that sequence can be accurately folded into the predicted structure. Since this comparison inherently checks whether the generated backbone is compatible with the original sequence's folding potential, we argue that an additional perplexity-based evaluation may be redundant in this context. This approach ensures that our model not only generates biologically plausible structures but also that they can be realized by the given sequence. Nonetheless, we now show in Appendix Figure 22 how the ProteinMPNN perplexity of the generated amino-acid chains evolves with chain length. Interestingly, we found that ProteinMPNN maintains a low per-residue perplexity across all sequence lengths.
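For readers unfamiliar with the metric, a rough sketch of a typical self-consistency evaluation is given below; the callables are hypothetical placeholders, and the exact protocol (number of designed sequences, folding model, similarity score) may differ from the one used in the paper.

```python
from typing import Callable

def self_consistency(
    backbone,                      # reconstructed or generated backbone coordinates
    design_sequences: Callable,    # e.g. ProteinMPNN: (backbone, n) -> list of sequences
    fold: Callable,                # structure predictor: sequence -> predicted backbone
    score: Callable,               # structural similarity, e.g. TM-score (higher is better)
    n_seqs: int = 8,
) -> float:
    """Generic self-consistency (designability) score: design sequences for the
    backbone, refold them, and keep the best structural agreement with the input."""
    sequences = design_sequences(backbone, n_seqs)
    return max(score(fold(seq), backbone) for seq in sequences)
```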

The paper claims that we want to enforce locality with a special attention mechanism during tokenization. This should be ablated.

This local attention mechanism is a straightforward mask applied to a vanilla cross-attention mechanism. The goal is to prevent the leakage of non-local information. We believe this approach allows for a more meaningful interpretation by preserving a monotonic relationship between the ordering in primary-sequence space and the ordering of the latent tokens. We performed the requested ablation for the 4096 codebook and a downsampling factor of 1: reconstruction performance degrades significantly when the mask is removed, reaching only a median RMSD of 2.9 and a median TM-score of 0.78. Our assumption that the local attention scheme facilitates the learning of the encoder by integrating only local information thus seems to be verified.
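As an illustration only (not the paper's implementation), a locality mask of this kind can be built as a band over a vanilla cross-attention between latent tokens and residues; the window size and anchoring scheme below are hypothetical choices.

```python
import numpy as np

def local_cross_attention_mask(n_residues: int, n_latents: int, window: int = 16) -> np.ndarray:
    """Boolean mask of shape (n_latents, n_residues): latent token j may only
    attend to residues within `window` positions of its anchor j * (n_residues /
    n_latents). This keeps the latent ordering monotonic with respect to the
    primary sequence and blocks non-local information. Illustrative sketch only."""
    stride = n_residues / n_latents                      # downsampling ratio r
    anchors = (np.arange(n_latents) * stride).astype(int)
    residues = np.arange(n_residues)
    return np.abs(residues[None, :] - anchors[:, None]) <= window

mask = local_cross_attention_mask(n_residues=128, n_latents=64, window=16)
# In attention, disallowed positions would receive -inf before the softmax:
# scores = np.where(mask, scores, -np.inf)
```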

It would be interesting to see how using a different-sized codebook affects the quality of the downstream generative model. Do we always just want the biggest codebook we can afford?

This question is indeed very relevant. The main goal of the generation experiment is to provide the reader with a key first step towards understanding how effective the tokenization (and the decoder) can be. In this work, we do not claim to explore how best to generate protein structures given our tokenizer. The codebook size, the generation method, and the finetuning strategy are key aspects that could be explored to refine the generative model; we defer this to future work.

Other weaknesses

Algorithm 2 needs to be expanded so that the reader has some reasonable chance of being able to implement it. I do not see how the current description facilitates down- or up-sampling.

We propose in the new version of the updated PDF a more detailed description of Algorithm 2. We also highlight that the code and weights are open-sourced; the link cannot be shared here for anonymity reasons, but it will be made public in the camera-ready version upon acceptance.

While the paper makes it clear later, the initial section describing the token decoder should explicitly state that it uses the AlphaFold2 structure module architecture.

We aim to be as transparent as possible regarding our design choices. We illustrate in Section 2.1 and Appendix A.1 that our decoder architecture strictly follows the AlphaFold2 structure module. We have made this even more explicit in the updated version.

It'd be interesting to know whether the generated length distribution matches that of the training set. It'd also be interesting to see how the designability/novelty/diversity metrics look if the baseline models were sampled from at exactly the same empirical distribution as the samples from the tokenized generator.

For training the generative model, in order to increase the number of tokens, we oversample sequences with length over 768 and take 10 chunks of size 512. This biases the generation towards producing longer sequences. A distribution of the lengths is available in Appendix A.3.3 for various sampling temperatures. Note that to compute our metrics for the generative experiments, we uniformly sample 8 backbones for every length between 100 and 500, with a step of 5.

We hope to have addressed the reviewers' concerns regarding our work.

Comment

About the computing power

Quantized autoencoder: the original results provided in the paper were obtained with models trained using a large batch size for early convergence. We have since found that such a batch size is not required. We now train all our models with 128 structures per batch; training can be done on a single TPU v4-8 in ~48 hours, and thus in less than ~24 hours on a v4-16.

Generative decoder-only model: training the generative models requires 20 hours for 100k training steps on a single TPU v4-8.

Comment

As requested by the reviewer, we investigated the impact of tokenizer size on the quality of the generation.

We trained decoder-only models using autoencoders with a downsampling factor of 1 and codebook sizes of 432, 1728, 4096, and 64,000.

Notably, with the newly trained autoencoder and tokenizer, we significantly improved the designability of our generations, achieving self-consistency scores of 89.5% for SCTM and 66.5% for SC-RMSD for the codebook of size 432. Comprehensive results are available in a message accessible to all reviewers.

These results underscore that our approach is not only a robust autoencoder for protein structures but also enables us to achieve excellent generation performance.

We hope these new findings provide you with further insights into the effectiveness and potential of our method.

Comment

The authors have done a great job addressing most of my concerns and strengthening the paper. The paper is now properly situated in the existing literature, the methods are clearer, they've backed up the claim that local attention helps, and I'm much more confident that their tokenization method is robust to backbones that aren't in the training set. The improved reconstruction performance also strengthens the submission.

Some points I'd like to address from the discussion:

Another metric for reconstruction quality especially relevant for protein design is whether the reconstructed backbones are predicted to be realized by the original sequence. This could be evaluated, for example, by obtaining the perplexity of the original sequence when designing the reconstructed backbone using ProteinMPNN.

To be clear, I wasn't talking about doing this to evaluate the backbones generated from the autoregressive model. I agree that that would be redundant with the existing self-consistency metrics. I meant to do this as an additional metric for measuring the quality of reconstructions from the tokenizer.

Indeed, the proportion of both novel and designable is indeed a crucial metric when addressing de-novo generation, we want to highlight that the purpose of our generative experiments is to showcase that our learned tokenizer provides an informative description of proteins structures, not to build a new state-of-the-art generative model for protein structures. Our goal is to show that a generative model learned only using simple pretraining methods conducted on our learned representation is indeed already competitive with existing benchmarks.

I agree that the point of the paper is not to show SOTA generative performance and that the existing performance is strong. I'm still curious though, about the proportion of generations that are novel and designable. I think this should be pretty easy to compute given the underlying data used to compute the current metrics. Selfishly, I'd love for the authors to include this in the paper in order to move the field towards reporting this.
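For concreteness, one way such a joint proportion could be computed from per-sample scores is sketched below; the thresholds are common conventions in the backbone-generation literature, not values taken from this paper.

```python
import numpy as np

def joint_proportion(sc_rmsd, tm_to_ref, pairwise_tm,
                     rmsd_thresh=2.0, novel_thresh=0.5, distinct_thresh=0.6):
    """Proportion of generated backbones that are designable AND novel AND
    mutually distinct. `sc_rmsd` and `tm_to_ref` are per-sample arrays,
    `pairwise_tm` is the (n, n) TM-score matrix between generated samples.
    Thresholds are illustrative conventions, not the paper's."""
    sc_rmsd = np.asarray(sc_rmsd)
    tm_to_ref = np.asarray(tm_to_ref)
    pairwise_tm = np.asarray(pairwise_tm)
    n = len(sc_rmsd)
    keep = (sc_rmsd < rmsd_thresh) & (tm_to_ref < novel_thresh)
    # Greedily keep one representative per cluster of structurally similar samples.
    selected = []
    for i in np.flatnonzero(keep):
        if all(pairwise_tm[i, j] < distinct_thresh for j in selected):
            selected.append(i)
    return len(selected) / n
```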

While we concur that the novelty would be better estimated on the full PDB dataset rather than only on the CATH, this raises a significant computational challenge running a one on one comparison with such a large database.

This is fine as long as they run generations from every model against the same database.

I'm also still curious about train time/compute!

Comment

In light of these revisions, I have updated my review from

Soundness: 2 Presentation: 2 Contribution: 3 Recommendation: 5

To

Soundness: 4 Presentation: 3 Contribution: 3 Recommendation: 8

Review (Rating: 5)

This work introduces a method for tokenizing protein backbones. The structure is encoded as a graph so that a GNN can be used as the encoder. The authors use the FSQ method for quantization, and reconstructions are trained using the FAPE loss from AF2. To mitigate the fact that many protein families share similar structures, data is sampled inversely proportionally to the size of its MMseqs-assigned cluster. At a codebook size of K=64000, the average reconstruction has 1.59 Å RMSD and a TM-score of 0.95. The authors further examine the effects of downsampling and of altering the number of codes (4096, 64000), finding that using 4096 codes reduces performance, but not by too much.

Strengths

  • Structure tokenization is an important area of research, setting the foundational work necessary for many downstream uses.
  • Considering the clustered nature of biological data during training is a good practice that should be more commonly adopted in the field
  • Investigating the optimal codebook size is useful both for understanding proteins and for building similar tools

Weaknesses

There have been a number of structure tokenization methods introduced, including all-atom ones [1,2] and backbone-only ones [3,4]. These should at minimum be cited, and preferably compared against as baselines.

It's certainly not fair to discount the novelty of a piece of work because other works have been since preprinted (I believe a version of this work was preprinted earlier than some of the others listed below). However, there are many open questions that authors could dig into, such as:

  • Better understanding of how codebook size affects reconstruction, see [2]
  • Investigating generation with more rigorous benchmarks; current results are sparse and do not perform great against baselines

At the current state of the field, for this work to be of interest to the ICLR 2025 audience, I think the authors should consider probing further into their results. There's a ton of interesting findings, but introducing a tokenizer in itself (especially given the limited backbone-only setting) might not be a significant contribution.

Minor:

  • line 431: typo at ("Eguchi et al., 2022) ...

References:

[1] Bio2Token: All-atom tokenization of any biomolecular structure with Mamba (https://arxiv.org/abs/2410.19110)

[2] Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure (https://www.biorxiv.org/content/10.1101/2024.08.06.606920v1.full.pdf)

[3] FoldToken: Learning Protein Language via Vector Quantization and Beyond (https://arxiv.org/abs/2403.09673)

[4] ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures (https://www.biorxiv.org/content/10.1101/2023.11.27.568722v2)

Questions

  1. It's a bit unclear from the writing what the architecture and output of the decoder are. Is it finetuning the structure module of AlphaFold? Or just using the structure module architecture? Or something different altogether (e.g. using a GNN like the encoder)? It would be helpful to see this clarified.
Comment

There have been a number of structure tokenization methods introduced, including all-atom ones [1,2] and backbone-only [3,4]. These should at minimum be cited, and preferably include a baseline comparison to. It's certainly not fair to discount the novelty of a piece of work because other works have been since preprinted (I believe a version of this work was preprinted earlier than some of the others listed below).

We agree with the reviewer and strongly believe in the relevance of our work. We want to highlight that, at the time of writing, most of this literature was not even preprinted. Nonetheless, as the reviewer correctly notes, the field is rapidly evolving. As stated in the response to all reviewers, we have taken this remark into consideration, significantly expanding our related work section and updating our manuscript accordingly.

Better understanding of how codebook size affects reconstruction, see [2]

In the main text, we present a detailed ablation study on codebook size and refer the reviewer to Table 1 in the main paper for comprehensive results. Here, we summarize the key findings. At a fixed downsampling ratio, the validation RMSD decreases as the number of codes in the codebook increases. This trend is particularly pronounced at a high downsampling ratio (4), where reducing the codebook size by a factor of 16 results in a 40% decrease in reconstruction performance. In contrast, when the downsampling ratio is 1, this decrease is only 20%.

To address your question more thoroughly, we retrained our model with additional codebook sizes of 432 and 1728 to achieve further compression. The resulting median RMSD and TM-scores on CASP15 are 1.75 and 0.90, and 1.33 and 0.94, respectively. These results demonstrate that protein structures can indeed be effectively compressed with a more constrained codebook. Additionally, the manuscript provides an in-depth analysis, including error distributions and visualizations, to illustrate the interplay between downsampling ratios and codebook sizes, which helps better characterize the error distribution on this challenging benchmark.

We hope these new results adequately address the reviewer’s concerns.

Investigating generation with more rigorous benchmarks; current results are sparse and do not perform great against baselines

Thank you again for this thoughtful feedback, which will be beneficial as we continue to refine our evaluation methods; we would be glad to include any concrete suggestions in an updated version of the paper. As stated in the main paper, the generative results are illustrative of the representation power of the method. We recall that the presented generation results are obtained with a decoder-only model trained solely on a next-token-prediction task over our learned tokens. We chose to use our tokenizer as-is to illustrate that even a simple sequence generation method can provide competitive results without any refinement.

There are numerous ways to improve the generation results and we defer them as future work.

At the current state of the field, for this work to be of interest to the ICLR 2025 audience, I think the authors should consider probing further into their results. There's a ton of interesting findings, but introducing a tokenizer in itself (especially given the limited backbone-only setting) might not be a significant contribution.

We strongly disagree with the reviewer. Indeed, as the reviewer points out, several tokenization methods are now being developed, for both efficient information compression and downstream usage; our work therefore very much aligns with the interests of the community. We believe that proposing a well-tested model, based on grounded methods for efficient structure encoding, decoding, and tokenization, is not only relevant to the community but also lays the foundations for a thriving line of research.

Comment

The process for this work to go through peer-review is a particularly somber depiction of the state of the field. I agree and am aware that this was one of the first works to be preprinted on this subject, and it's rather unfair to discount the novelty of a work because the ML conference circuit comes around too infrequently to catch up to the pace of research in this space.

Nonetheless, by the time April 2025 comes around, I'm just not too sure if this paper in its current form will be a significant contribution to readers. This also raises another question about peer review: whether we should grant a paper acceptance because it was one of the first to do something, or whether acceptance should be based on execution. The authors and I might disagree on what "execution" means, so I'll try to pin this down to two questions: will this be the standard tokenizer that people go to because its comprehensive evaluation has convinced us that it should be? Or does it uncover some fascinating insights about how to better do this type of work?

In its current state, I'm not sure this work fulfills either of those two points. Since other tokenizers exist, unless there are benchmarks against other tokenizers, I'm not sure I would choose to use this one in particular. This is exacerbated by the fact that it is backbone-only, and all-atom ones now exist. I'm also not too sure what the takeaways are about what to do when I am building my own tokenizer. The additional codebook sizes are helpful (and I will raise my ratings to address this), but they do not make a significant contribution since there was no analysis of trends or hypotheses as to why these codebook sizes were examined; I would prefer to see a line graph of sorts showing gradual performance change with codebook size, so I can choose a codebook size for my needs. The generative results are also not super rigorous and comprehensive; there is a series of analyses that have become common in the backbone generation literature (beta sheet percentage, sampling time, motif scaffolding, etc.), which this work does not examine.

The authors have put in hard work on this, but I think the depth and rigor of the analyses provided here are insufficient to convince me that ICLR would be the right venue for this work, in its current state.

Comment

"I agree and am aware that this was one of the first works to be preprinted on this subject, and it's rather unfair to discount the novelty of a work because the ML conference circuit comes around too infrequently to catch up to the pace of research in this space."

We thank the reviewer for acknowledging this point; however, we strongly disagree with the statement:

"Nonetheless, by the time April 2025 comes around, I'm just not too sure if this paper in its current form will be a significant contribution to readers."

We have made a strong case that our proposition is of interest to practitioners, thanks to:

  • Information encoding and reconstruction: Our tokenizer achieves reconstruction performance at the level of experimental resolution using only backbone structural information, unlike prior works [2, 3], which incorporate sequence data. This capability highlights the potential of our method to support multimodal protein language models.
  • Generation: when training "a simple off-the-shelf decoder-only transformer" without any refinement, we obtained generation results that are competitive with diffusion models specifically tailored to excel at that task.

"...will this be the standard tokenizer that people should go to because its comprehensive evaluation has convinced us that that should be the case? Or is this uncovering some fascinating insights about how to better do this type of work?"

Our previous answer stresses the methodological and technical merits of our approach. Moreover, thanks to the reviewer's insightful feedback, we have been able to further study how the codebook size and downsampling ratio affect reconstruction.

In conclusion, our present study provides the reader, through extensive analyses, with practical guidelines on how to train and utilize our structure-only tokenizer effectively, clarifying the expected performance for reconstruction and downstream applications.

"Since other tokenizers exist, unless there are benchmarks against other tokenizers, I'm not sure if I would choose to use this one in particular."

As you previously highlighted, we believe that a comprehensive evaluation that has convinced reviewers of its merits, together with open-source code (already accessible, but not disclosed here for anonymity reasons; it will be released with the camera-ready version), constitutes a strong case for considering its use.

"This is exacerbated by the fact that it is backbone only, and all-atom ones now exist. I'm also not too sure what the takeaways are about what to do when I am building my own tokenizer."

We highlight that backbone-only modeling is not a precursor to all-atom modeling but rather a deliberate design choice, as it allows the tokenisation model to run without knowledge of the sequence (which would otherwise be implicit in the side-chain atoms). For our purposes, this holds two primary benefits: (1) the ability to pretrain tokenisation models per modality, and (2) the ability to use our tokenizer in protein design tasks where the sequence is not known a priori; for example, a common workflow for ML-based protein engineering is to first generate the protein backbone (e.g. with RFDiffusion) and then compute the corresponding sequence (e.g. with ProteinMPNN).

We will make sure to further highlight in our manuscript the benefits of this specific design choice.

"The additional codebook sizes are helpful (and I will raise my ratings to address this)"

We are grateful for the reviewer's comments. We hope that the above discussion will allow further reconsideration of the reviewer's score; we note, however, that the suggested rating increase has not yet been applied.

Comment

We would like to check whether the reviewer has had the opportunity to review the additional elements and analyses provided, notably regarding their latest comments, which further highlight the merits and differences of our approach.

In this rebuttal we have addressed many of the reviewer's concerns, including additional analysis of the codebook size and an in-depth analysis of the errors, as requested.

Also, the reviewer stated they would raise their rating, but we have not yet observed such a change.

We would appreciate hearing your feedback.

Comment

As requested by the reviewer, we investigated the impact of tokenizer size on generative quality. We trained decoder-only models using autoencoders with a downsampling factor of 1 and codebook sizes of 432, 1728, 4096, and 64,000.

Notably, with the newly trained autoencoder and tokenizer, we significantly improved the designability of our generations, achieving self-consistency scores of 89.5% for SCTM and 66.5% for SC-RMSD for the codebook of size 432. Comprehensive results are available in a message accessible to all reviewers.

These results underscore that our approach is not only a robust autoencoder for protein structures but also enables us to achieve excellent generation performance.

We hope these new findings provide you with further insights into the effectiveness and potential of our method.

Review (Rating: 6)

The work proposes to use an FSQ-based auto-encoder to tokenize 3D protein structures at the residue level. They use a GNN encoder, IPA decoder, and a length compression scheme. They find that their autoencoder is able to achieve experimental precision in terms of RMSD and measure the RMSD and TM-score as a function of downsampling ratio/compression factor and codebook size. They then train a GPT model on the generated tokens and outperform FrameDiff but not RFDiffusion on scTM and scRMSD scores. They tend to lag behind in terms of novelty/diversity, but they note that there may be a tradeoff between novelty/diversity and validity.

Strengths

  1. The problem is very well-motivated: Tokenizing 3D structures allows for multi-modal integration in downstream language models.
  2. To the best of my knowledge, autoencoder-based tokenizers are pretty novel in the field of computational protein biology.
  3. Experiments, though limited, are sufficient to demonstrate that their tokens can be used for both reconstruction and downstream language modeling.

Weaknesses

  1. An ablation study comparing the FS quantization with VQ would help motivate the architectural choice.
  2. You can include an ablation study demonstrating the importance of invariance in your encoder architecture. Consider trying a non-invariant architecture and see how that impacts downstream language modeling.
  3. The downsampling of the sequence makes it harder for the tokenizer to reconstruct the structure, but also makes it easier for the language model to learn. It would be great to demonstrate this tradeoff in your experiments. Also, I think the same can be said about the invariant feature representation.
  4. You may want to compare with a hand-crafted tokenizer (e.g. BPE-based tokenizer) in terms of reconstruction accuracy, downstream generation validity/diversity.

Questions

  1. Have you tried extending to side chain conformations?
Comment

An ablation study comparing the FS quantization with VQ would help motivate the architectural choice.

We began by exploring traditional VQ-VAE-based architecture. However, training the encoder within this framework proved challenging, necessitating numerous adjustments—such as k-means initialization, EMA updates, and periodic resets—to maintain stability. Despite these efforts, the approach remained unstable, frequently suffering from latent collapse or poor expressivity. In contrast, our current FSQ-based autoencoder yields results that are significantly more robust and stable, outperforming the VQ-VAE approach by several orders of magnitude.
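For context, a simplified sketch of the finite scalar quantization step is shown below. The bounding function is a simplified variant and the level configuration is hypothetical; this is not the paper's exact implementation.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels) -> np.ndarray:
    """Finite scalar quantization sketch: each latent dimension i is squashed
    into (0, 1) and rounded to one of levels[i] values. The implicit codebook
    size is prod(levels), with no learned codebook that could collapse."""
    levels = np.asarray(levels)
    u = (np.tanh(z) + 1.0) / 2.0          # bound each dimension to (0, 1)
    q = np.round(u * (levels - 1))        # integers in {0, ..., L_i - 1}
    # During training one would add a straight-through estimator so gradients
    # bypass the non-differentiable rounding.
    return q

def fsq_token_id(q: np.ndarray, levels) -> np.ndarray:
    """Map per-dimension codes to a single token id in [0, prod(levels))."""
    levels = np.asarray(levels)
    strides = np.cumprod(np.concatenate(([1], levels[:-1])))
    return (q * strides).sum(axis=-1).astype(int)

# Hypothetical configuration: 6 dims with 8/8/8/5/5/5 levels -> 64,000 codes.
levels = [8, 8, 8, 5, 5, 5]
z = np.random.randn(4, len(levels))       # continuous encoder outputs
tokens = fsq_token_id(fsq_quantize(z, levels), levels)
```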

You can include an ablation study demonstrating the importance of invariance in your encoder architecture. Consider trying a non-invariant architecture and see how that impacts downstream language modeling.

The use of invariant/equivariant features in structure modeling is an open debate in the field. While we concur with the reviewer that ablating this component would add application-specific evidence to the debate, it is not the primary goal of our work. Nevertheless, we hypothesize that the invariant encoder is important in our work, as it maintains desirable properties of the latent space: it ensures that two structures differing only by a rotation or a translation have similar latent representations. We also include a non-equivariant transformer encoder as a natural baseline, so as to provide the reader with a comparison point, and we report the corresponding autoencoding metrics.

The downsampling of the sequence makes it harder for the tokenizer to reconstruct the structure, but also makes it easier for the language model to learn. It would be great to demonstrate this tradeoff in your experiments. Also, I think the same can be said about the invariant feature representation.

Our rationale behind these choices is grounded in the observed training behavior of language models, as documented by Hoffmann et al. [1]. A common rule of thumb when training decoder-only models is to ensure there are approximately 20 observed tokens per parameter, adjusted by the size of the vocabulary. Achieving this balance typically requires avoiding downsampling, as downsampling would significantly reduce the number of observed tokens, limiting model performance. While we agree with the reviewer that exploring the training of decoder-only models with higher downsampling ratios is promising, we must point out that it would introduce challenges: increased downsampling would reduce the amount of training data, impairing model training capacity. Additionally, the hyperparameters required for training with different downsampling rates would likely diverge significantly, complicating direct comparisons between models.

In that sense, this paves the way for another line of research studying the optimal tokenizer for a given generation strategy (here, next-token prediction).

[1]: Hoffmann et al, Training Compute-Optimal Large Language Models
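To illustrate the rule of thumb invoked above, a toy calculation with purely hypothetical numbers (not the paper's dataset or model sizes):

```python
# Chinchilla-style rule of thumb: ~20 training tokens per model parameter.
# Illustrative numbers only; not the paper's actual dataset or model sizes.
n_residues_in_dataset = 100_000_000     # hypothetical total number of residues
downsampling_ratio = 4                  # r: residues per structure token
tokens_available = n_residues_in_dataset // downsampling_ratio

tokens_per_param = 20
max_params = tokens_available // tokens_per_param
print(f"{tokens_available:,} tokens -> compute-optimal model of ~{max_params:,} parameters")
# Larger r shrinks the token budget and hence the model size this heuristic supports.
```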

You may want to compare with a hand-crafted tokenizer (e.g. BPE-based tokenizer) in terms of reconstruction accuracy, downstream generation validity/diversity.

Our language model is trained on a next-token-prediction task using our learned tokenizer. We motivate this choice to illustrate that even a simple sequence generation method can provide competitive results without any refinement. Naturally, we could apply further refinements either in the tokenization, leveraging BPE, or in the generation, for instance applying preference fine-tuning methods; however, we believe that demonstrating the "vanilla" generation power is more useful to practitioners and opens the way for further work.

Have you tried extending to side chain conformations?

We thank the reviewer for this comment. Including side-chain conformations could indeed provide the model with further valuable information and, as suggested, we plan to explore this in future work.

Comment

As requested by the reviewer, we investigated the impact of tokenizer size on generative quality. We trained decoder-only models using autoencoders with a downsampling factor of 1 and codebook sizes of 432, 1728, 4096, and 64,000.

Notably, with the newly trained autoencoder and tokenizer, we significantly improved the designability of our generations, achieving self-consistency scores of 89.5% for SCTM and 66.5% for SC-RMSD for the codebook of size 432. Comprehensive results are available in a message accessible to all reviewers.

These results underscore that our approach is not only a robust autoencoder for protein structures but also enables us to achieve excellent generation performance.

We hope these new findings provide you with further insights into the effectiveness and potential of our method.

Comment

We thank the reviewers for their valuable comments and suggestions.

As highlighted by several reviewers, protein structure quantization has recently gained significant momentum in the research community. Foldseek [1], one of the earliest and most established works in this area, introduces a structural vocabulary that encodes only local geometry. It demonstrates its utility for fast database searches, and subsequent works [2, 11] have leveraged this vocabulary to train structure-aware protein language models. However, as noted in our manuscript, these approaches lack the capacity to provide global representations of protein structures, as the vocabulary encodes only local structural features.

As the reviewers noted, several works are closely related to our method. Specifically, [12, 13, 14, 15] employ VQ-VAE or variations thereof to train protein structure tokenizers. We would like to emphasize that these works are either concurrent with or subsequent to our own. Similarly, many recent studies have explored the use of tokenized structures in training multimodal generative models, including [4, 7, 9, 10, 16].

We agree with the reviewers that including these references provides readers with a more comprehensive understanding of the field and its recent advancements. Accordingly, we have updated the manuscript to reflect these developments.

Novel Results

Data Split and Novel Training

As suggested by reviewer j5Mn, due to our initial uniform sampling, our test performance may overestimate the reconstruction power of our model, for instance on OOD samples. To address the reviewer's remark and to show the generalization power of our algorithm, we redesigned our training scheme: we randomly select 90% of the clusters for training and use the remaining clusters for validation and testing. Among these 10% withheld protein clusters, we retain 20% for validation, the remaining 80% being used for testing. This ensures that the test structures were never seen during training. This has also been updated in the main paper.
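A minimal sketch of such a cluster-level split is given below; the 90/10 and 20/80 fractions are those stated above, while the cluster mapping and random seed are placeholders.

```python
import random
from collections import defaultdict

def cluster_split(chain_to_cluster: dict, seed: int = 0):
    """Split chains so that whole clusters are held out: 90% of clusters go to
    training; of the remaining 10%, 20% go to validation and 80% to test."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for chain, cluster in chain_to_cluster.items():
        clusters[cluster].append(chain)

    ids = list(clusters)
    rng.shuffle(ids)
    n_train = int(0.9 * len(ids))
    held_out = ids[n_train:]
    n_val = int(0.2 * len(held_out))

    expand = lambda cluster_ids: [c for cid in cluster_ids for c in clusters[cid]]
    return expand(ids[:n_train]), expand(held_out[:n_val]), expand(held_out[n_val:])
```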

Metrics Reported

This new training scheme allows us to further demonstrate the capacity of our model. We now report:

  1. The average reconstruction metrics on the test set, which contains only structures from left-out clusters, i.e. clusters carefully excluded from the training set.
  2. The median reconstruction metrics (due to the limited dataset size) on CASP15 structures, which are known to be structurally complex.

With this retraining, we have also been able to improve our reconstruction metrics by more than 10% across all downsampling ratios and codebook sizes, while training with a smaller batch size, drastically reducing the computational power necessary for training.

Novel Results, Ablations and Novel Codebook sizes!

| Codebook Size | df | CASP15-RMSD | CASP15-TM | Test RMSD | Test-TM |
|---------------|----|-------------|-----------|-----------|---------|
| 4096          | 1  | 1.25        | 0.94      | 1.55      | 0.94    |
| 64000         | 1  | 0.94        | 0.97      | 1.22      | 0.96    |
| 4096          | 2  | 1.73        | 0.89      | 2.22      | 0.90    |
| 64000         | 2  | 1.82        | 0.90      | 1.95      | 0.92    |
| 4096          | 4  | 2.79        | 0.77      | 4.1       | 0.81    |
| 64000         | 4  | 2.55        | 0.80      | 2.96      | 0.86    |

Table 1: Updated results for the main experiments, with additional metrics on CASP15.

To also validate our proposition in restricted-information settings, as requested by reviewer (pVnp), we now include new experiments demonstrating the capacity of our algorithm to train with restricted codebook sizes: for a downsampling ratio of 1, we reduce the codebook to 1728 and even 432 codes.

Codebook Size | df | CASP15-RMSD | CASP15-TM | Test RMSD | Test-TM
432           | 1  | 1.75        | 0.89      | 2.09      | 0.91
1728          | 1  | 1.33        | 0.94      | 1.79      | 0.93

Table 2: Novel reconstruction results with limited codebook sizes

As requested by reviewer (j5Mn), we now provide in Appendix A.4 boxplots of the RMSD and TM-score for the CASP15 results, as well as the RMSD and TM-score distributions for the test set.
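
For context on how codebook sizes such as 432 and 1728 arise with finite scalar quantization (FSQ): the implicit codebook size is the product of the per-dimension level counts. The sketch below illustrates a generic FSQ rounding step; the specific level choices (e.g., (6, 6, 12) giving 432 codes, (12, 12, 12) giving 1728) are illustrative assumptions, not necessarily the configuration used in our model.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(6, 6, 12)) -> torch.Tensor:
    """Finite scalar quantization: bound each latent channel and round it to one
    of L_i integer levels. The implicit codebook size is the product of `levels`
    (6 * 6 * 12 = 432 here); the level choices are illustrative only."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    # For even level counts, shift so that rounding yields exactly L distinct values.
    offset = (L % 2 == 0).to(z.dtype) * 0.5
    shift = torch.atanh(offset / half)
    bounded = torch.tanh(z + shift) * half - offset
    quantized = torch.round(bounded)
    # Straight-through estimator: gradients bypass the non-differentiable rounding.
    return bounded + (quantized - bounded).detach()

# Codebook sizes follow directly from the per-dimension levels, e.g.:
#   (6, 6, 12)   -> 432 codes       (12, 12, 12)       -> 1728 codes
#   (8, 8, 8, 8) -> 4096 codes      (8, 8, 8, 5, 5, 5) -> 64000 codes
```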

Comment

Ablations

As requested by reviewers (j5Mn, Ai4U), we included two ablations for the 4096 codebook size and a downsampling ratio of 1:

  1. Removing the local attention.
  2. A non-equivariant encoder (replacing the GNN with a non-equivariant graph transformer). To ensure it has access to the same information as our original encoder, we modify the regular transformer architecture as follows (a minimal sketch is given after this list):
  • Initial embeddings: the flattened edge features concatenated with the residues' backbone positions (making the encoder non-equivariant).
  • Attention-score bias: a linear transformation of the original edge features is added to the attention scores, injecting edge information into the embedding computation.
  • Key offsetting: each key is the average of the original key and a linear transformation of the node's flattened edge features.
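
Below is a minimal sketch of this attention modification, restricted to a single head for brevity; the module names, feature dimensions, and tensor shapes are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class EdgeBiasedAttention(nn.Module):
    """Single-head attention with (i) an additive bias on the attention scores
    computed from pairwise edge features and (ii) keys offset by a linear map
    of each node's flattened edge features."""

    def __init__(self, d_model: int, d_edge: int, d_node_edges: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.edge_bias = nn.Linear(d_edge, 1)                 # pairwise edge feats -> scalar bias
        self.key_offset = nn.Linear(d_node_edges, d_model)    # node's flattened edge feats -> key space

    def forward(self, x, pair_edges, node_edges):
        # x:          (N, d_model)       node embeddings
        # pair_edges: (N, N, d_edge)     pairwise edge features
        # node_edges: (N, d_node_edges)  flattened edge features per node
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Key offsetting: average the original key with an edge-derived key.
        k = 0.5 * (k + self.key_offset(node_edges))
        scores = q @ k.T / x.shape[-1] ** 0.5
        # Bias the attention scores with a linear transform of the edge features.
        scores = scores + self.edge_bias(pair_edges).squeeze(-1)
        return torch.softmax(scores, dim=-1) @ v
```
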
Model               | # Codes | df | CASP15-RMSD | CASP15-TM | Test RMSD | Test-TM
GraphTransformer    | 4096    | 1  | 1.30        | 0.94      | 1.76      | 0.94
Non Local Attention | 4096    | 1  | 2.9         | 0.78      | 4.9       | 0.83

Table 3: Ablation Studies

We observe that our local attention scheme enables very stable training, and removing it significantly hampers performance. Interestingly, the graph-transformer architecture we propose performs on par with our original encoder.

Main paper corrections

All corrections, additions, and new results are now included in the main paper, notably in the related work and experimental sections.

[1] van Kempen, Michel, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. "Foldseek: fast and accurate protein structure search." bioRxiv (2022): 2022-02.

[2] Trinquier, Jeanne, Samantha Petti, Shihao Feng, Johannes Söding, Martin Steinegger, and Sergey Ovchinnikov. "SWAMPNN: End-to-end protein structures alignment." In Machine Learning for Structural Biology Workshop, NeurIPS. 2022.

[3] Lin, Xiaohan, Zhenyu Chen, Yanheng Li, Zicheng Ma, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, and Jun Zhang. "Tokenizing Foldable Protein Structures with Machine-Learned Artificial Amino-Acid Vocabulary." bioRxiv (2023): 2023-11.

[4] Gao, Zhangyang, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, and Stan Z. Li. "Foldtoken: Learning protein language via vector quantization and beyond." arXiv preprint arXiv:2403.09673 (2024).

[5] Li, Mingchen, Yang Tan, Xinzhu Ma, Bozitao Zhong, Huiqun Yu, Ziyi Zhou, Wanli Ouyang, Bingxin Zhou, Liang Hong, and Pan Tan. "ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention." bioRxiv (2024): 2024-04.

[6] Gaujac, Benoit, Jérémie Donà, Liviu Copoiu, Timothy Atkinson, Thomas Pierrot, and Thomas D. Barrett. "Learning the Language of Protein Structure." arXiv preprint arXiv:2405.15840 (2024).

[7] Hayes, Tomas, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil et al. "Simulating 500 million years of evolution with a language model." bioRxiv (2024): 2024-07.

[8] Lu, Amy X., Wilson Yan, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, and Nathan Frey. "Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure." bioRxiv (2024): 2024-08.

[9] Wang, Xinyou, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. "DPLM-2: A Multimodal Diffusion Protein Language Model." arXiv preprint arXiv:2410.13782 (2024).

[10] Lu, Jiarui, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. "Structure Language Models for Protein Conformation Generation." arXiv preprint arXiv:2410.18403 (2024).

[11] Su, Jin, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. "SaProt: Protein Language Modeling with Structure-aware Vocabulary." ICLR, 2024.

[12] Z. Gao, C. Tan, and S. Z. Li. "VQPL: Vector quantized protein language." arXiv, 2023.

[13] X. Lin, Z. Chen, Y. Li, X. Lu, C. Fan, Z. Cao, S. Feng, Y. Q. Gao, and J. Zhang. "ProTokens: A machine-learned language for compact and informative encoding of protein 3D structures." bioRxiv, 2023.

[14] Y. Liu, L. Chen, and H. Liu. "Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions." bioRxiv, 2023.

[15] A. Liu, A. Elaldi, N. Russell, and O. Viessmann. "Bio2Token: All-atom tokenization of any biomolecular structure with Mamba." arXiv, 2024.

[16] Z. Gao, C. Tan, and S. Z. Li. "FoldToken2: Learning compact, invariant and generative protein structure language." bioRxiv, 2024.

Comment

The biggest discrepancy between me and the other reviewers seems to be whether the paper has sufficient technical novelty. In my view, while neither of the components (structure tokenization and autoregressive generation) is novel on its own, this is the first work I've seen that combines the two. This was probably an obvious application from the moment FoldSeek was released, but it is nevertheless important to have a well-executed first attempt that improves the tokenizer and characterizes performance against the current standard diffusion methods.

Comment

Dear reviewers,

Tomorrow (Nov 26) is the last day for asking questions to the authors. With this in mind, please read the rebuttal provided by the authors earlier today, as well as the other reviews. Please explicitly acknowledge that you have read the rebuttal and reviews, provide your updated view accompanied by a motivation, and raise any outstanding questions for the authors.

Timeline: As a reminder, the review timeline is as follows:

  • November 26: Last day for reviewers to ask questions to authors.
  • November 27: Last day for authors to respond to reviewers.
  • November 28 - December 10: Reviewer and area chair discussion phase.

Thank you again for your hard work,

Your AC

Comment

As requested by several reviewers (Ai4U, pVnp, j5Mn), we explored the impact of codebook size on generative quality.

To this end, we use the autoencoders trained with a downsampling factor of 1 and codebook sizes of 432, 1728, 4096, and 64,000. Using the corresponding tokenized datasets (generated with the respective encoder and codebook), we then trained four decoder-only transformer models for next-token prediction. We report designability and diversity results computed on 256 structures generated with each decoder-only model, using hyperparameters similar to those described in the manuscript. All decoder-only models were trained with the same batch size, learning rate, and other hyperparameters to ensure consistency.

Codebook Size | SCTM (%) | SC-RMSD (%) | Diversity (%)
432           | 89.5     | 66.5        | 63
1728          | 88.6     | 64.8        | 66
4096          | 87.2     | 61.8        | 60
64,000        | 84.2     | 56.2        | 61.3

Table 1: Generative self-consistency and diversity metrics
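
For concreteness, a minimal sketch of what such a decoder-only next-token-prediction setup over structure tokens can look like; the model class, sizes, maximum length, and training-loop details below are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class TinyStructureGPT(nn.Module):
    """Decoder-only (causal) transformer over structure tokens; sizes are illustrative."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4,
                 n_heads: int = 8, max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T) structure codes
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (B, T, vocab) logits

def train_step(model, optimizer, tokens, pad_id):
    """One next-token-prediction step: predict token t+1 from tokens <= t."""
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), tokens[:, 1:].reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```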

Key Insights

The results presented here were achieved using a generator trained exclusively on the next-token prediction task, without any additional fine-tuning or refinement. They suggest that, given the available training data, a smaller vocabulary size can effectively produce high-quality protein structures. We boosted our self-consistency results from an SC-RMSD of 42% (for the 4096 codebook) to 61.8% with the same codebook size, and even 66.5% with a codebook size of 432.

Significance of Findings

We believe these additional experiments further validate our proposed framework, demonstrating its dual capability as:

  • A robust autoencoder for protein structures, and

  • A promising initial step toward de novo structure generation.

These new results suggest that our approach is closing the gap with tailored generative methods such as RFdiffusion, offering a new and scalable pathway for protein design. We will include these results in the camera-ready version of our manuscript and make the pretrained generation weights available.

AC Meta-Review

This submission ended up with mixed reviews. While initially all reviewers except one were leaning towards recommending rejection, the lively discussion with the authors led to a more split situation, with two reviewers in favor of acceptance and two against.

Several reviewers asked the authors to better position their work relative to (very recent) related work in the related work section. The authors complied with this request and updated their related work section. Several reviewers, in particular Q6dB and pVnp, raised concerns about insufficient evaluation of the proposed method, indicating this could limit the potential community interest in this submission. Concerns were raised about the need to compare or ablate different tokenization choices. Furthermore, reviewer pVnp found there were missing metrics for the protein generation task that are used elsewhere in the literature (beta-sheet percentage, sampling time, motif scaffolding, etc.). During the discussion phase, the authors put in significant effort to increase the thoroughness of their evaluation, for instance by looking into the effect of codebook sizes. The authors did not include the additional metrics requested by pVnp, but reviewer j5Mn was of the opinion that extending to these suggested metrics is a nice-to-have and not a must-have for acceptance. While I agree that the additional metrics are a nice-to-have, and that this alone is not sufficient to reject the paper, I recommend rejecting this paper based on the concerns shared by reviewers Q6dB and pVnp that it has not presented sufficient empirical evidence of the value of the proposed tokenization approach on downstream tasks. I would like to thank the authors and reviewers for engaging in the discussions and strongly encourage the authors to resubmit an improved version of their work, with more in-depth ablations, to a future conference or venue.

Additional Comments from the Reviewer Discussion

During the discussion period, two reviewers increased their scores from 3 to 5 and from 5 to 8, respectively, and one reviewer lowered their score from 5 to 3. The extensive work the authors have done to generate additional results, and the effort they have put into better positioning their paper with respect to related work, have clearly contributed to the increase in ratings, but have unfortunately not sufficiently addressed the concerns shared by two reviewers.

Final Decision

Reject