Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas 77843, United States.
TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, Texas 77840, United States.
J Chem Inf Model. 2020 Dec 28;60(12):5667-5681. doi: 10.1021/acs.jcim.0c00593. Epub 2020 Sep 30.
Although data on protein sequences and structures are accumulating rapidly, the number of protein architectural types (or structural folds) is small and limited. This study addresses the following question: how well can one reveal underlying sequence-structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response, we have developed novel deep generative models, namely, semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design quality, we build our models on conditional Wasserstein GAN (WGAN), which uses the Wasserstein distance in the loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into WGAN as a loss to guide model training, and (3) exploiting sequence data with and without paired structures to enable a semisupervised training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds than a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one that is not even among the basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE designs. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign) by generating design seeds and tailoring the design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence-structure data.
Data, source code, and trained models are available at https://github.com/Shen-Lab/gcWGAN.
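To make contribution (2) concrete, the following is a minimal sketch of how oracle feedback might be combined with a Wasserstein objective. It is not taken from the gcWGAN repository. The function names, the cross-entropy form of the feedback term, and the `feedback_weight` parameter are all assumptions made for illustration, and the gradient penalty or weight clipping that WGAN training normally requires is omitted. Critic scores and oracle probabilities are passed in as plain NumPy arrays rather than produced by real networks.

```python
import numpy as np


def critic_loss(critic_real, critic_fake):
    """Wasserstein critic loss to minimize: E[f(fake)] - E[f(real)].

    critic_real / critic_fake are critic scores for real and generated
    sequences (hypothetical inputs; no regularization term is included).
    """
    return float(np.mean(critic_fake) - np.mean(critic_real))


def oracle_feedback_loss(oracle_probs, target_fold):
    """Assumed feedback term: cross-entropy between the oracle's predicted
    fold distribution for each generated sequence and the target fold.

    oracle_probs has shape (batch, n_folds); target_fold is a column index.
    """
    p_target = np.clip(oracle_probs[:, target_fold], 1e-12, 1.0)
    return float(-np.mean(np.log(p_target)))


def generator_loss(critic_fake, oracle_probs, target_fold, feedback_weight=1.0):
    """Generator loss: Wasserstein term plus weighted oracle feedback.

    feedback_weight is a hypothetical hyperparameter trading off realism
    (critic score) against agreement with the desired fold (oracle).
    """
    wasserstein_term = -float(np.mean(critic_fake))
    feedback_term = oracle_feedback_loss(oracle_probs, target_fold)
    return wasserstein_term + feedback_weight * feedback_term


if __name__ == "__main__":
    # Toy numbers only: two real and two generated sequences, two folds.
    real_scores = np.array([1.0, 3.0])
    fake_scores = np.array([0.0, 1.0])
    probs = np.array([[0.5, 0.5], [0.25, 0.75]])
    print("critic loss:", critic_loss(real_scores, fake_scores))
    print("generator loss:", generator_loss(fake_scores, probs, target_fold=1))
```

In this sketch, a generated sequence lowers the generator loss both by scoring well with the critic and by being assigned to the target fold by the oracle, which is one plausible reading of "incorporating its feedback into WGAN as a loss".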