Department of Biochemistry, University of Bayreuth, Bayreuth, Germany.
Institute of Informatics and Applications, University of Girona, Girona, Spain.
Nat Commun. 2022 Jul 27;13(1):4348. doi: 10.1038/s41467-022-32007-7.
Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins display natural amino acid propensities, while disorder predictions indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2 sequences yields well-folded non-idealized structures with embodiments and large loops and reveals topologies not captured in current structure databases. ProtGPT2 generates sequences in a matter of seconds and is freely available.
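The comparison of amino acid propensities between generated and natural sequences can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the function names `amino_acid_frequencies` and `total_variation_distance` are hypothetical, the choice of total variation distance as the summary statistic is an assumption made for illustration, and the sequences in the usage lines are toy strings rather than data from the study.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequences):
    """Return the relative frequency of each standard amino acid
    across a collection of protein sequences (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        # Ignore non-standard symbols such as X, B, Z or gap characters.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def total_variation_distance(freq_a, freq_b):
    """Half the L1 distance between two frequency profiles:
    0 means identical composition, 1 means no residue type shared."""
    return 0.5 * sum(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)


# Toy usage with made-up sequences (assumption: not real data).
generated = amino_acid_frequencies(["MKTAYIAKQR", "MLSDEELLKK"])
natural = amino_acid_frequencies(["MKVLAAGIVG", "MSEQNNTEMT"])
distance = total_variation_distance(generated, natural)
```

A small distance between the two profiles would be consistent with the generated set having a natural-like composition, although composition alone says nothing about order, fold, or function.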