Google Research, Cambridge, MA, USA.
The Francis Crick Institute, London, UK.
Nat Biotechnol. 2022 Jun;40(6):932-937. doi: 10.1038/s41587-021-01179-w. Epub 2022 Feb 21.
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.