Protein sequence-to-structure learning: Is this the end(-to-end revolution)?

Affiliations

Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France.

Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA.

Publication information

Proteins. 2021 Dec;89(12):1770-1786. doi: 10.1002/prot.26235. Epub 2021 Sep 22.

DOI:10.1002/prot.26235

PMID:34519095

Abstract

The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
