Department of Computer Science, Stanford University, Stanford, California, USA.
Proteins. 2023 Aug;91(8):1089-1096. doi: 10.1002/prot.26494. Epub 2023 May 9.
Machine learning research on protein structure has surged in recent years, with promising advances for basic science and drug discovery. Working with macromolecular structure in a machine learning context requires an adequate numerical representation, and researchers have extensively studied representations such as graphs, discretized 3D grids, and distance maps. As part of CASP14, we explored a new and conceptually simple representation in a blind experiment: atoms as points in 3D, each with associated features. These features, initially just the element type of each atom, are updated through a series of neural network layers featuring rotation-equivariant convolutions. Starting from all atoms, we further aggregate information at the level of alpha carbons before making a prediction at the level of the entire protein structure. We find that this approach yields competitive results in protein model quality assessment despite its simplicity, its minimal prior information, and its relatively small training set. Its performance and generality are particularly noteworthy in an era when highly complex, customized machine learning methods such as AlphaFold 2 have come to dominate protein structure prediction.
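To make the representation concrete, the following is a minimal sketch of the data layout and the two-stage pooling the abstract describes (all atoms → alpha carbons → whole structure). It is illustrative only: the toy coordinates, element vocabulary, and nearest-alpha-carbon mean pooling are assumptions standing in for the learned, rotation-equivariant convolution layers of the actual method.

```python
import numpy as np

# Hypothetical toy data: a point cloud of atoms, each carrying a coordinate
# and a feature vector (here a one-hot over an assumed element vocabulary).
ELEMENTS = ["C", "N", "O", "S"]

def one_hot(element):
    """One-hot encode an element symbol from the assumed vocabulary."""
    v = np.zeros(len(ELEMENTS))
    v[ELEMENTS.index(element)] = 1.0
    return v

rng = np.random.default_rng(0)
coords = rng.normal(size=(12, 3))                 # 12 atoms as points in 3D
elements = ["C", "N", "O", "C", "C", "S",
            "C", "N", "C", "O", "C", "C"]
feats = np.stack([one_hot(e) for e in elements])  # (12, 4) per-atom features
ca_idx = np.array([0, 4, 8])                      # indices of alpha carbons

# Stage 1: aggregate atom features onto the nearest alpha carbon
# (a crude stand-in for the rotation-equivariant convolution layers).
dists = np.linalg.norm(coords[:, None, :] - coords[ca_idx][None, :, :], axis=-1)
nearest_ca = dists.argmin(axis=1)                 # (12,) assignment to a C-alpha
ca_feats = np.stack([feats[nearest_ca == k].mean(axis=0)
                     for k in range(len(ca_idx))])  # (3, 4)

# Stage 2: aggregate alpha-carbon features into one structure-level vector,
# which a final prediction head would map to a model quality score.
structure_feat = ca_feats.mean(axis=0)            # (4,)
print(structure_feat.shape)
```

Mean pooling is used here only because it is the simplest permutation-invariant reduction; the published method learns this aggregation rather than hard-coding it.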