Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures.
Affiliations
College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China.
High Performance Computing Center, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao 266237, China.
Publication information
J Chem Inf Model. 2022 Sep 12;62(17):4008-4017. doi: 10.1021/acs.jcim.2c00885. Epub 2022 Aug 25.
The structure of a protein is of great importance in determining its function, and this relationship can be leveraged to train data-driven prediction models. However, the scarcity of available protein structures severely limits the performance of these models. AlphaFold2 and its open-source data set of predicted protein structures offer a promising solution to this problem, and these predicted structures are expected to improve model performance by increasing the number of training samples. In this work, we constructed a new data set to serve as a benchmark and implemented a state-of-the-art structure-based approach to determine whether the performance of a function prediction model can be improved by adding AlphaFold-predicted structures to the training set. We further compared the performance of two models trained separately, one on real structures only and the other on AlphaFold-predicted structures only. Experimental results indicated that structure-based protein function prediction models could benefit from virtual training data consisting of AlphaFold-predicted structures. First, model performance improved in all three categories of Gene Ontology terms (GO terms) after predicted structures were added as training samples. Second, the model trained only on AlphaFold-predicted virtual samples achieved performance comparable to that of the model trained on experimentally solved real structures, suggesting that predicted structures were almost equally effective for predicting protein function.