Large-scale predicting protein functions through heterogeneous feature fusion.

Affiliation

School of Computer Science and Engineering, Central South University, 410000 Changsha, China.

Publication information

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad243.

Abstract

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods typically extend the relatively small number of experimentally determined functions to large collections of proteins using various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold-predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. Coverage improves for two reasons: AlphaFold has greatly increased the number of predicted structures, and PredGO can make extensive use of non-structural information for function prediction. Moreover, we show that over 205 000 (~100%) human entries in UniProt are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.
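The abstract does not give implementation details, so the following is only a minimal sketch of what attention-based fusion of heterogeneous features followed by multi-label GO prediction could look like. It is not PredGO's architecture. The function names (`fuse_features`, `predict_go_terms`), the query vector, the four-dimensional toy embeddings and the per-term weights are all hypothetical, and in a real model the query and weights would be learned rather than hand-set.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, query):
    """Attention-weighted sum of equally sized feature vectors.

    features: dict mapping a source name to its embedding (hypothetical).
    query:    vector used to score each source (hypothetical, hand-set here).
    Returns the fused vector and the attention weight given to each source.
    """
    dim = len(query)
    # Scaled dot-product score of each source against the query.
    scores = [
        sum(x * q for x, q in zip(vec, query)) / math.sqrt(dim)
        for vec in features.values()
    ]
    weights = softmax(scores)
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features.values()))
        for i in range(dim)
    ]
    return fused, dict(zip(features, weights))


def predict_go_terms(fused, term_weights, term_biases):
    """One independent sigmoid per GO term, since a protein can carry many terms."""
    probs = {}
    for term, w in term_weights.items():
        logit = sum(w_i * x_i for w_i, x_i in zip(w, fused)) + term_biases[term]
        probs[term] = 1.0 / (1.0 + math.exp(-logit))
    return probs


# Toy embeddings standing in for sequence, structure and interaction features.
features = {
    "sequence": [0.2, -0.1, 0.7, 0.4],
    "structure": [0.9, 0.3, -0.2, 0.1],
    "interaction": [-0.4, 0.6, 0.5, 0.0],
}
query = [0.5, 0.5, 0.5, 0.5]

# Toy per-term linear classifiers; the numbers carry no biological meaning.
term_weights = {
    "GO:0005515": [1.0, 0.5, -0.3, 0.2],
    "GO:0003824": [-0.6, 0.8, 0.4, 0.1],
}
term_biases = {"GO:0005515": 0.0, "GO:0003824": -0.5}

fused, weights = fuse_features(features, query)
probs = predict_go_terms(fused, term_weights, term_biases)
print(weights)
print(probs)
```

Sigmoid outputs are used instead of a single softmax over terms because GO annotation is a multi-label problem: each term is scored on its own, and a threshold on each probability would decide which terms are assigned.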

