Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion.

Affiliations

School of Computer Science and Engineering at Sun Yat-sen University.

Sun Yat-sen Memorial Hospital at Sun Yat-sen University.

Publication information

Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad117.

Abstract

Protein function prediction is an essential task in bioinformatics that benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further improved by exploiting homology information and accounting for the overlapping communities of proteins with related functions through the label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5, 27.3 and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well to non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source code and trained models of SPROF-GO are available at https://github.com/biomed-AI/SPROF-GO. The SPROF-GO web server is freely available at http://bio-web1.nscc-gz.cn/app/sprof-go.
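
The abstract does not give the label diffusion formula, so the following is only a minimal sketch of generic label propagation over a homology graph, not SPROF-GO's actual implementation. The update rule Y <- alpha * S * Y + (1 - alpha) * Y0, the function names, the alpha value and the toy similarity matrix are all assumptions made for illustration.

```python
def row_normalize(weights):
    """Scale each row of a non-negative matrix to sum to 1 (all-zero rows stay zero)."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def diffuse_labels(weights, initial_scores, alpha=0.5, iterations=50):
    """Generic label propagation (assumed form): Y <- alpha * S @ Y + (1 - alpha) * Y0.

    weights: n x n hypothetical homology similarities between proteins
    initial_scores: n x k starting scores per function term
                    (known annotations or a model's raw predictions)
    """
    transition = row_normalize(weights)
    n = len(initial_scores)
    k = len(initial_scores[0])
    scores = [row[:] for row in initial_scores]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            new_row = []
            for t in range(k):
                # Weighted average of the neighbours' current scores for term t
                neighbor_avg = sum(transition[i][j] * scores[j][t] for j in range(n))
                # Blend the neighbourhood signal with the protein's own starting score
                new_row.append(alpha * neighbor_avg + (1 - alpha) * initial_scores[i][t])
            updated.append(new_row)
        scores = updated
    return scores


# Toy data (hypothetical): protein 0 is a query with a weak prediction;
# proteins 1 and 2 are annotated homologs, and protein 1 is much more similar.
weights = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
initial = [
    [0.2, 0.2],  # query: uncertain about both terms
    [1.0, 0.0],  # close homolog annotated with term 0
    [0.0, 1.0],  # distant homolog annotated with term 1
]
final = diffuse_labels(weights, initial)
```

In this toy setting the query's score for the first term rises above its initial 0.2 and ends higher than its score for the second term, because its closest homolog carries the first term. Row normalization keeps every score between 0 and 1 when the starting scores are in that range.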
