HFSP: high speed homology-driven function annotation of proteins.

Author information

Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA.

Computational Biology & Bioinformatics - i12 Informatics, Technical University of Munich (TUM), Munich, Germany.

Publication information

Bioinformatics. 2018 Jul 1;34(13):i304-i312. doi: 10.1093/bioinformatics/bty262.

DOI:10.1093/bioinformatics/bty262

PMID:29950013

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6022561/

Abstract

MOTIVATION

The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods.

RESULTS

Here we describe homology-derived functional similarity of proteins (HFSP), a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (85% precision) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 16% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts.
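The idea of scoring a protein pair from alignment length and sequence identity can be sketched as a length-dependent identity threshold: the score is the distance of the observed percent identity above or below a curve that demands more identity for shorter alignments. The abstract does not give the curve, so the functional form and every constant below (`scale`, `exponent`, the length bounds, `floor`) are assumptions used for illustration and should be checked against the full paper. The function names are hypothetical, and the alignment length and identity are assumed to come from MMseqs2 search results.

```python
import math


def identity_threshold(length, scale=770.0, exponent=0.33,
                       short=11, long=450, floor=28.4):
    """Percent identity required at a given alignment length.

    All default constants are illustrative assumptions, not values
    stated in the abstract.
    """
    if length <= short:
        # Very short alignments are never trusted: require 100% identity.
        return 100.0
    if length > long:
        # Long alignments fall back to a constant threshold.
        return floor
    # In between, the required identity decays with alignment length.
    return scale * length ** (-exponent * (1.0 + math.exp(-length / 1000.0)))


def hfsp_score(pide, length, **curve):
    """Distance of the observed percent identity from the threshold curve."""
    return pide - identity_threshold(length, **curve)


def same_function(pide, length, cutoff=0.0):
    """Call a pair functionally similar if its score reaches the cutoff."""
    return hfsp_score(pide, length) >= cutoff


if __name__ == "__main__":
    # Two hypothetical alignments of the same length, different identity.
    print(same_function(60.0, 200))
    print(same_function(25.0, 200))
```

Raising `cutoff` trades coverage for precision, so a real annotation pipeline would pick it from a benchmark rather than use the default of zero assumed here.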

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7878/6022561/654f6e7a1f3d/bty262f1.jpg
