Significant speedup of database searches with HMMs by search space reduction with PSSM family models.

Author information

Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.

Publication information

Bioinformatics. 2009 Dec 15;25(24):3251-8. doi: 10.1093/bioinformatics/btp593. Epub 2009 Oct 14.

Abstract

MOTIVATION

Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.

RESULTS

We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs, as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrix family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefiltering the set of sequences passed to a subsequent database search with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet results were obtained in a fraction of the runtime, a speedup of more than 64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of a factor of 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of a factor of 92.
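To make the prefiltering idea concrete, here is a minimal Python sketch of two of the ingredients named above: an exact p-value for PSSM match scores and a score-threshold filter. It is an illustration under stated assumptions, not the PoSSuMsearch2 implementation. It assumes integer PSSM scores, an i.i.d. background model, a toy four-letter alphabet, and a naive sliding-window scan in place of the full-text index. Fragment chaining is omitted, and all function names are hypothetical.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of match scores under an i.i.d. background.

    pssm: list of dicts (one per position), residue -> integer score.
    background: dict residue -> probability (assumed to sum to 1).
    Returns a dict mapping each attainable score to its probability.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for residue, q in background.items():
                nxt[score + column[residue]] += prob * q
        dist = dict(nxt)
    return dist


def p_value(dist, threshold):
    """Probability that a random window scores at least `threshold`."""
    return sum(p for s, p in dist.items() if s >= threshold)


def threshold_for(dist, max_p):
    """Lowest attainable score whose p-value does not exceed max_p."""
    tail, best = 0.0, None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > max_p:
            break
        best = score
    return best  # None if even the top score is too likely


def best_window_score(pssm, seq):
    """Highest PSSM score over all windows of seq (naive scan, no index)."""
    m = len(pssm)
    scores = [
        sum(col[c] for col, c in zip(pssm, seq[i:i + m]))
        for i in range(len(seq) - m + 1)
    ]
    return max(scores) if scores else None


def prefilter(pssm, sequences, threshold):
    """Names of sequences with at least one window reaching the threshold."""
    kept = []
    for name, seq in sequences.items():
        best = best_window_score(pssm, seq)
        if best is not None and best >= threshold:
            kept.append(name)
    return kept


# Toy data (hypothetical): 4-letter alphabet, PSSM of length 2.
BACKGROUND = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
PSSM = [
    {"A": 2, "C": -1, "D": -1, "E": 0},
    {"A": -1, "C": 3, "D": 0, "E": -1},
]
DIST = score_distribution(PSSM, BACKGROUND)
CUTOFF = threshold_for(DIST, max_p=0.1)
CANDIDATES = prefilter(PSSM, {"hit": "DDACEE", "miss": "DDDDDD"}, CUTOFF)
```

Only the sequences in `CANDIDATES` would then be handed to the expensive pHMM search. The p-value cutoff controls how many random sequences pass the filter, which is the trade-off between speedup and retained results that the abstract reports.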

AVAILABILITY

The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/.

CONTACT

beckstette@zbh.uni-hamburg.de

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
