FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

Author information

Mier Pablo, Andrade-Navarro Miguel A

Affiliations

1 Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany.

2 Institute of Molecular Biology, Mainz, Germany.

Publication information

J Comput Biol. 2016 Apr;23(4):270-8. doi: 10.1089/cmb.2015.0191. Epub 2016 Feb 1.

Abstract

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automatically derived annotations that help in understanding their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, grouping very similar protein sequences based on near-full-length similarity and/or a high threshold of sequence identity. We compressed 50 reference proteomes along with the SwissProt database; SwissProt could be compressed by 74.7%. The clustering algorithm was benchmarked using OrthoBench and compared with FASTA HERDER, a previous version of the algorithm, showing that FastaHerder2 can cluster a set of proteins with high compression and a lower error rate than its predecessor. We illustrate the use of FastaHerder2 to detect biologically relevant functional features in protein families. With our approach we seek to promote a modern view and usage of protein sequence databases that is more appropriate to the postgenomic era.
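
The abstract does not spell out the clustering procedure, so the following is only an illustrative sketch of the general idea it names: grouping sequences that pass both a near-full-length (coverage) test and an identity threshold, then reporting how much the set was compressed. It is not the FastaHerder2 algorithm. The function names, the 0.9 thresholds, the greedy longest-first strategy, the use of Python's `difflib` in place of a real aligner, and the definition of compression as the fractional reduction in the number of entries are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def identity_and_coverage(a, b):
    """Return (identity, coverage) for two sequences.

    Assumption: difflib's matching blocks stand in for a real alignment.
    Identity = matched residues / shorter length; coverage = shorter / longer.
    """
    shorter, longer = sorted((len(a), len(b)))
    if shorter == 0:
        return 0.0, 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / shorter, shorter / longer


def greedy_cluster(seqs, min_identity=0.9, min_coverage=0.9):
    """Greedy clustering: longest sequences first, compare to representatives.

    `seqs` maps a sequence name to its residues. Each cluster is a list of
    names whose first element is the representative. Thresholds are
    hypothetical defaults, not values taken from the paper.
    """
    clusters = []
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for cluster in clusters:
            ident, cov = identity_and_coverage(seqs[cluster[0]], seqs[name])
            if ident >= min_identity and cov >= min_coverage:
                cluster.append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([name])
    return clusters


def compression(n_sequences, n_clusters):
    """Fractional reduction in entries (assumed meaning of 'compression')."""
    return 1 - n_clusters / n_sequences


if __name__ == "__main__":
    toy = {
        "p1": "MKTAYIAKQRQISFVKSHFSRQ",
        "p2": "MKTAYIAKQRQISFVKSHFSRE",  # differs from p1 at the last residue
        "p3": "MSDNGPQNQRNAPRITFGGP",    # unrelated sequence
    }
    groups = greedy_cluster(toy)
    print(groups)
    print(f"compression: {compression(len(toy), len(groups)):.1%}")
```

Under this reading, a database compressed by 74.7% would retain about a quarter as many cluster entries as it had sequences. A production tool would replace the `difflib` comparison with a proper alignment or a fast heuristic search.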

