FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases.

Author information

Pablo Mier, Miguel A. Andrade-Navarro

Affiliations

1 Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany.

2 Institute of Molecular Biology, Mainz, Germany.

Publication information

J Comput Biol. 2016 Apr;23(4):270-8. doi: 10.1089/cmb.2015.0191. Epub 2016 Feb 1.

DOI: 10.1089/cmb.2015.0191

PMID: 26828375

Abstract

The accelerated growth of protein databases offers great possibilities for the study of protein function using sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. It is necessary to group the many very similar sequences, creating clusters with automatically derived annotations that help to understand their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, grouping very similar protein sequences based on near-full-length similarity and/or a high threshold of sequence identity. We compressed 50 reference proteomes, along with the SwissProt database, which we could compress by 74.7%. The clustering algorithm was benchmarked using OrthoBench and compared with FASTA HERDER, a previous version of the algorithm, showing that FastaHerder2 can cluster a set of proteins with high compression and a lower error rate than its predecessor. We illustrate the use of FastaHerder2 to detect biologically relevant functional features in protein families. With our approach, we seek to promote a modern view and usage of protein sequence databases that is more appropriate to the postgenomic era.
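
To make the clustering idea in the abstract concrete, the following is a minimal Python sketch; it is not the FastaHerder2 implementation. It assumes a greedy scheme in which each sequence joins the first cluster whose representative it matches. The function names (similarity, greedy_cluster, compression), the 0.9 threshold, and the use of difflib.SequenceMatcher as a crude stand-in for alignment-based identity are all illustrative assumptions. Dividing the matched residues by the longer sequence length approximates the near-full-length requirement. Compression is assumed here to mean the fraction of entries removed when each cluster is replaced by one representative; the paper may define it differently.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Crude identity proxy (assumption): matched residues / longer length.

    Normalising by the longer sequence penalises fragments, which mimics a
    near-full-length requirement without running a real alignment.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], min_identity: float = 0.9) -> List[List[str]]:
    """Greedy clustering sketch: longest sequences become representatives.

    Each cluster is a list of sequence ids; the first id is the representative.
    """
    clusters: List[List[str]] = []
    # Process longer sequences first so that they act as representatives.
    for seq_id in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for cluster in clusters:
            representative = seqs[cluster[0]]
            if similarity(representative, seqs[seq_id]) >= min_identity:
                cluster.append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([seq_id])
    return clusters


def compression(n_sequences: int, n_clusters: int) -> float:
    """Fraction of entries removed by keeping one entry per cluster (assumed definition)."""
    return 1.0 - n_clusters / n_sequences


if __name__ == "__main__":
    # Hypothetical toy sequences, invented for illustration only.
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRE",  # one substitution relative to "a"
        "c": "WWWWWWWWWWWWWWWWWWWWWW",  # unrelated
        "d": "MKTAYIAKQR",              # fragment of "a"
    }
    result = greedy_cluster(toy)
    print(result)
    print(compression(len(toy), len(result)))
```

In this toy input, the near-identical pair is merged, while the unrelated sequence and the short fragment each remain in their own cluster, because the fragment covers less than half of the representative. A real tool would replace the similarity function with alignment-based identity and coverage measures.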

Similar articles

Incremental generation of summarized clustering hierarchy for protein family analysis.

Bioinformatics. 2004 Nov 1;20(16):2586-96. doi: 10.1093/bioinformatics/bth290. Epub 2004 May 6.

A functional hierarchical organization of the protein sequence space.

BMC Bioinformatics. 2004 Dec 14;5:196. doi: 10.1186/1471-2105-5-196.

A novel approach for clustering proteomics data using Bayesian fast Fourier transform.

Bioinformatics. 2005 May 15;21(10):2210-24. doi: 10.1093/bioinformatics/bti383. Epub 2005 Mar 15.

Detection of orphan domains in Drosophila using "hydrophobic cluster analysis".

Biochimie. 2015 Dec;119:244-53. doi: 10.1016/j.biochi.2015.02.019. Epub 2015 Feb 28.

Cluster-C, an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques.

Comput Biol Chem. 2004 Jul;28(3):211-8. doi: 10.1016/j.compbiolchem.2004.03.002.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Revealing remote protein homology with sequence similarity and a modularity-based approach.

Theor Biol Forum. 2011;104(1):57-68.

Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization.

Biochem Biophys Res Commun. 2006 Aug 18;347(1):150-7. doi: 10.1016/j.bbrc.2006.06.059. Epub 2006 Jun 21.

Toward completion of the Earth's proteome: an update a decade later.

Brief Bioinform. 2019 Mar 22;20(2):463-470. doi: 10.1093/bib/bbx127.

Cited by

Geometric characterisation of disease modules.

Appl Netw Sci. 2018;3(1):10. doi: 10.1007/s41109-018-0066-3. Epub 2018 Jun 18.

Manifold learning and maximum likelihood estimation for hyperbolic network embedding.

Appl Netw Sci. 2016;1(1):10. doi: 10.1007/s41109-016-0013-0. Epub 2016 Nov 15.

The latent geometry of the human protein interaction network.

Bioinformatics. 2018 Aug 15;34(16):2826-2834. doi: 10.1093/bioinformatics/bty206.

Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length.

Genome Biol Evol. 2018 Mar 1;10(3):816-825. doi: 10.1093/gbe/evy046.

The Protein Structure Context of PolyQ Regions.

PLoS One. 2017 Jan 26;12(1):e0170801. doi: 10.1371/journal.pone.0170801. eCollection 2017.

dAPE: a web server to detect homorepeats and follow their evolution.

Bioinformatics. 2017 Apr 15;33(8):1221-1223. doi: 10.1093/bioinformatics/btw790.

Efficient embedding of complex networks to hyperbolic space via their Laplacian.

Sci Rep. 2016 Jul 22;6:30108. doi: 10.1038/srep30108.

CABRA: Cluster and Annotate Blast Results Algorithm.

BMC Res Notes. 2016 Apr 30;9:253. doi: 10.1186/s13104-016-2062-y.
