A fast hierarchical clustering algorithm for large-scale protein sequence data sets.

Author information

Szilágyi Sándor M, Szilágyi László

Affiliations

Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania.

Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania.

Publication information

Comput Biol Med. 2014 May;48:94-101. doi: 10.1016/j.compbiomed.2014.02.016. Epub 2014 Mar 4.

DOI:10.1016/j.compbiomed.2014.02.016

PMID:24657908

Abstract

TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to speed up the time-consuming expansion. The proposed algorithm was tested on protein sequence databases such as SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude compared to recently published efficient solutions, reducing the total runtime to well below 1 min for the 11,944 proteins of SCOP95. This improvement in computation time is reached without any loss of partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values.
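The main loop described in the abstract (expansion, inflation, then normalization) can be illustrated with a minimal dense sketch. This is a textbook-style Markov clustering loop, not the authors' implementation: it uses neither their modified sparse matrix structure nor their symmetric squaring formula, which the abstract does not specify. The function names, the inflation value of 2.0, the convergence tolerance, and the rule used to read off cluster labels are all assumptions made for illustration.

```python
import numpy as np


def normalize_columns(m):
    """Rescale every column to sum to 1 (the probabilistic constraint)."""
    sums = m.sum(axis=0, keepdims=True)
    sums[sums == 0.0] = 1.0  # avoid dividing by zero for empty columns
    return m / sums


def markov_cluster(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Dense, unoptimized MCL loop; returns the final stochastic matrix.

    inflation, max_iter and tol are illustrative defaults (assumptions),
    not values recommended by the paper.
    """
    m = normalize_columns(np.asarray(similarity, dtype=float))
    for _ in range(max_iter):
        expanded = m @ m                       # expansion: matrix squaring
        inflated = expanded ** inflation       # inflation: element-wise power
        updated = normalize_columns(inflated)  # restore column sums of 1
        converged = np.abs(updated - m).max() < tol
        m = updated
        if converged:
            break
    return m


def cluster_labels(m, eps=1e-6):
    """Label each column by the first row that still holds mass in it.

    Simplified read-out (an assumption): it expects every column to keep
    at least one entry above eps, e.g. thanks to self-similarity.
    """
    return [int(np.flatnonzero(m[:, j] > eps)[0]) for j in range(m.shape[1])]


# Toy similarity matrix with two disconnected groups: {0, 1, 2} and {3, 4}.
sim = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
])
print(cluster_labels(markov_cluster(sim)))  # -> [0, 0, 0, 3, 3]
```

In this dense form the expansion step costs on the order of n³ operations per iteration, which is the bottleneck the paper targets with its sparse structure and symmetry-based squaring.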
