A fast hierarchical clustering algorithm for large-scale protein sequence data sets.

Author information

Szilágyi Sándor M, Szilágyi László

Affiliations

Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania.

Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania.

Publication information

Comput Biol Med. 2014 May;48:94-101. doi: 10.1016/j.compbiomed.2014.02.016. Epub 2014 Mar 4.

Abstract

TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to speed up the time-consuming expansion step. The proposed algorithm was tested on protein sequence databases such as SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude compared to recently published efficient solutions, reducing the total runtime to well below 1 min for the 11,944 proteins of SCOP95. This improvement in computation time is achieved without any loss of partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values.
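The main loop described in the abstract (expansion, inflation, and column normalization) can be illustrated with a minimal dense sketch in plain Python. This is the generic Markov clustering iteration only, not the authors' implementation: it uses neither the modified sparse matrix structure nor the fast symmetric squaring formula. The function names, the toy similarity matrix, the inflation value of 2.0, and the tolerances are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (the probabilistic constraint)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the stochastic matrix (plain O(n^3) product)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise each entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Alternate expansion and inflation until the matrix stops changing."""
    m = normalize_columns([[float(x) for x in row] for row in similarity])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(a - b) for ra, rb in zip(nxt, m) for a, b in zip(ra, rb))
        m = nxt
        if delta < tol:
            break
    return m


def clusters(m, eps=1e-6):
    """Read clusters off the limit matrix: each row that still carries
    mass is an attractor, and the columns it covers form one cluster."""
    n = len(m)
    found = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)


# Toy symmetric similarity matrix (hypothetical values): items 0-2 are
# strongly similar, items 3-4 are strongly similar, and a weak 0.1 link
# joins item 2 to item 3.
S = [
    [1.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
]

if __name__ == "__main__":
    print(clusters(mcl(S)))
```

In this sketch the dense product in `expand` dominates the cost, which is consistent with the abstract singling out expansion as the time-consuming step. A larger inflation exponent generally yields finer partitions, so the value used here should be treated as a starting point rather than a recommendation.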
