Parallel clustering algorithm for large-scale biological data sets.

Authors

Wang Minchao, Zhang Wu, Ding Wang, Dai Dongbo, Zhang Huiran, Xie Hao, Chen Luonan, Guo Yike, Xie Jiang

Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.

School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China; High Performance Computing Center, Shanghai University, Shanghai, P.R.China.

Publication information

PLoS One. 2014 Apr 4;9(4):e91315. doi: 10.1371/journal.pone.0091315. eCollection 2014.

Abstract

BACKGROUND

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and much longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between pairs of data points, a similarity matrix must be constructed before it can run, and constructing that matrix is itself time-consuming.
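To make the role of the similarity matrix concrete, here is a minimal serial sketch of affinity propagation in plain Python. It follows the standard responsibility and availability message updates of the original algorithm (Frey and Dueck) and is not the authors' implementation. The function name `affinity_propagation`, the `damping` and `iterations` parameters, the negative squared distance as similarity, the preference value of -10 and the toy data are all assumptions made for illustration. The sketch keeps three dense N x N matrices (`S`, `R`, `A`) and touches every entry in each iteration, which is presumably the time and space bottleneck described above.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Serial affinity propagation on a dense similarity matrix S (list of lists).

    S[k][k] holds the "preference" of point k to become an exemplar.
    Returns, for each point i, the index of the exemplar it selects.
    Illustrative sketch only: fixed iteration count, no convergence check.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities  a(i, k)
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            first = scores[best]
            second = max((scores[k] for k in range(n) if k != best),
                         default=float("-inf"))
            for k in range(n):
                competitor = second if k == best else first
                new = S[i][k] - competitor
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            positive[k] = R[k][k]  # self-responsibility enters unclipped
            total = sum(positive)
            for i in range(n):
                if i == k:
                    new = total - R[k][k]
                else:
                    new = min(0.0, total - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # each point selects the k that maximizes a(i,k) + r(i,k)
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Toy 1-D data with two well-separated groups (hypothetical values).
points = [0.0, 1.0, 3.0, 20.0, 21.0, 23.0]
preference = -10.0  # assumed; controls how many exemplars emerge
S = [[preference if i == j else -(x - y) ** 2 for j, y in enumerate(points)]
     for i, x in enumerate(points)]
labels = affinity_propagation(S)
print(labels)
```

On this toy input the first three points should share one exemplar and the last three another. Doubling the number of points quadruples the size of each matrix, which is why the memory footprint grows so quickly with the data set.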

METHODS

This paper proposes two types of parallel architectures to accelerate the construction of the similarity matrix and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. A suitable scheme for data partitioning and reduction is designed to minimize the global communication cost among processes.
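The shared-memory construction step can be sketched as follows, assuming a row-block partition in which every worker fills a disjoint set of rows of one shared matrix. The abstract does not specify the partitioning scheme or the threading technology, so `row_blocks`, `similarity_matrix`, the `workers` parameter and the negative squared Euclidean distance are hypothetical choices for illustration. Threads from Python's standard `concurrent.futures` module stand in for the shared-memory workers. In CPython the global interpreter lock prevents a real speedup for pure-Python arithmetic, so the example shows only the partitioning idea, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def neg_sq_dist(x, y):
    """Negative squared Euclidean distance (an assumed similarity measure)."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))


def row_blocks(n, workers):
    """Split row indices 0..n-1 into `workers` contiguous, near-equal blocks."""
    base, extra = divmod(n, workers)
    blocks, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        blocks.append(range(start, stop))
        start = stop
    return blocks


def similarity_matrix(points, workers=4, similarity=neg_sq_dist):
    """Fill an n x n similarity matrix, one block of rows per worker thread."""
    n = len(points)
    S = [[0.0] * n for _ in range(n)]  # single matrix shared by all threads

    def fill(rows):
        # Each worker writes only its own rows, so no locking is needed.
        for i in rows:
            for j in range(n):
                S[i][j] = similarity(points[i], points[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(fill, row_blocks(n, workers)))
    return S


demo = similarity_matrix([(0.0,), (1.0,), (3.0,)], workers=2)
print(demo)
```

Because the rows are disjoint, the workers never exchange data while filling the matrix. A distributed version of the clustering step would additionally need some form of communication between processes, which is the cost the partitioning and reduction scheme mentioned above aims to minimize.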

RESULTS

A speedup of 100 is achieved with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm can handle large-scale data sets effectively. The parallel affinity propagation algorithm also performs well when clustering large-scale gene (microarray) data and detecting families in large protein superfamilies.
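As a rough reading of the reported figures, the short calculation below converts a speedup of 100 on 128 cores into a parallel efficiency and an implied serial fraction. The serial fraction uses the Karp-Flatt metric, which is an analysis added here for illustration and not one the authors report. The function names are hypothetical, and the result treats the reported speedup as exact.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup, cores):
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


eff = parallel_efficiency(100, 128)
frac = karp_flatt_serial_fraction(100, 128)
print(f"efficiency = {eff:.3f}")
print(f"implied serial fraction = {frac:.4f}")
```

Under these assumptions the efficiency is about 78%, and the implied non-parallel share of the work is roughly 0.2%, which would be consistent with a design that keeps global communication low.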
