Parallel clustering algorithm for large-scale biological data sets.

Authors

Wang Minchao, Zhang Wu, Ding Wang, Dai Dongbo, Zhang Huiran, Xie Hao, Chen Luonan, Guo Yike, Xie Jiang

Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.

School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China; High Performance Computing Center, Shanghai University, Shanghai, P.R.China.

Publication information

PLoS One. 2014 Apr 4;9(4):e91315. doi: 10.1371/journal.pone.0091315. eCollection 2014.

DOI:10.1371/journal.pone.0091315

PMID:24705246

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3976248/

Abstract

BACKGROUND

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix must be constructed before running it, and that construction itself takes a long time.
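To make the similarity-matrix step concrete, here is a minimal serial sketch in Python with NumPy. The abstract does not say which similarity measure the authors use. The negative squared Euclidean distance and the median-valued diagonal ("preference") below are the common defaults from the original affinity propagation formulation, taken here as assumptions. The function name `similarity_matrix` is hypothetical.

```python
import numpy as np

def similarity_matrix(points, preference=None):
    """Build an n x n similarity matrix for affinity propagation.

    Assumption: similarity is the negative squared Euclidean distance.
    The diagonal holds the "preference" of each point to be an exemplar;
    if none is given, the median off-diagonal similarity is used.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise differences via broadcasting: shape (n, n, d).
    diff = points[:, None, :] - points[None, :, :]
    sim = -np.sum(diff ** 2, axis=-1)
    if preference is None:
        off_diagonal = sim[~np.eye(n, dtype=bool)]
        preference = np.median(off_diagonal)
    np.fill_diagonal(sim, preference)
    return sim

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
S = similarity_matrix(points)
print(S)
```

Both memory and work grow quadratically with the number of points, which is the bottleneck the abstract describes. Each row depends only on the input points, so rows can in principle be computed independently of one another.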

METHODS

This paper proposes two parallel architectures to accelerate similarity matrix construction and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and computing capacity. A data partition and reduction scheme is designed to minimize the global communication cost among processes.

RESULTS

A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
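For orientation, the following is a minimal serial sketch of the standard affinity propagation message updates in Python with NumPy. It is not the authors' parallel implementation. The function name, the damping value, the iteration count, and the toy data are all assumptions chosen for illustration. The comments mark where a row-wise partition would need a cross-process reduction, which is one plausible reading of the "data partition and reduction" mentioned in the abstract rather than a description of the paper's actual scheme.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iterations=200):
    """Serial affinity propagation on a similarity matrix S (n x n)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(iterations):
        # Responsibility update: needs only row i of A and S, so a
        # row-wise partition could do this step without communication.
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = np.max(AS, axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # Availability update: needs column sums over all rows. Under a
        # row-wise partition (an assumption), this sum is the step that
        # would require a reduction across processes.
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        col_sums = Rp.sum(axis=0)
        A_new = np.minimum(0, col_sums[None, :] - Rp)
        np.fill_diagonal(A_new, col_sums - np.diag(Rp))
        A = damping * A + (1 - damping) * A_new

    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Assign every point to its most similar exemplar.
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Toy data: two well-separated groups on a line.
x = np.array([0.0, 1.0, 2.5, 20.0, 21.0, 23.0])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(len(x), dtype=bool)]))
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

The three n x n matrices (S, R, A) are what make memory the limiting factor for large n, which is consistent with the abstract's motivation for running this stage on a distributed system.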
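As a quick worked calculation, the reported speedup can be converted into parallel efficiency (speedup divided by core count). If one additionally assumes Amdahl's law holds, it also gives an implied serial fraction. The Amdahl step is an assumption made here for illustration, the abstract does not make that claim, and both function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / cores

def amdahl_serial_fraction(speedup, cores):
    """Serial fraction f implied by Amdahl's law:
    speedup = 1 / (f + (1 - f) / cores), solved for f."""
    return (cores / speedup - 1) / (cores - 1)

efficiency = parallel_efficiency(100, 128)
serial_fraction = amdahl_serial_fraction(100, 128)
print(f"efficiency = {efficiency:.4f}")
print(f"implied serial fraction = {serial_fraction:.5f}")
```

Under these assumptions, a speedup of 100 on 128 cores corresponds to roughly 78% efficiency.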

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b9f/3976248/2dd2b19bc3bb/pone.0091315.g001.jpg
