A highly efficient multi-core algorithm for clustering extremely large datasets.

Affiliation

Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany.

Publication information

BMC Bioinformatics. 2010 Apr 6;11:169. doi: 10.1186/1471-2105-11-169.

DOI: 10.1186/1471-2105-11-169

PMID: 20370922

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2865495/

Abstract

BACKGROUND

In recent years, the demand for computational power in computational biology has increased because of rapidly growing data sets from microarray and other high-throughput technologies, and this demand is likely to increase further. Standard data analysis algorithms, such as clustering algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols and therefore require multiple connected computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the cores of a single computer.

RESULTS

We introduce a multi-core parallelization of the k-means and k-modes clustering algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis, which employs repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm increased by a factor of 10 for large data sets, while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.
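To illustrate the general idea of running k-means on several cores of one machine, the following is a minimal sketch in Java. It is not the authors' implementation and it does not use transactional memory: as an assumption for illustration, it parallelizes only the assignment step with Java parallel streams, which share the data and centroids in memory across worker threads. The class name `ParallelKMeansSketch` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Illustrative multi-core k-means (Lloyd iterations); not the authors' code. */
final class ParallelKMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs k-means from the given initial centroids and returns the final
     * cluster label of every point. The centroids array is updated in place.
     */
    static int[] cluster(double[][] data, double[][] centroids, int maxIterations) {
        int n = data.length;
        int k = centroids.length;
        int dim = data[0].length;
        int[] labels = new int[n];
        Arrays.fill(labels, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: each point is independent of the others, so the
            // common fork/join pool can spread the work over the available cores.
            int[] updated = IntStream.range(0, n).parallel()
                    .map(i -> nearest(data[i], centroids))
                    .toArray();
            if (Arrays.equals(updated, labels)) {
                break; // no label changed, so the centroids would not move either
            }
            labels = updated;
            // Update step (sequential in this sketch): each centroid becomes
            // the mean of the points currently assigned to it.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += data[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }
}
```

A call such as `ParallelKMeansSketch.cluster(data, initialCentroids, 100)` returns one label per row of `data`. A k-modes variant for categorical data such as SNPs would, under the same structure, replace the squared Euclidean distance with a count of mismatching attributes and the mean with the per-attribute mode.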

CONCLUSIONS

Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, with modern algorithmic concepts, parallelization makes even laborious tasks such as cluster sensitivity analysis and cluster number estimation feasible on a laboratory computer.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d63d/2865495/9ac931d1857e/1471-2105-11-169-1.jpg

Similar articles

ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use.

BMC Bioinformatics. 2008 Apr 16;9:200. doi: 10.1186/1471-2105-9-200.

Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer.

BMC Bioinformatics. 2008 Oct 29;9:462. doi: 10.1186/1471-2105-9-462.

A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets.

Bioinformatics. 2009 May 1;25(9):1152-7. doi: 10.1093/bioinformatics/btp123. Epub 2009 Mar 4.

A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays.

Bioinformatics. 2007 Jun 15;23(12):1459-67. doi: 10.1093/bioinformatics/btm131. Epub 2007 Apr 25.

Stepwise iterative maximum likelihood clustering approach.

BMC Bioinformatics. 2016 Aug 24;17(1):319. doi: 10.1186/s12859-016-1184-5.

Gene microarray data analysis using parallel point-symmetry-based clustering.

Int J Data Min Bioinform. 2015;11(3):277-300. doi: 10.1504/ijdmb.2015.067320.

Graph-based consensus clustering for class discovery from gene expression data.

Bioinformatics. 2007 Nov 1;23(21):2888-96. doi: 10.1093/bioinformatics/btm463. Epub 2007 Sep 14.

A Parallel Architecture for the Partitioning Around Medoids (PAM) Algorithm for Scalable Multi-Core Processor Implementation with Applications in Healthcare.

Sensors (Basel). 2018 Nov 25;18(12):4129. doi: 10.3390/s18124129.

Knowledge-assisted recognition of cluster boundaries in gene expression data.

Artif Intell Med. 2005 Sep-Oct;35(1-2):171-83. doi: 10.1016/j.artmed.2005.02.007.

Cited by

Clustering Algorithms on Low-Power and High-Performance Devices for Edge Computing Environments.

Sensors (Basel). 2021 Aug 10;21(16):5395. doi: 10.3390/s21165395.

Comparative gene-expression profiling of the large cell variant of gastrointestinal marginal-zone B-cell lymphoma.

Sci Rep. 2017 Jul 20;7(1):5963. doi: 10.1038/s41598-017-05116-3.

TraqBio - Flexible Progress Tracking for Core Unit Projects.

PLoS One. 2016 Sep 27;11(9):e0162857. doi: 10.1371/journal.pone.0162857. eCollection 2016.

Scalable linkage-disequilibrium-based selective sweep detection: a performance guide.

Gigascience. 2016 Feb 8;5:7. doi: 10.1186/s13742-016-0114-9. eCollection 2016.

Speeding up the Consensus Clustering methodology for microarray data analysis.

Algorithms Mol Biol. 2011 Jan 14;6(1):1. doi: 10.1186/1748-7188-6-1.
