Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier.

Author information

Kumar Mukesh, Rath Nitish Kumar, Rath Santanu Kumar

Affiliation

Department of Computer Science and Engineering, NIT Rourkela, Orissa 769008, India.

Publication information

J Biomed Inform. 2016 Apr;60:395-409. doi: 10.1016/j.jbi.2016.03.002. Epub 2016 Mar 11.

DOI:10.1016/j.jbi.2016.03.002

PMID:26975600

Abstract

Microarray-based gene expression profiling has emerged as an efficient technique for the classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. These datasets often contain a large amount of expression data, but only a fraction of it comprises genes that are significantly expressed. The precise identification of the genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process: feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is employed to classify the microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis of these MapReduce-based models is carried out using microarray datasets of various dimensions. The results show that these models consume much less execution time than conventional models when processing big data.
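The abstract gives no implementation details, so the following is only a minimal single-machine sketch of how the two-phase pipeline it describes (statistical feature selection, then K-nearest-neighbor classification) might be organized as map and reduce steps. It is plain Python rather than Hadoop. Welch's t-statistic is assumed here as one example of a statistical test, and all function names, class labels, and the toy data are hypothetical.

```python
import heapq
import math
from collections import Counter
from functools import reduce


def t_statistic(values_a, values_b):
    """Welch t-statistic for one gene measured in two classes."""
    na, nb = len(values_a), len(values_b)
    mean_a, mean_b = sum(values_a) / na, sum(values_b) / nb
    var_a = sum((v - mean_a) ** 2 for v in values_a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in values_b) / (nb - 1)
    denom = math.sqrt(var_a / na + var_b / nb)
    return (mean_a - mean_b) / denom if denom > 0 else 0.0


def select_features(samples, labels, top_n):
    """Map: score every gene independently. Reduce: keep the top_n by |t|."""
    first, second = sorted(set(labels))[:2]  # assumes a two-class problem

    def map_gene(g):
        a = [s[g] for s, y in zip(samples, labels) if y == first]
        b = [s[g] for s, y in zip(samples, labels) if y == second]
        return (abs(t_statistic(a, b)), g)

    scored = map(map_gene, range(len(samples[0])))
    return sorted(g for _, g in heapq.nlargest(top_n, scored))


def knn_map(partition, query, k):
    """Map: one training partition emits its k nearest (distance, label) pairs."""
    return heapq.nsmallest(k, [(math.dist(x, query), y) for x, y in partition])


def knn_reduce(left, right, k):
    """Reduce: merge two candidate lists and keep the k nearest overall."""
    return heapq.nsmallest(k, left + right)


def mr_knn_predict(partitions, query, k):
    """Classify query by majority vote among the k globally nearest neighbors."""
    candidates = [knn_map(p, query, k) for p in partitions]
    nearest = reduce(lambda a, b: knn_reduce(a, b, k), candidates)
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (hypothetical): 6 samples x 4 genes, two classes.
samples = [
    [1.0, 5.0, 0.2, 3.0], [1.2, 5.1, 0.1, 9.0], [0.9, 4.9, 0.3, 6.0],
    [4.0, 5.0, 2.1, 3.5], [4.2, 5.2, 2.0, 8.5], [3.9, 4.8, 2.2, 6.5],
]
labels = ["c0", "c0", "c0", "c1", "c1", "c1"]

selected = select_features(samples, labels, top_n=2)
projected = [([s[g] for g in selected], y) for s, y in zip(samples, labels)]
partitions = [projected[0::2], projected[1::2]]  # two simulated data splits
prediction = mr_knn_predict(partitions, query=[4.1, 2.05], k=3)
print(selected, prediction)
```

Each partition returns only its own k best candidates, so the reduce step merges small lists instead of all pairwise distances; this is one common way to structure a distributed KNN, not necessarily the scheme used in the paper.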

Similar articles

Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier.

J Biomed Inform. 2016 Apr;60:395-409. doi: 10.1016/j.jbi.2016.03.002. Epub 2016 Mar 11.

Chaotic genetic algorithm for gene selection and classification problems.

OMICS. 2009 Oct;13(5):407-20. doi: 10.1089/omi.2009.0007.

Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis.

BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S24. doi: 10.1186/1471-2164-9-S2-S24.

Feature selection and nearest centroid classification for protein mass spectrometry.

BMC Bioinformatics. 2005 Mar 23;6:68. doi: 10.1186/1471-2105-6-68.

Quadratic regression analysis for gene discovery and pattern recognition for non-cyclic short time-course microarray experiments.

BMC Bioinformatics. 2005 Apr 25;6:106. doi: 10.1186/1471-2105-6-106.

Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification.

PLoS One. 2015 Mar 30;10(3):e0120364. doi: 10.1371/journal.pone.0120364. eCollection 2015.

Classification of microarray data with factor mixture models.

Bioinformatics. 2006 Jan 15;22(2):202-8. doi: 10.1093/bioinformatics/bti779. Epub 2005 Nov 15.

Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis.

Biosystems. 2006 Sep;85(3):165-76. doi: 10.1016/j.biosystems.2006.01.002. Epub 2006 Feb 21.

A hybrid BPSO-CGA approach for gene selection and classification of microarray data.

J Comput Biol. 2012 Jan;19(1):68-82. doi: 10.1089/cmb.2010.0064. Epub 2011 Jan 6.

A hybrid feature selection method for DNA microarray data.

Comput Biol Med. 2011 Apr;41(4):228-37. doi: 10.1016/j.compbiomed.2011.02.004. Epub 2011 Mar 3.

Cited by

A Dual Level Analysis with Evolutionary Computing and Swarm Models for Classification of Leukemia.

Biomed Res Int. 2022 May 26;2022:2052061. doi: 10.1155/2022/2052061. eCollection 2022.

Inference of Large-scale Time-delayed Gene Regulatory Network with Parallel MapReduce Cloud Platform.

Sci Rep. 2018 Dec 12;8(1):17787. doi: 10.1038/s41598-018-36180-y.