Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data.

Affiliations

Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India.

, Muscat, Oman.

Publication information

Genes Genomics. 2019 Nov;41(11):1301-1313. doi: 10.1007/s13258-019-00859-x. Epub 2019 Aug 19.

DOI:10.1007/s13258-019-00859-x

PMID:31429008

Abstract

BACKGROUND

Data mining techniques are used to extract previously unknown knowledge from large volumes of data. Microarray gene expression (MGE) data plays a major role in predicting the type of cancer. However, because MGE data is large in volume, applying traditional data mining approaches is time-consuming. Parallel programming frameworks such as Hadoop, Spark, and Mahout are therefore needed to ease the computational burden.

OBJECTIVE

Not all gene expressions are needed for prediction, so selecting the important genes is essential for improving classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only the predictive genes in much less time without affecting prediction accuracy.

METHODS

A parallelized hybrid feature selection (HFS) method is proposed for this purpose. The method applies parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The resulting accuracy values are compared with those of the existing rank-weight feature selection and parallelized recursive feature selection methods, and with the values obtained by executing parallelized HFS on DistributedWekaSpark.
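
The two-stage idea described above (a correlation-based subset filter followed by a rank-based cut) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithm's details, so the redundancy filter below is a simplified stand-in for correlation feature subset selection, a thread pool stands in for Spark's parallel map, and every name and value here (`hybrid_select`, `redundancy_cutoff`, `top_k`, the toy genes `g1` to `g4`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def hybrid_select(genes, labels, redundancy_cutoff=0.9, top_k=2, workers=4):
    """Illustrative two-stage selection (assumed design, not the paper's algorithm).

    genes:  dict mapping gene name -> list of expression values, one per sample
    labels: list of 0/1 class labels, one per sample
    """
    names = list(genes)

    # Score each gene against the class label in parallel.
    # The thread pool is only a stand-in for a distributed map on Spark.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda g: abs(pearson(genes[g], labels)), names))
    relevance = dict(zip(names, scores))

    # Stage 1 (simplified correlation-based subset selection): visit genes from
    # most to least relevant and keep a gene only if it is not strongly
    # correlated with a gene that has already been kept.
    subset = []
    for g in sorted(names, key=relevance.get, reverse=True):
        if all(abs(pearson(genes[g], genes[k])) < redundancy_cutoff for k in subset):
            subset.append(g)

    # Stage 2 (rank-based selection): rank the surviving genes by relevance
    # and keep the top_k.
    ranked = sorted(subset, key=relevance.get, reverse=True)
    return ranked[:top_k]


# Toy data: 6 samples, 3 per class. g1 and g2 carry nearly the same signal.
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "g2": [2.1, 2.3, 1.9, 5.8, 6.3, 5.9],
    "g3": [5.0, 3.0, 6.0, 2.0, 4.0, 1.0],
    "g4": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
selected = hybrid_select(genes, labels)
print(selected)
```

On this toy input the redundancy filter keeps only one of the near-duplicate genes `g1` and `g2`, so the final ranking is not filled with copies of the same signal. In a real pipeline the selected genes would then be passed to a classifier and scored on held-out samples.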

RESULTS

The proposed parallelized HFS method achieved classification accuracies of 97% for gastric cancer and 79% for childhood leukemia. It improved classification accuracy by approximately 4% to 15% compared with previous methods.

CONCLUSION

The results show that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer subtypes in less time and with higher accuracy.
