Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India.
, Muscat, Oman.
Genes Genomics. 2019 Nov;41(11):1301-1313. doi: 10.1007/s13258-019-00859-x. Epub 2019 Aug 19.
Data mining techniques are used to extract previously unknown knowledge from large datasets. Microarray gene expression (MGE) data plays a major role in predicting the type of cancer. However, because MGE data is large in volume, applying traditional data mining approaches is time-consuming. Parallel programming frameworks such as Hadoop, Spark and Mahout are therefore needed to ease the computation.
Not all gene expressions are needed for prediction, so selecting the important genes is essential for improving classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only the predictive genes in much less time, without affecting prediction accuracy.
A parallelized hybrid feature selection (HFS) method is proposed for this purpose. The method applies parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The accuracy values obtained are compared with those of the existing rank-weight feature selection and parallelized recursive feature selection methods, and with the values obtained by executing parallelized HFS on DistributedWekaSpark.
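As a rough illustration of the two-stage idea described above (a correlation-based filter followed by a rank-based cut), the following single-machine Python sketch may help. It is not the authors' Spark implementation: the function names `pearson` and `hybrid_select`, the parameters `redundancy_cutoff` and `top_k`, the use of absolute Pearson correlation as both the relevance and the redundancy measure, and the thread pool standing in for a cluster are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def hybrid_select(genes, labels, redundancy_cutoff=0.9, top_k=2):
    """Hypothetical hybrid selection.

    genes:  dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels (one per sample)
    """
    names = list(genes)

    # Score each gene against the class label. Genes are independent of each
    # other here, so the scoring is mapped over a worker pool (assumed stand-in
    # for distributing the work across a cluster).
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda g: abs(pearson(genes[g], labels)), names))
    relevance = dict(zip(names, scores))

    # Stage 1 (correlation filter): visit genes from most to least relevant and
    # drop any gene that is highly correlated with a gene already kept.
    kept = []
    for g in sorted(names, key=lambda name: -relevance[name]):
        if all(abs(pearson(genes[g], genes[k])) < redundancy_cutoff for k in kept):
            kept.append(g)

    # Stage 2 (rank-based cut): `kept` is already ordered by relevance,
    # so the ranking step reduces to taking the first top_k genes.
    return kept[:top_k]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    labels = [0, 0, 0, 1, 1, 1]
    genes = {
        "g1": [1, 2, 1, 8, 9, 8],  # tracks the class closely
        "g2": [1, 2, 2, 8, 9, 8],  # near-duplicate of g1
        "g3": [5, 5, 5, 5, 5, 5],  # constant, carries no information
        "g4": [3, 1, 2, 4, 6, 5],  # weaker but distinct signal
    }
    print(hybrid_select(genes, labels))
```

In this toy setup the near-duplicate gene is discarded by the correlation filter before ranking, which is the kind of redundancy removal a hybrid scheme is meant to provide; a real pipeline would pass the surviving genes to a classifier to measure accuracy.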
The classification accuracy obtained with the proposed parallelized HFS method is 97% for gastric cancer and 79% for childhood leukemia. Compared with previous methods, the proposed parallelized HFS method improved classification accuracy by approximately 4% to 15%.
The results show that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer subtypes in less time and with higher accuracy.