Bose Shilpi, Das Chandra, Banerjee Abhik, Ghosh Kuntal, Chattopadhyay Matangini, Chattopadhyay Samiran, Barik Aishwarya
Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India.
Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India.
PeerJ Comput Sci. 2021 Sep 16;7:e671. doi: 10.7717/peerj-cs.671. eCollection 2021.
Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic in medical applications because it improves the diagnostic accuracy of cancer and helps treatment reach a successful outcome. Hence, machine learning techniques are widely used in cancer detection and prognosis.
This article proposes a new ensemble machine learning classification model, the Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC), which can handle both the class imbalance problem and the high dimensionality of microarray datasets. The model first generates a number of bootstrapped datasets from the original training data, applying an oversampling procedure to each to handle class imbalance. The proposed MFSAC method is then applied to each bootstrapped dataset to generate a sub-dataset that contains a subset of the most relevant, informative attributes of the original dataset. MFSAC is a feature selection technique that combines multiple filters with a new supervised attribute clustering algorithm. A base classifier is then constructed separately for every sub-dataset, and the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. In addition, the most informative attributes are selected as important features based on how frequently they occur across the sub-datasets.
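The overall flow described above (bootstrap, oversample, select attributes, train a base classifier, vote, count attribute frequency) can be illustrated with a minimal sketch. It is not the authors' implementation: the MFSAC step is replaced by a simple assumed relevance score (spread of class means over pooled within-class standard deviation), the base classifier is assumed to be a nearest-centroid rule, and every function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_balanced(X, y, rng):
    """Draw a bootstrap sample, then randomly oversample smaller classes."""
    n = len(X)
    by_class = {}
    for i in (rng.randrange(n) for _ in range(n)):
        by_class.setdefault(y[i], []).append(i)
    for label in sorted(set(y)):  # make sure no class vanished from the sample
        if label not in by_class:
            by_class[label] = [rng.choice([i for i in range(n) if y[i] == label])]
    target = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return [X[i] for i in out], [y[i] for i in out]

def relevance(col, y):
    """Assumed stand-in filter (NOT MFSAC): class-mean spread / pooled within-class std."""
    means, within = [], 0.0
    for c in sorted(set(y)):
        vals = [v for v, t in zip(col, y) if t == c]
        m = sum(vals) / len(vals)
        means.append(m)
        within += sum((v - m) ** 2 for v in vals)
    return (max(means) - min(means)) / ((within / len(col)) ** 0.5 + 1e-9)

def select_features(X, y, k):
    """Return the indices of the k highest-scoring attributes."""
    scores = [relevance([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def fit_centroids(X, y, feats):
    """Assumed base classifier: one centroid per class on the selected attributes."""
    cents = {}
    for c in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return cents

def predict_one(cents, feats, x):
    """Label of the nearest centroid (squared Euclidean distance)."""
    return min(sorted(cents),
               key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c])))

def train_ensemble(X, y, n_models=11, k=2, seed=0):
    """Train one base classifier per balanced bootstrap; count attribute usage."""
    rng = random.Random(seed)
    models, freq = [], Counter()
    for _ in range(n_models):
        Xb, yb = bootstrap_balanced(X, y, rng)
        feats = select_features(Xb, yb, k)
        freq.update(feats)
        models.append((feats, fit_centroids(Xb, yb, feats)))
    return models, freq

def predict(models, x):
    """Combine the base classifiers' predictions by majority vote."""
    votes = Counter(predict_one(cents, feats, x) for feats, cents in models)
    return votes.most_common(1)[0][0]

def make_toy_data():
    """Hypothetical imbalanced data: attributes 0-1 informative, 2-5 uninformative."""
    X, y = [], []
    for i in range(16):
        label = 0 if i < 12 else 1
        signal = 5.0 * label + 0.1 * (i % 3)
        X.append([signal, signal] + [((i * (j + 1)) % 7) / 7.0 for j in range(2, 6)])
        y.append(label)
    return X, y
```

Calling `train_ensemble(*make_toy_data())` returns the list of base models together with a `Counter` of how often each attribute index was selected; attributes with the highest counts play the role of the frequently occurring "important features" mentioned in the text.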
To assess its performance, the proposed MFSAC-EC model is applied to several high-dimensional microarray gene expression datasets for cancer sample classification and compared with well-known existing models. The experimental results show that the generalization performance (testing accuracy) of the proposed classifier is significantly better than that of the other models. The proposed model is also found to identify many important attributes, that is, biomarker genes.