Mahfouz Mohamed A, Shoukry Amin, Ismail Mohamed A
Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt.
Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt; Computer Science and Engineering Dept., Egypt Japan University of Science and Technology, Alexandria, Egypt.
Artif Intell Med. 2021 Jan;111:101985. doi: 10.1016/j.artmed.2020.101985. Epub 2020 Nov 8.
In the microarray-based approach to automated cancer diagnosis, the traditional k-nearest neighbors (kNN) algorithm suffers from several difficulties: the large number of genes (high dimensionality of the feature space), many of which are irrelevant (noise), relative to the small number of available samples, and the imbalance in the number of samples across the target classes. This research provides an ensemble classifier, based on decision models derived from kNN, that is applicable to problems characterized by small, imbalanced datasets. The proposed classification method is an ensemble of the traditional kNN algorithm and four novel classification models derived from it. The proposed models exploit the increase in density and connectivity using a K-nearest neighbors table (KNN-table) created during the training phase. In the density model, an unseen sample u is classified as belonging to the class t whose density increases the most when u is added to it, i.e. the class for which u can replace more neighbors in the KNN-table entries of its samples than for any other class. In the other three connectivity models, the mean and standard deviation of the distribution of the average, minimum, and maximum distance to the K neighbors of the members of each class are computed in the training phase. The class t to whose distribution u most plausibly belongs is chosen, i.e. the class for which adding u produces the least change to the distribution of the corresponding decision model. Combining the predictions of the four individual models with those of traditional kNN makes the decision space more discriminative. With the help of the KNN-table, which can be updated online in the training phase, improved performance is achieved compared to the traditional kNN algorithm, with a slight increase in classification time.
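The density and connectivity models described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not fix the details, so the code assumes that neighbors are searched within each class, that the density score is the fraction of class members whose K-th neighbor the unseen sample would displace, and that "least change to the distribution" is approximated by the smallest absolute z-score. The function names (`knn_table`, `fit`, `density_predict`, `connectivity_predict`) are hypothetical, and the combination of the models with traditional kNN is not shown.

```python
import numpy as np

def knn_table(X, k):
    """Sorted distances from each row of X to its k nearest other rows."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    return np.sort(d, axis=1)[:, :k]

def fit(X, y, k):
    """Build one KNN-table per class (assumption: within-class neighbors,
    so every class needs more than k samples)."""
    model = {}
    for t in np.unique(y):
        Xt = X[y == t]
        model[t] = {"X": Xt, "table": knn_table(Xt, k)}
    return model

def density_predict(model, u):
    """Pick the class in which u would enter the most neighbor lists."""
    scores = {}
    for t, m in model.items():
        d = np.linalg.norm(m["X"] - u, axis=1)
        # u displaces the current k-th neighbor of a member if it is closer.
        # Dividing by class size (via mean) is an assumption made here to
        # compensate for class imbalance.
        scores[t] = np.mean(d < m["table"][:, -1])
    return max(scores, key=scores.get)

def connectivity_predict(model, u, stat=np.mean):
    """Pick the class whose distribution of `stat` over neighbor distances
    fits u best. Use stat=np.mean, np.min or np.max for the three models."""
    scores = {}
    for t, m in model.items():
        k = m["table"].shape[1]
        per_member = stat(m["table"], axis=1)        # training-phase statistic
        mu, sigma = per_member.mean(), per_member.std()
        d_u = np.sort(np.linalg.norm(m["X"] - u, axis=1))[:k]
        # Assumption: a small |z| stands in for "least change to the distribution".
        scores[t] = abs(stat(d_u) - mu) / (sigma + 1e-12)
    return min(scores, key=scores.get)
```

The z-score degenerates when all members of a class have identical neighbor distances (zero standard deviation), so a practical version would need a more careful fallback than the small constant used here.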
The proposed ensemble method achieves a significant increase in accuracy over any of its base classifiers on the Kentridge, GDS3257, Notterman, Leukemia, and CNS datasets. The method is also compared to several existing ensemble methods and state-of-the-art techniques, using different dimensionality reduction techniques on several standard datasets. The results show clear superiority of the proposed ensemble (EKNN) over several individual and ensemble classifiers, regardless of the gene selection strategy.