Arafa Ahmed, El-Fishawy Nawal, Badawy Mohammed, Radad Marwa
Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt.
J Biol Eng. 2023 Jan 30;17(1):7. doi: 10.1186/s13036-022-00319-3.
In the current genomic era, gene expression datasets have become one of the main tools used in cancer classification. The curse of dimensionality and class imbalance are both inherent to these datasets, and both degrade the performance of most classifiers applied to genomic data.
This paper introduces the Reduced Noise-Autoencoder (RN-Autoencoder) for pre-processing imbalanced genomic datasets for precise cancer classification. First, RN-Autoencoder addresses the curse of dimensionality by using an autoencoder for feature reduction, generating new extracted data with lower dimensionality. Next, it passes the extracted data to the Reduced Noise-Synthetic Minority Over-sampling Technique (RN-SMOTE), which efficiently handles the class imbalance in the extracted data. RN-Autoencoder was evaluated using different classifiers and various imbalanced datasets with different imbalance ratios. The results showed that classifier performance with RN-Autoencoder improved over performance with both the original data and the extracted data, by margins that depend on the classifier, dataset and evaluation metric. Compared with the current state of the art, RN-Autoencoder increased test accuracy by up to 18.017%, 19.183%, 18.58% and 8.87% on the colon, leukemia, Diffuse Large B-Cell Lymphoma (DLBCL) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, respectively.
RN-Autoencoder is a model for cancer classification using imbalanced gene expression datasets. It uses an autoencoder to reduce the high dimensionality of the gene expression datasets and then handles the class imbalance using RN-SMOTE. RN-Autoencoder was evaluated using many classifiers and many imbalanced datasets. The performance of many classifiers improved, and some classified cancer with a score of 100% on every metric used. In addition, RN-Autoencoder outperformed many recent works on the same datasets.
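The oversampling stage can be illustrated with a minimal sketch, which is not the authors' implementation. It shows only the plain SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The noise-reduction step that distinguishes RN-SMOTE and the autoencoder itself are not detailed in this abstract and are omitted. The function names, the parameter `k`, and the toy 2-D points standing in for autoencoder output are all hypothetical.

```python
import random


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / min(counts.values())


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    This is the plain SMOTE idea only (assumption: no noise filtering)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: squared_distance(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D codes standing in for autoencoder output (not real data).
majority = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
            (0.3, 0.3), (0.2, 0.4), (0.1, 0.0)]
minority = [(0.9, 0.8), (1.0, 1.0), (0.8, 1.1)]

labels = [0] * len(majority) + [1] * len(minority)
new_points = smote_like_oversample(minority, len(majority) - len(minority))
balanced_labels = labels + [1] * len(new_points)

print(imbalance_ratio(labels))           # ratio before oversampling
print(imbalance_ratio(balanced_labels))  # ratio after oversampling
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies. The sketch balances the classes by generating exactly as many points as the gap between the two class counts.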