Performance Analysis of Ovarian Cancer Detection and Classification for Microarray Gene Data.

Affiliation

Bannari Amman Institute of Technology, India.

Publication information

Biomed Res Int. 2022 Jul 15;2022:6750457. doi: 10.1155/2022/6750457. eCollection 2022.

DOI:10.1155/2022/6750457

PMID:35872866

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9307352/

Abstract

Ovarian cancer is the most common gynecologic cancer after cervical and uterine cancer, and it is a serious health concern for women. It arises when abnormal cells form and spread through the body. Ovarian cancer microarray data can be used for diagnosis and prognosis. Such data typically contain tens of thousands of genes, so selecting the most critical genes or attributes from the entire dataset is necessary to reduce computational complexity. Because microarray datasets have few samples and many features, classifier performance suffers. Dimensionality reduction is therefore essential to retain the genes that matter for disease classification. In this research, the ANOVA method is first used for gene selection, and then two clustering-based and three transform-based feature extraction methods, namely Fuzzy C Means, Softmax Discriminant Algorithm (SDA), Hilbert Transform, Fast Fourier Transform (FFT), and Discrete Cosine Transform (DCT), are used to further reduce the selected genes. Six classifiers then classify the features as normal or abnormal. The NLR classifier gives the highest accuracy, 92%, for SDA features, and KNN gives the lowest accuracy, 55%, for SDA, Hilbert, and DCT features. With correlation distance feature selection, the NLR classifier attains the lowest accuracy of 53%, and the GMM classifier obtains the highest accuracy of 88%.
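The three-stage pipeline described in the abstract (ANOVA gene selection, transform-based feature extraction, classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, and the function names, the train/test split, the number of selected genes, the number of DCT coefficients kept, and the choice of k = 3 for KNN are all assumptions made for illustration. Only the standard library is used, so the DCT is computed directly from its definition rather than with a fast algorithm.

```python
import math
import random

def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single gene across the class groups."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    ss_between = sum(len(g) * (means[y] - grand) ** 2 for y, g in groups.items())
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_genes(X, labels, top):
    """Indices of the `top` genes with the largest F-statistic."""
    scores = [anova_f([row[j] for row in X], labels) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top]

def dct2(x, keep):
    """First `keep` coefficients of the unnormalised DCT-II of x."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(keep)]

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)

# Synthetic stand-in for microarray data (an assumption, not the paper's dataset):
# 24 samples, 40 genes, and only genes 0-4 differ between the two classes.
rng = random.Random(0)
labels = [0] * 12 + [1] * 12
X = [[rng.gauss(0, 1) + (10 if y == 1 and j < 5 else 0) for j in range(40)]
     for y in labels]

# Hold out every fourth sample; select genes on the training part only.
train_idx = [i for i in range(len(X)) if i % 4 != 0]
test_idx = [i for i in range(len(X)) if i % 4 == 0]
train_y = [labels[i] for i in train_idx]
genes = select_genes([X[i] for i in train_idx], train_y, top=5)

def features(row):
    """DCT features computed from the ANOVA-selected genes of one sample."""
    return dct2([row[j] for j in genes], keep=3)

train_F = [features(X[i]) for i in train_idx]
preds = [knn_predict(train_F, train_y, features(X[i])) for i in test_idx]
accuracy = sum(p == labels[i] for p, i in zip(preds, test_idx)) / len(test_idx)
print("selected genes:", sorted(genes), "held-out accuracy:", accuracy)
```

The synthetic classes are separated far more cleanly than real expression data would be, so the accuracy printed here says nothing about the figures reported in the abstract. Selecting genes on the training samples only is a design choice in this sketch to avoid leaking test labels into the selection step.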
