Rabiei Niloofar, Soltanian Ali Reza, Farhadian Maryam, Bahreini Fatemeh
Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.
Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran.
Cell J. 2023 May 28;25(5):347-353. doi: 10.22074/cellj.2023.1971852.1156.
In microarray datasets, hundreds to thousands of genes are measured in a small number of samples, and problems during the experiment sometimes cause the expression values of some genes to be recorded as missing. Determining which genes cause a disease such as cancer among so many candidates is difficult. This study aimed to identify genes associated with pancreatic cancer (PC). First, K-nearest neighbor (KNN) imputation was used to fill in missing values (MVs) of gene expression. Then, the random forest algorithm was used to identify the genes associated with PC.
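The KNN imputation step can be illustrated with a minimal sketch. The abstract does not state the number of neighbors, the distance measure, or the software used, so the function name `knn_impute`, the default `k=3`, the root-mean-square distance over shared columns, and the toy matrix are all assumptions made for illustration.

```python
import math


def knn_impute(matrix, k=3):
    """Fill None entries in a genes x samples matrix (hypothetical helper).

    Each missing value is replaced by the mean of that sample's values in
    the k nearest gene rows that have the value observed. The input is not
    modified; a new matrix is returned.
    """
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        missing = [j for j, value in enumerate(row) if value is None]
        if not missing:
            continue
        # Distance to every other gene, using only columns observed in both rows.
        neighbours = []
        for m, other in enumerate(matrix):
            if m == i:
                continue
            shared = [j for j in range(len(row))
                      if row[j] is not None and other[j] is not None]
            if not shared:
                continue
            dist = math.sqrt(
                sum((row[j] - other[j]) ** 2 for j in shared) / len(shared))
            neighbours.append((dist, m))
        neighbours.sort()
        for j in missing:
            # Take the k closest genes that actually have a value in column j.
            donors = [matrix[m][j] for _, m in neighbours
                      if matrix[m][j] is not None][:k]
            if donors:  # stays None if no gene can donate a value
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


# Toy data (invented): rows are genes, columns are samples.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(toy, k=2)
print(filled[0])
```

With `k=2`, the missing entry in the first gene is filled from the two genes whose observed profiles are closest to it; the distant fourth gene does not contribute.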
In this retrospective study, 24 samples from the GSE14245 dataset were examined. Twelve samples were from patients with PC, and 12 were from healthy controls. After preprocessing and applying the fold-change technique, 29482 genes were used. We applied KNN imputation whenever a gene had MVs. Then, the genes most strongly associated with PC were selected using the random forest algorithm. We classified the dataset using support vector machine (SVM) and naïve Bayes (NB) classifiers and reported the F-score and Jaccard index.
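A fold-change filter of the kind mentioned above might look like the following sketch. The abstract does not define how fold change was computed, so this assumes a ratio of group means on a linear, strictly positive scale and keeps genes that change by more than the threshold in either direction. The names `fold_change` and `select_by_fold_change`, the gene labels, and the values are hypothetical.

```python
from statistics import mean


def fold_change(case_values, control_values):
    """Ratio of mean expression in cases to mean expression in controls.

    Assumes linear-scale, strictly positive expression values.
    """
    return mean(case_values) / mean(control_values)


def select_by_fold_change(expression, case_idx, control_idx, threshold=3.0):
    """Return genes whose fold change exceeds the threshold in either direction."""
    selected = []
    for gene, values in expression.items():
        fc = fold_change([values[i] for i in case_idx],
                         [values[i] for i in control_idx])
        # Up-regulated (fc > threshold) or down-regulated (fc < 1/threshold).
        if fc > threshold or fc < 1.0 / threshold:
            selected.append(gene)
    return selected


# Invented example: two case samples (columns 0-1), two controls (columns 2-3).
expression = {
    "gene_a": [8.0, 8.0, 2.0, 2.0],   # higher in cases
    "gene_b": [5.0, 5.0, 4.0, 6.0],   # unchanged
    "gene_c": [1.0, 1.0, 4.0, 4.0],   # lower in cases
}
print(select_by_fold_change(expression, case_idx=[0, 1], control_idx=[2, 3]))
```

If the data were log2-transformed, the ratio would instead be a difference of means compared against log2 of the threshold; which convention the study used is not stated.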
Out of the 29482 genes, 1185 genes with fold-changes greater than 3 were selected. From these, the 21 genes with the highest importance values were identified. and had the highest and lowest importance values, respectively. The F-score and Jaccard values of the SVM and NB classifiers were 95.5, 93, 92, and 92 percent, respectively.
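For readers unfamiliar with the two reported metrics, the sketch below computes the F-score (taken here to be F1, which is an assumption) and the Jaccard index for a binary classifier from true and predicted labels. The labels are invented to mirror the 12-versus-12 design and do not reproduce the study's predictions or results; `f_score_and_jaccard` is a hypothetical helper name.

```python
def f_score_and_jaccard(y_true, y_pred, positive=1):
    """Return (F1 score, Jaccard index) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # F1 = 2TP / (2TP + FP + FN); Jaccard = TP / (TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return f1, jaccard


# Invented labels: 12 PC samples (1) followed by 12 controls (0),
# with one false negative and one false positive.
y_true = [1] * 12 + [0] * 12
y_pred = [1] * 11 + [0] + [1] + [0] * 11
f1, jaccard = f_score_and_jaccard(y_true, y_pred)
print(round(f1, 3), round(jaccard, 3))
```

The two metrics are related by J = F1 / (2 - F1), so the Jaccard index is never larger than the F1 score computed for the same class.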
This study combined the fold-change technique, an imputation method, and the random forest algorithm, and it identified PC-associated genes that many previous studies had not reported. We therefore suggest that researchers use the random forest algorithm to detect genes related to a disease of interest.