The Performance Evaluation of The Random Forest Algorithm for A Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study.

Authors

Rabiei Niloofar, Soltanian Ali Reza, Farhadian Maryam, Bahreini Fatemeh

Affiliations

Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.

Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran.

Publication information

Cell J. 2023 May 28;25(5):347-353. doi: 10.22074/cellj.2023.1971852.1156.

DOI:10.22074/cellj.2023.1971852.1156

PMID:37300296

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10257059/

Abstract

OBJECTIVE

In microarray datasets, hundreds to thousands of genes are measured in a small number of samples, and problems during the experiment sometimes cause the expression values of some genes to be recorded as missing. Identifying the genes that drive a disease or cancer among so many candidates is difficult. This study aimed to find genes associated with pancreatic cancer (PC). First, the K-nearest neighbor (KNN) imputation method was used to handle missing values (MVs) in gene expression. Then, the random forest algorithm was used to identify the genes associated with PC.
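
To make the imputation step concrete, the following is a minimal sketch of KNN imputation on a genes-by-samples matrix. It is not the authors' implementation: the function name `knn_impute`, the choice of genes (rather than samples) as neighbors, the root-mean-square distance over shared samples, and the toy matrix are all assumptions made for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries in a genes x samples matrix using the k nearest genes.

    Assumption of this sketch: the distance between two genes is the
    root-mean-square difference over the samples where both are observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [row[:] for row in matrix]  # copy so the input is left untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes that were observed in sample j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] < math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                # Impute with the mean of the nearest genes' values in sample j.
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (hypothetical): 4 genes x 3 samples, one missing value.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(toy, k=2)
```

In this toy case the missing entry is filled with the average of the two genes whose observed profiles match the incomplete gene most closely, while the distant fourth gene is ignored.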

MATERIALS AND METHODS

In this retrospective study, 24 samples from the GSE14245 dataset were examined. Twelve samples were from patients with PC, and 12 were from healthy controls. After preprocessing and applying the fold-change technique, 29482 genes were used. We used the KNN imputation method to fill in MVs for any gene that had them. Then, the genes most strongly associated with PC were selected using the random forest algorithm. We classified the dataset using support vector machine (SVM) and naïve Bayes (NB) classifiers and reported the F-score and Jaccard index.
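
The fold-change filtering step can be sketched as follows. The abstract does not state how fold change was computed, so this example assumes a ratio of group means on the non-log scale and keeps genes that change in either direction; the function name `fold_change_filter`, the gene names, and the expression values are hypothetical.

```python
from statistics import mean


def fold_change_filter(expression, case_idx, control_idx, threshold=3.0):
    """Return {gene: fold change} for genes whose fold change exceeds threshold.

    expression maps a gene name to a list of positive, non-log expression values.
    Assumption of this sketch: fold change is the larger of the case/control and
    control/case mean ratios, so up- and down-regulated genes are both kept.
    """
    selected = {}
    for gene, values in expression.items():
        case_mean = mean(values[i] for i in case_idx)
        control_mean = mean(values[i] for i in control_idx)
        ratio = case_mean / control_mean
        fold = max(ratio, 1.0 / ratio)
        if fold > threshold:
            selected[gene] = fold
    return selected


# Toy data (hypothetical): samples 0-1 are cases, samples 2-3 are controls.
toy_expression = {
    "gene_a": [8.0, 8.0, 2.0, 2.0],  # 4-fold higher in cases
    "gene_b": [5.0, 5.0, 4.0, 6.0],  # unchanged between groups
    "gene_c": [1.0, 1.0, 5.0, 5.0],  # 5-fold lower in cases
}
kept = fold_change_filter(toy_expression, case_idx=[0, 1], control_idx=[2, 3])
```

With a threshold of 3, only `gene_a` and `gene_c` pass the filter here. A filter of this kind reduces the candidate set before a feature-selection method such as random forest is applied.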

RESULTS

Out of the 29482 genes, 1185 genes with fold-changes greater than 3 were selected. From these, the 21 genes with the highest importance values were identified. [gene name missing] and [gene name missing] had the highest and lowest importance values, respectively. The F-score and Jaccard values of the SVM and NB classifiers were 95.5%, 93%, 92%, and 92%, respectively.
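
For readers unfamiliar with the two reported metrics, here is a short sketch of how an F-score and a Jaccard index are computed for a binary classifier. The function name `f_score_and_jaccard` and the labels are hypothetical and do not reproduce the study's data or results.

```python
def f_score_and_jaccard(y_true, y_pred, positive=1):
    """Compute the F-score (F1) and Jaccard index for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # F1 = 2TP / (2TP + FP + FN); Jaccard = TP / (TP + FP + FN).
    f_denominator = 2 * tp + fp + fn
    j_denominator = tp + fp + fn
    f_score = 2 * tp / f_denominator if f_denominator else 0.0
    jaccard = tp / j_denominator if j_denominator else 0.0
    return f_score, jaccard


# Hypothetical labels: 1 = pancreatic cancer, 0 = healthy control.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
f_score, jaccard = f_score_and_jaccard(y_true, y_pred)
```

Both metrics ignore true negatives. For the same predictions the Jaccard index is never larger than the F-score, since J = F / (2 - F).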

CONCLUSION

By combining the fold-change technique, KNN imputation, and the random forest algorithm, this study identified strongly associated genes that many previous studies had not reported. We therefore suggest that researchers use the random forest algorithm to detect genes related to a disease of interest.

