The Performance Evaluation of The Random Forest Algorithm for A Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study.

Authors

Rabiei Niloofar, Soltanian Ali Reza, Farhadian Maryam, Bahreini Fatemeh

Affiliations

Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.

Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran.

Publication information

Cell J. 2023 May 28;25(5):347-353. doi: 10.22074/cellj.2023.1971852.1156.

DOI:10.22074/cellj.2023.1971852.1156
PMID:37300296
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10257059/
Abstract

OBJECTIVE

In microarray datasets, hundreds or thousands of genes are measured in a small number of samples, and problems during the experiment sometimes cause the expression values of some genes to be recorded as missing. Identifying the genes that cause a disease such as cancer among so many candidates is difficult. This study aimed to find genes relevant to pancreatic cancer (PC). First, K-nearest neighbor (KNN) imputation was used to fill in the missing values (MVs) of gene expression. Then, the random forest algorithm was used to identify the genes associated with PC.
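The imputation step can be illustrated with a minimal sketch. The abstract does not state the value of k, the distance measure, or the software used, so the function name `knn_impute`, the choice `k=2`, the root-mean-square distance, and the toy matrix below are all assumptions made for illustration. Rows stand for genes, columns for samples, and `None` marks a missing value.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries using the mean of the k most similar rows (genes)."""
    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows with an observed value in column j.
            candidates = [
                (distance(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Hypothetical toy data: 4 genes measured in 3 samples, one value missing.
expression = [
    [1.0, 2.0, 3.0],
    [1.0, None, 3.0],
    [1.0, 4.0, 3.0],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(expression, k=2)
print(filled[1])  # [1.0, 3.0, 3.0]
```

The missing entry is replaced by the average of the two genes whose observed profile is closest, which here are the first and third rows.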

MATERIALS AND METHODS

In this retrospective study, 24 samples from the GSE14245 dataset were examined: 12 from patients with PC and 12 from healthy controls. After preprocessing and applying the fold-change technique, 29482 genes were used. KNN imputation was applied to any gene that had MVs. The genes most strongly associated with PC were then selected using the random forest algorithm. The dataset was classified with support vector machine (SVM) and naïve Bayes (NB) classifiers, and the F-score and Jaccard index were reported.
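To make the gene-ranking idea concrete, here is a deliberately simplified sketch of impurity-based importance. It is not the authors' implementation: the abstract gives no forest settings, so this version swaps full decision trees for depth-one trees (stumps), each fitted on a bootstrap sample with a random subset of genes, and sums the Gini impurity decrease per gene. All names, the tree count, and the toy data are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(X, y, features):
    """Return (impurity decrease, feature) for the best single split."""
    n = len(y)
    parent = gini(y)
    best = (0.0, None)
    for f in features:
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[0]:
                best = (gain, f)
    return best

def stump_forest_importance(X, y, n_trees=200, seed=0):
    """Normalised impurity-decrease importance from a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, int(p ** 0.5))  # number of genes tried per tree
    importance = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        gain, f = best_stump(Xb, yb, rng.sample(range(p), m))
        if f is not None:
            importance[f] += gain
    total = sum(importance)
    return [v / total for v in importance] if total else importance

# Hypothetical toy data: 8 samples x 4 genes; only gene 0 separates the classes.
X = [
    [1.0, 2, 7, 4], [1.2, 3, 7, 5], [0.9, 2, 8, 5], [1.1, 3, 8, 4],
    [5.0, 2, 7, 4], [5.2, 3, 7, 5], [4.9, 2, 8, 5], [5.1, 3, 8, 4],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importance = stump_forest_importance(X, y)
ranking = sorted(range(len(importance)), key=lambda g: importance[g], reverse=True)
print(ranking[0])  # index of the top-ranked gene
```

Selecting the top-ranked entries of `ranking` mirrors the idea of keeping the genes with the highest importance values; a full random forest would grow deeper trees and accumulate the decrease at every split.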

RESULTS

Out of the 29482 genes, 1185 genes with fold-changes greater than 3 were selected. From these, the 21 genes with the highest importance values were identified. and had the highest and lowest importance values, respectively. The F-score and Jaccard value of the SVM and NB classifiers were 95.5, 93, 92, and 92 percent, respectively.
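For reference, the two reported indices can be computed from true and predicted labels as in the sketch below. It uses the standard binary definitions, F = 2TP / (2TP + FP + FN) and Jaccard = TP / (TP + FP + FN); the abstract does not say how the authors averaged or validated their scores, and the labels here are made up for illustration rather than taken from the study.

```python
def f_score_and_jaccard(y_true, y_pred, positive=1):
    """Binary F-score and Jaccard index for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    f_denominator = 2 * tp + fp + fn
    j_denominator = tp + fp + fn
    f_score = 2 * tp / f_denominator if f_denominator else 0.0
    jaccard = tp / j_denominator if j_denominator else 0.0
    return f_score, jaccard

# Hypothetical predictions for 12 cases (1) and 12 controls (0):
# one case is missed and one control is flagged by mistake.
y_true = [1] * 12 + [0] * 12
y_pred = [1] * 11 + [0] + [1] + [0] * 11
f_score, jaccard = f_score_and_jaccard(y_true, y_pred)
print(round(f_score, 3), round(jaccard, 3))  # 0.917 0.846
```

Under these definitions the two indices are tied together by Jaccard = F / (2 - F), so the Jaccard index is never larger than the F-score for the same predictions.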

CONCLUSION

By combining the fold-change technique, an imputation method, and the random forest algorithm, this study found strongly associated genes that many previous studies had not identified. We therefore suggest that researchers use the random forest algorithm to detect genes related to the disease of interest.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a41d/10257059/20b3b05d7eda/Cell-J-25-347-g04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a41d/10257059/361d5b165f6f/Cell-J-25-347-g01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a41d/10257059/a5f25cabb59e/Cell-J-25-347-g02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a41d/10257059/ef93d557c672/Cell-J-25-347-g03.jpg
