Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Affiliation

Radboud University of Nijmegen, the Netherlands.

Publication information

Brief Bioinform. 2013 May;14(3):315-26. doi: 10.1093/bib/bbs034. Epub 2012 Jul 10.

DOI: 10.1093/bib/bbs034
PMID: 22786785
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3659301/
Abstract

In the Life Sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled; that is, sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets.
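The ideas in the abstract (training a classifier on variables such as SNPs, reading off variable importance, and conditional relationships that matter only for a subset of a class) can be illustrated with a small sketch. The code below is not taken from the review. It is a minimal pure-Python random forest written for illustration, run on hypothetical synthetic data in which "patients" fall into two invented subtypes: subtype A carries SNP0 and SNP1 together, subtype B carries SNP2 and SNP3 together, and SNP4 to SNP7 are noise. The function names, the labelling rule, and all parameter values are assumptions; real analyses would normally use an established RF implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, rng, mtry, depth=0, max_depth=8):
    """Grow one tree on binary variables; each node tries only `mtry` random variables."""
    if depth == max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]           # leaf: majority class
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        left = [i for i, row in enumerate(X) if row[j] == 0]
        right = [i for i, row in enumerate(X) if row[j] == 1]
        if not left or not right:
            continue                                      # variable cannot split this node
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(y)
        if best is None or score < best[0]:
            best = (score, j, left, right)
    if best is None:
        return Counter(y).most_common(1)[0][0]
    _, j, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, mtry, depth + 1, max_depth)
    return (j, grow(left), grow(right))                   # internal node

def predict_tree(tree, row):
    while isinstance(tree, tuple):                        # descend until a leaf label
        j, low, high = tree
        tree = high if row[j] else low
    return tree

def fit_forest(X, y, n_trees=50, mtry=3, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, mtry))
    return forest

def predict(forest, row):
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

def accuracy(forest, X, y):
    return sum(predict(forest, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(forest, X, y, j, rng):
    """Drop in accuracy after shuffling variable j across samples."""
    col = [r[j] for r in X]
    rng.shuffle(col)
    X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
    return accuracy(forest, X, y) - accuracy(forest, X_perm, y)

def make_data(n, rng, n_vars=8):
    """Hypothetical rule: patient = (SNP0 and SNP1) or (SNP2 and SNP3)."""
    X, y = [], []
    for _ in range(n):
        row = [rng.randint(0, 1) for _ in range(n_vars)]
        X.append(row)
        y.append(int((row[0] and row[1]) or (row[2] and row[3])))
    return X, y

data_rng = random.Random(42)
X_train, y_train = make_data(400, data_rng)
X_test, y_test = make_data(200, data_rng)

forest = fit_forest(X_train, y_train)
acc = accuracy(forest, X_test, y_test)
importances = [permutation_importance(forest, X_test, y_test, j, random.Random(j))
               for j in range(8)]

print(f"held-out accuracy: {acc:.2f}")
for j, imp in enumerate(importances):
    print(f"SNP{j}: permutation importance {imp:+.3f}")
```

In this toy setting each of SNP0 to SNP3 matters only for one of the two invented subtypes, yet all four should score clearly above the noise variables, because the trees split on them conditionally along different branches. Permutation importance on held-out samples is used here for simplicity; it is one of several importance measures and is not claimed to be the one the review recommends.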

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc22/3659301/a6192a900060/bbs034f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dc22/3659301/4660b4ca3a65/bbs034f1.jpg
