A random forest based biomarker discovery and power analysis framework for diagnostics research.

Affiliations

College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK.

Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK.

Publication information

BMC Med Genomics. 2020 Nov 23;13(1):178. doi: 10.1186/s12920-020-00826-6.

DOI:10.1186/s12920-020-00826-6

PMID:33228632

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7685541/

Abstract

BACKGROUND

Biomarker identification is one of the major goals of functional genomics and translational medicine studies. Large-scale -omics data are increasingly being accumulated and can provide a vital means of identifying biomarkers for the early diagnosis of complex disease and/or for advanced patient/disease stratification. These tasks are clearly interlinked, and it is essential that an unbiased and stable methodology is applied to address them. Although many biomarker identification approaches, primarily machine learning based, have recently been developed, exploring the potential associations between biomarker identification and the design of future experiments remains a challenge.

METHODS

In this study, using both simulated and published experimentally derived datasets, we assessed the performance of several state-of-the-art Random Forest (RF) based decision approaches: the Boruta method, permutation-based feature selection without correction, permutation-based feature selection with correction, and backward elimination based feature selection. We also conducted a power analysis to estimate the number of samples required for potential future studies.
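The abstract does not spell out how the two permutation-based variants differ. As a minimal sketch, assuming that "without correction" means raw empirical p-values and "with correction" means a multiple-testing adjustment (Benjamini-Hochberg is used here purely for illustration), the selection step could look like the following. Fitting the forest is omitted: the importance values are assumed to come from any RF implementation, and all function names are hypothetical.

```python
from typing import List, Sequence


def empirical_p_values(observed: Sequence[float],
                       null: Sequence[Sequence[float]]) -> List[float]:
    """Empirical p-value per feature.

    observed[j] is the importance of feature j on the real labels;
    null[k][j] is its importance when the forest is refit on the k-th
    label permutation. The +1 terms keep p-values strictly above zero.
    """
    n_perm = len(null)
    p_values = []
    for j, obs in enumerate(observed):
        exceed = sum(1 for k in range(n_perm) if null[k][j] >= obs)
        p_values.append((exceed + 1) / (n_perm + 1))
    return p_values


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Indices of features whose p-value falls below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]
```

Calling `select(empirical_p_values(obs, null))` would correspond to the uncorrected variant under these assumptions, and `select(benjamini_hochberg(empirical_p_values(obs, null)))` to the corrected one. The corrected variant can only select the same features or fewer, which is consistent with the raw approach returning the larger feature set.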

RESULTS

We present a number of RF-based stable feature selection methods and compare their performance using simulated as well as published, experimentally derived datasets. Across all scenarios considered, we found the Boruta method to be the most stable, whilst the Permutation (Raw) approach offered the largest number of relevant features when allowed to stabilise over a number of iterations. Finally, we developed and made available a web interface (https://joelarkman.shinyapps.io/PowerTools/) to streamline power calculations, thereby aiding the design of potential future studies within a translational medicine context.
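To give a sense of what a power calculation for a follow-up study involves, here is a minimal sketch using the textbook normal-approximation formula for a two-sided, two-sample comparison of means. This is an assumption chosen for illustration and is not necessarily the calculation PowerTools performs; the function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05,
                      power: float = 0.8) -> int:
    """Per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardised effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = z.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)
```

Under these assumptions, a candidate biomarker with a moderate standardised effect (d = 0.5) needs about 63 samples per group at 80% power, whereas a large effect (d = 1.0) needs about 16. The required sample size scales with 1/d², so the weakest biomarker in a panel drives the cost of the study.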

CONCLUSIONS

We developed an RF-based biomarker discovery framework and provide a web interface for it, termed PowerTools, that supports the design of appropriate and cost-effective future omics studies.
