Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data.

Affiliations

Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Heidelberg, Germany.

Department of Neuroradiology, University Medical Center, Medical Faculty Mannheim of Heidelberg University, Mannheim, Germany.

Publication information

Nat Protoc. 2020 Feb;15(2):479-512. doi: 10.1038/s41596-019-0251-6. Epub 2020 Jan 13.

DOI: 10.1038/s41596-019-0251-6

PMID: 31932775

Abstract

DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
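
To make the calibration step named in the abstract more concrete, the following is a minimal Python sketch of Platt scaling: a logistic curve is fitted to raw classifier scores so that the outputs can be read as probabilities. It is not the authors' R code, which the abstract says is available on GitHub. The function names (`fit_platt`, `apply_platt`), the gradient-descent settings, and the toy scores and labels are all assumptions made for illustration. The sketch covers only the binary, unpenalized LR case, not Firth's penalized LR or the ridge-penalized multinomial calibrator (MR).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Starts from a=1, b=0, i.e. the uncalibrated sigmoid of the raw score.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted a and b."""
    return [sigmoid(a * s + b) for s in scores]


# Hypothetical held-out decision values (e.g. from an SVM) and true labels.
scores = [-3.0, -2.0, -1.5, -0.5, 0.2, 0.4, 1.0, 1.8, 2.5, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
raw = [sigmoid(s) for s in scores]
calibrated = apply_platt(scores, a, b)
print(f"log loss raw={log_loss(raw, labels):.3f} "
      f"calibrated={log_loss(calibrated, labels):.3f}")
```

In practice a calibrator of this kind is usually fitted on scores for samples the primary classifier did not train on, which is what a nested cross-validation scheme provides.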
