Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Heidelberg, Germany.
Department of Neuroradiology, University Medical Center, Medical Faculty Mannheim of Heidelberg University, Mannheim, Germany.
Nat Protoc. 2020 Feb;15(2):479-512. doi: 10.1038/s41596-019-0251-6. Epub 2020 Jan 13.
DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
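To make the calibration step concrete, the following is a minimal sketch of Platt scaling for a single one-vs-rest class: a logistic regression with one slope and one intercept is fitted to raw classifier scores so that the outputs can be read as probabilities. It is written in Python for illustration and is not taken from the authors' R scripts; the function names (`fit_platt`, `platt_predict`), the damped Newton solver, and the toy scores and labels are all assumptions. In practice the calibrator would be fitted on held-out scores, for example from the inner folds of a nested cross-validation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, n_iter=50, damping=1e-6, tol=1e-10):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log loss.

    `damping` is a small constant added to the Hessian diagonal for
    numerical stability only; it does not penalize the coefficients.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            d = p - y              # gradient contribution of this sample
            w = p * (1.0 - p)      # Hessian weight of this sample
            g_a += d * s
            g_b += d
            h_aa += w * s * s
            h_ab += w * s
            h_bb += w
        h_aa += damping
        h_bb += damping
        det = h_aa * h_bb - h_ab * h_ab
        # Solve the 2x2 Newton system H * step = gradient.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a -= da
        b -= db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def platt_predict(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return sigmoid(a * score + b)


# Hypothetical held-out scores for one class (1 = sample belongs to the class).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [platt_predict(s, a, b) for s in scores]
```

For a multiclass problem, one such calibrator could be fitted per class and the resulting probabilities renormalized to sum to one. The ridge-penalized multinomial regression (MR) described above instead fits all classes jointly, which this sketch does not attempt.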