Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany; Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany.
Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
Artif Intell Med. 2023 Sep;143:102589. doi: 10.1016/j.artmed.2023.102589. Epub 2023 Jun 1.
DNA methylation biomarkers have great potential to improve prognostic classification systems for patients with cancer. Machine learning (ML)-based analytic techniques might help overcome the challenges of analyzing high-dimensional data from relatively small samples. This systematic review summarizes the current use of ML-based methods in epigenome-wide studies for the identification of DNA methylation signatures associated with cancer prognosis.
We searched three electronic databases (PubMed, EMBASE, and Web of Science) for articles published until 2 January 2023. ML-based methods and workflows used to identify DNA methylation signatures associated with cancer prognosis were extracted and summarized. Two authors independently assessed the methodological quality of the included studies using a seven-item checklist adapted from 'A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies (PROBAST)' and from the 'Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK)'. The ML methods and workflows used in the included studies were summarized and visualized with a sunburst chart, a bubble chart, and Sankey diagrams.
Eighty-three studies were included in this review. Three major types of ML-based workflows were identified: 1) unsupervised clustering, 2) supervised feature selection, and 3) deep learning-based feature transformation. For these three workflows, the most frequently used ML techniques were consensus clustering, least absolute shrinkage and selection operator (LASSO), and autoencoders, respectively. The review revealed that the performance of these approaches has not yet been adequately evaluated and that methodological and reporting flaws were common in the identified studies using ML techniques.
There is great heterogeneity in the ML-based methodological strategies used by epigenome-wide studies to identify DNA methylation markers associated with cancer prognosis. In theory, most existing workflows cannot handle the high multicollinearity and potentially non-linear interactions in epigenome-wide DNA methylation data. Benchmarking studies are needed to compare the relative performance of various approaches for specific cancer types. Adherence to relevant methodological and reporting guidelines is urgently needed.