Yao Sijie, Wang Xuefeng
Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA.
Methods Mol Biol. 2023;2629:11-21. doi: 10.1007/978-1-0716-2986-4_2.
Discovering molecular biomarkers that predict patient survival is an essential step toward improving prognosis and therapeutic decision-making in the treatment of severe diseases such as cancer. Because omics datasets are high-dimensional, statistical methods such as the least absolute shrinkage and selection operator (Lasso) have been widely applied to cancer biomarker discovery. Owing to their scalability and demonstrated prediction performance, machine learning methods such as XGBoost and neural network models have also gained popularity recently. However, compared with traditional survival methods such as the Kaplan-Meier estimator and Cox regression, high-dimensional methods for survival outcomes remain less familiar to biomedical researchers. In this chapter, we discuss the key analytical procedures for using these methods to identify biomarkers associated with survival data. We also identify important considerations that emerged from the analysis of actual omics data, and discuss some typical instances of misapplication and misinterpretation of machine learning methods. Using lung cancer and head and neck cancer datasets as demonstrations, we provide step-by-step instructions and sample R code for prioritizing prognostic biomarkers.