Machine Intelligence in Clinical Neuroscience (MICN) Laboratory, Department of Neurosurgery, Clinical Neuroscience Center, University Hospital Zurich, University of Zurich, Zurich, Switzerland.
Neurosurgical Artificial Intelligence Laboratory Aachen (NAILA), Department of Neurosurgery, RWTH Aachen University Hospital, Aachen, Germany.
Acta Neurochir Suppl. 2022;134:33-41. doi: 10.1007/978-3-030-85292-4_5.
We illustrate the steps required to train and validate a simple, machine learning-based clinical prediction model for any binary outcome, such as the occurrence of a complication, in the statistical programming language R. To illustrate the methods applied, we supply a simulated database of 10,000 glioblastoma patients who underwent microsurgery, and predict 12-month survival. We walk the reader through each step, including import, checking, and splitting of datasets. In terms of pre-processing, we focus on how to practically implement imputation using a k-nearest neighbor algorithm, and how to perform feature selection using recursive feature elimination. When it comes to training models, we apply the theory discussed in Parts I-III. We show how to implement bootstrapping and how to evaluate and select models based on out-of-sample error. Specifically for classification, we discuss how to counteract class imbalance by using upsampling techniques. We discuss why reporting, at a minimum, accuracy, area under the curve (AUC), sensitivity, and specificity for discrimination, as well as slope and intercept for calibration (if possible alongside a calibration plot), is paramount. Finally, we explain how to arrive at a measure of variable importance using a universal, AUC-based method. We provide the full, structured code, as well as the complete glioblastoma survival database, for the readers to download and execute in parallel to this section.
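The workflow the abstract outlines (splitting, upsampling to counteract class imbalance, bootstrap resampling, and AUC-based model selection) can be sketched in R with the caret package. This is a minimal illustration only, not the chapter's actual companion code: the simulated data frame, its column names (e.g. `survival_12m`), and all parameter choices are assumptions made for the example.

```r
library(caret)

set.seed(42)

# Small simulated stand-in for the glioblastoma database
df <- data.frame(
  age = rnorm(500, 60, 10),
  kps = sample(c(60, 70, 80, 90, 100), 500, replace = TRUE),
  survival_12m = factor(sample(c("no", "yes"), 500, replace = TRUE,
                               prob = c(0.7, 0.3)))
)

# Stratified train/test split (80/20)
idx <- createDataPartition(df$survival_12m, p = 0.8, list = FALSE)
train_set <- df[idx, ]
test_set <- df[-idx, ]

# Bootstrap resampling with upsampling of the minority class;
# classProbs + twoClassSummary enable AUC ("ROC") as the selection metric.
# (k-nearest neighbor imputation, when needed, can be requested via
# preProcess = "knnImpute" in train().)
ctrl <- trainControl(method = "boot", number = 25,
                     sampling = "up",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(survival_12m ~ ., data = train_set,
             method = "glm", family = "binomial",
             metric = "ROC", trControl = ctrl)

# Out-of-sample discrimination on the held-out split
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$survival_12m)
print(cm$overall["Accuracy"])
```

In practice one would report, alongside accuracy, the AUC, sensitivity, and specificity, plus calibration slope and intercept, as the abstract emphasizes; caret's resampling results (`fit$results`) expose the bootstrap ROC estimate directly.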