Hu Chen, Steingrimsson Jon Arni
a Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
b Department of Biostatistics, School of Public Health, Brown University, Providence, RI, USA.
J Biopharm Stat. 2018;28(2):333-349. doi: 10.1080/10543406.2017.1377730. Epub 2017 Oct 19.
A crucial component of making individualized treatment decisions is accurately predicting each patient's disease risk. In clinical oncology, disease risks are often measured through time-to-event data, such as overall survival and progression/recurrence-free survival, and are often subject to censoring. Risk prediction models based on recursive partitioning methods are becoming increasingly popular, largely because they can handle nonlinear relationships, higher-order interactions, and/or high-dimensional covariates. The most popular recursive partitioning methods are versions of the Classification and Regression Tree (CART) algorithm, which builds a simple, interpretable, tree-structured model. To increase prediction accuracy, the random forest algorithm averages multiple CART trees, creating a flexible risk prediction model. Risk prediction models in clinical oncology commonly use traditional demographic and tumor pathological factors as well as high-dimensional genetic markers and treatment parameters from multimodality treatments. In this article, we describe the most commonly used extensions of the CART and random forest algorithms to right-censored outcomes. We focus on how they differ from the methods for noncensored outcomes, and on how the different splitting rules and methods for cost-complexity pruning affect these algorithms. We demonstrate these algorithms by analyzing a randomized Phase III clinical trial of breast cancer. We also conduct Monte Carlo simulations to compare the prediction accuracy of survival forests with that of more commonly used regression models under various scenarios. These simulation studies evaluate how sensitive the prediction accuracy is to the underlying model specification, the choice of tuning parameters, and the degree of covariate missingness.
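The abstract refers to splitting rules for right-censored outcomes without naming one. As a rough illustration only, the sketch below uses the two-sample log-rank statistic, a widely used choice for survival trees, to score candidate cut points on a single covariate. The function names `logrank_statistic` and `best_split` are hypothetical, the toy data are invented, and nothing here is claimed to be the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple


def logrank_statistic(
    times: Sequence[float], events: Sequence[int], in_left: Sequence[bool]
) -> float:
    """Two-sample log-rank chi-square statistic (left node vs. right node).

    times   : observed follow-up time (event or censoring time)
    events  : 1 if the event was observed, 0 if right-censored
    in_left : True if the subject falls in the left child node
    """
    # Only distinct times at which at least one event occurred contribute.
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Risk set: subjects still under observation just before time t.
        n = sum(1 for s in times if s >= t)
        n_left = sum(1 for s, g in zip(times, in_left) if s >= t and g)
        d = sum(1 for s, e in zip(times, events) if s == t and e)
        d_left = sum(
            1 for s, e, g in zip(times, events, in_left) if s == t and e and g
        )
        # Observed minus expected events in the left node at time t.
        obs_minus_exp += d_left - d * n_left / n
        if n > 1:
            p = n_left / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_split(
    x: Sequence[float], times: Sequence[float], events: Sequence[int]
) -> Tuple[Optional[float], float]:
    """Scan midpoints between sorted covariate values; keep the largest statistic."""
    best_cut: Optional[float] = None
    best_stat = 0.0
    values: List[float] = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        in_left = [v <= cut for v in x]
        stat = logrank_statistic(times, events, in_left)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


if __name__ == "__main__":
    # Invented toy data: covariate, follow-up time, event indicator.
    x = [1, 2, 3, 4, 5, 6]
    times = [1, 2, 3, 10, 11, 12]
    events = [1, 1, 1, 1, 1, 1]
    print(best_split(x, times, events))
```

A survival tree would apply a search like this recursively over all covariates, and a survival forest would repeat the tree-building step on bootstrap samples with random covariate subsets. Censored subjects enter only through the risk sets, which is the main point of difference from splitting rules for noncensored outcomes.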