Goedhart Jeroen M, Klausch Thomas, Janssen Jurriaan, van de Wiel Mark A
Department of Epidemiology & Data Science, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers Location AMC, Noord Holland, The Netherlands.
Department of Pathology, Cancer Center Amsterdam, Amsterdam University Medical Centers Location VUMC, Noord Holland, The Netherlands.
Stat Med. 2025 Feb 28;44(5):e70004. doi: 10.1002/sim.70004.
In clinical prediction applications, the sample size is often small relative to the number of covariates. Such data pose problems for variable selection and prediction, especially when the covariate-response relationship is complicated. To address these challenges, we propose incorporating external information on the covariates into Bayesian additive regression trees (BART), a sum-of-trees prediction model that uses priors on the tree parameters to prevent overfitting. To this end, we develop an empirical Bayes (EB) framework that estimates, assisted by a model, prior covariate weights in the BART model. The proposed EB framework also enables estimation of the other prior parameters of BART, providing an appealing and computationally efficient alternative to cross-validation. We show in simulations that the method finds relevant covariates and that it improves prediction compared to default BART. If the covariate-response relationship is non-linear, the method benefits from the flexibility of BART and outperforms regression-based learners. Finally, we show the benefit of incorporating external information in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data.
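The role of prior covariate weights can be illustrated with a small toy sketch. It is not the authors' estimator: the abstract does not specify the model that assists the EB step, so the pooling rule below, the function names `sample_split_variables` and `update_group_weights`, the `pseudo_count` parameter, and the group labels are all hypothetical. The sketch only shows the general idea that covariates with larger prior weight are proposed more often as splitting variables, and that split counts can be pooled over externally defined covariate groups to form new weights.

```python
import random
from collections import Counter


def sample_split_variables(weights, n_splits, rng):
    """Draw covariate indices for splitting rules in proportion to prior weights."""
    return rng.choices(range(len(weights)), weights=weights, k=n_splits)


def update_group_weights(split_vars, groups, pseudo_count=1.0):
    """Toy update (an assumption, not the paper's method): pool split counts
    within each external group, then spread each group's share evenly
    over the covariates in that group."""
    counts = Counter(split_vars)
    group_size = Counter(groups)
    # Start every group at a pseudo-count so no group receives zero weight.
    group_total = {g: pseudo_count for g in group_size}
    for j, g in enumerate(groups):
        group_total[g] += counts.get(j, 0)
    norm = sum(group_total.values())
    return [group_total[g] / norm / group_size[g] for g in groups]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical external grouping of five covariates.
    groups = ["clinical", "clinical", "mutation", "mutation", "mutation"]
    weights = [0.4, 0.3, 0.1, 0.1, 0.1]
    splits = sample_split_variables(weights, n_splits=200, rng=rng)
    print(update_group_weights(splits, groups))
```

In this sketch the updated weights always sum to one and are equal within a group, so the external information acts only through group membership.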