Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the "large N, small p" setting.

Author information

ICES, Toronto, ON, Canada.

Department of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada.

Publication information

Stat Methods Med Res. 2021 Jun;30(6):1465-1483. doi: 10.1177/09622802211002867. Epub 2021 Apr 13.

Abstract

Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. We hereto examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets which each comprised an independent derivation and validation sample collected from temporally distinct periods: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes based on each of the six learning methods to simulate outcomes in the derivation and validation samples based on 33 and 28 predictors in the AMI and CHF data sets, respectively. We applied six prediction methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples according to c-statistic, generalized R², Brier score, and calibration. While no method had uniformly superior performance across all six data-generating processes and eight performance metrics, (un)penalized logistic regression and boosted trees tended to have superior performance to the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large data sets.
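
Two of the performance metrics named in the abstract, the c-statistic and the Brier score, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `c_statistic` and `brier_score` and the toy data in the usage note are hypothetical, and the sketch assumes binary outcomes coded 0/1 with predicted probabilities from any fitted model.

```python
from typing import Sequence


def c_statistic(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Concordance (c) statistic for a binary outcome.

    Proportion of (event, non-event) pairs in which the event received
    the higher predicted probability; tied predictions count as 0.5.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```

For example, with hypothetical outcomes `[0, 0, 1, 1]` and predictions `[0.1, 0.4, 0.35, 0.8]`, three of the four event/non-event pairs are concordant, so `c_statistic` returns 0.75. A higher c-statistic indicates better discrimination, whereas a lower Brier score indicates better overall accuracy. The pairwise loop is quadratic in sample size, so for validation samples of the size described here a rank-based implementation would be preferable in practice.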

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/48df/8188999/5aa8a196b5b6/10.1177_09622802211002867-fig1.jpg
