A comparative study of forest methods for time-to-event data: variable selection and predictive performance.

Affiliation

Department of Biostatistics, School of Public Health (Guangdong Provincial Key Laboratory of Tropical Disease Research), Southern Medical University, Guangzhou, Guangdong, China.

Publication information

BMC Med Res Methodol. 2021 Sep 25;21(1):193. doi: 10.1186/s12874-021-01386-8.

Abstract

BACKGROUND

Forest methods, popular in machine learning, are an attractive alternative to the Cox model for time-to-event data. Random survival forests (RSF) are the most widely used survival forest method, but they have drawbacks, such as a selection bias towards covariates with many possible split points. Conditional inference forests (CIF) are known to reduce this selection bias through a two-step split procedure based on hypothesis tests, which separates variable selection from splitting, but they are computationally expensive. The recently proposed random forests with maximally selected rank statistics (MSR-RF) appear to be a substantial improvement over both RSF and CIF.

METHODS

We used a simulation study and a real-data application to compare the prediction performance and variable selection performance of three survival forest methods: RSF, CIF and MSR-RF. To evaluate variable selection, we pooled all simulations and calculated how often the correct variables ranked at the top by variable importance measure; a higher frequency indicates better selection ability. We used the integrated Brier score (IBS) and the c-index to measure the prediction accuracy of all three methods. A smaller IBS indicates better prediction.
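The two simpler evaluation quantities mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the dictionary layout for variable importance, and the cutoff `k` used to define "ranking at the top" are all assumptions, since the abstract does not state them. The c-index shown is Harrell's version for right-censored data, with tied observed times skipped for simplicity.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair is comparable when the subject with the shorter observed time
    had an event. It is concordant when that subject also has the higher
    predicted risk. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def top_k_selection_frequency(importances_per_run, true_vars, k):
    """Fraction of simulation runs in which each true variable is among
    the k variables with the largest importance (k is an assumed cutoff)."""
    counts = {v: 0 for v in true_vars}
    for vimp in importances_per_run:
        # vimp maps variable name -> importance for one simulated dataset
        top = sorted(vimp, key=vimp.get, reverse=True)[:k]
        for v in true_vars:
            if v in top:
                counts[v] += 1
    n_runs = len(importances_per_run)
    return {v: counts[v] / n_runs for v in true_vars}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    times = [1, 2, 3, 4]
    events = [1, 1, 0, 1]   # 1 = event observed, 0 = censored
    risk = [4, 3, 2, 1]     # higher score = higher predicted risk
    print(harrell_c_index(times, events, risk))

    runs = [
        {"x1": 0.5, "x2": 0.1, "x3": 0.3},
        {"x1": 0.1, "x2": 0.4, "x3": 0.3},
    ]
    print(top_k_selection_frequency(runs, ["x1"], k=2))
```

The IBS is omitted here because a correct version needs inverse-probability-of-censoring weights; in practice an established survival analysis package would normally be used for both metrics.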

RESULTS

Simulations show that the three forest methods differ only slightly in prediction performance. MSR-RF and RSF may perform better than CIF when the datasets contain only continuous or binary variables. For variable selection, when the datasets contain multiple categorical variables, RSF seems to have the lowest selection frequency in most cases. MSR-RF and CIF have higher selection rates, and CIF performs especially well with the interaction term. The degree of correlation among the variables has little effect on the selection frequency, which indicates that all three forest methods can handle correlated data. When the datasets contain only continuous variables, MSR-RF performs better. When the datasets contain only binary variables, RSF and MSR-RF have an advantage over CIF. As the variable dimension increases, MSR-RF and RSF seem to be more robust than CIF.

CONCLUSIONS

All three methods show advantages in prediction and variable selection performance under different situations. The recently proposed MSR-RF method has practical value and deserves wider adoption. In practice, it is important to choose the appropriate method according to the research aim and the nature of the covariates.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/29e9/8465777/f0bcbfaf31ac/12874_2021_1386_Fig1_HTML.jpg
