Demler Olga V, Pencina Michael J, D'Agostino Ralph B
Brigham and Women's Hospital, Division of Preventive Medicine, Harvard Medical School, 900 Commonwealth Avenue, Boston, MA 02118, USA.
Stat Med. 2013 Oct 30;32(24):4196-210. doi: 10.1002/sim.5824. Epub 2013 May 3.
In this paper, we investigate how the correlation structure of independent variables affects the discrimination of a risk prediction model. Using multivariate normal data and a binary outcome, we prove that zero correlation among predictors is often detrimental to discrimination in a risk prediction model, and that negatively correlated predictors with positive effect sizes are beneficial. A very high multiple R-squared from regressing the new predictor on the old ones can also be beneficial. As a practical guide to new variable selection, we recommend selecting predictors that are negatively correlated with the risk score based on the existing variables. This step is easy to implement even when the number of new predictors is large. We illustrate our results using real-life Framingham data, which suggest that the conclusions hold outside of normality. The findings presented in this paper might be useful for preliminary selection of potentially important predictors, especially in situations where the number of predictors is large.
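The effect of correlation on discrimination can be illustrated with a small numerical sketch. It is not taken from the paper. It assumes two predictors with unit variance, a common correlation `rho` in events and non-events, and standardized mean differences `d1` and `d2` between the two groups. Under these assumptions the AUC of the linear discriminant score is Φ(√(M²/2)), where M² is the squared Mahalanobis distance between the group means. The function names and the example values are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lda_auc(d1: float, d2: float, rho: float) -> float:
    """AUC of the linear discriminant score for two normal predictors.

    Assumptions (illustrative, not from the paper): unit variances,
    the same correlation rho in events and non-events, and mean
    differences d1 and d2 between the groups.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Squared Mahalanobis distance for a 2x2 correlation matrix.
    m2 = (d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (1.0 - rho * rho)
    return normal_cdf(math.sqrt(m2 / 2.0))


if __name__ == "__main__":
    # Same positive effect sizes, three different correlations.
    for rho in (-0.5, 0.0, 0.5):
        print(f"rho={rho:+.1f}  AUC={lda_auc(0.5, 0.5, rho):.3f}")
```

With equal positive effect sizes, M² reduces to 2d²/(1 + rho), so in this sketch the AUC is highest for the negatively correlated pair and falls as the correlation moves toward zero and then becomes positive.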