Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, USA.
Perspectrix, Pittsboro, NC, USA.
Sci Rep. 2022 Mar 31;12(1):5440. doi: 10.1038/s41598-022-09415-2.
Regularized regression analysis is a mature analytic approach for identifying weighted sums of variables that predict outcomes. We present a novel Coarse Approximation Linear Function (CALF) to frugally select important predictors and build simple but powerful predictive models. CALF is a linear regression strategy applied to normalized data in which every nonzero weight is +1 or -1. The qualitative (linearly invariant) metric to be optimized can be, for a binary response, the Welch (Student) t-test p-value or the area under the curve (AUC) of the receiver operating characteristic, or, for a real-valued response, the Pearson correlation. Predictor weighting is critically important when developing risk prediction models. Although counterintuitive, qualitative metrics can in fact favor CALF with ±1 weights over algorithms that produce real-number weights. Moreover, while regression methods may be expected to change most or all weight values after even small changes in the input data (e.g., discarding a single subject out of hundreds), CALF weights generally remain unchanged. Similarly, some regression methods applied to collinear or nearly collinear variables yield weight vectors whose magnitude or direction (in p-space) is unpredictable. In contrast, if some predictors are linearly dependent or nearly so, CALF simply chooses at most one of them (the most informative, if any) and ignores the others, thus avoiding the inclusion of two or more collinear variables in the model.
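The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea, assuming a greedy forward search: at each step, one unused normalized predictor is added with weight +1 or -1 if doing so improves the Pearson correlation between the running sum and a real-valued response. The function names `zscore` and `greedy_pm1_fit`, the `max_terms` parameter, and the stopping rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def zscore(X):
    """Normalize each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def greedy_pm1_fit(X, y, max_terms=5):
    """Hypothetical greedy search for weights in {-1, 0, +1}.

    Assumption: the metric is the Pearson correlation between the
    weighted sum of normalized predictors and the response y.
    """
    Z = zscore(X)
    n, p = Z.shape
    weights = np.zeros(p, dtype=int)   # 0 means "predictor not selected"
    score = np.zeros(n)                # running weighted sum
    best_r = 0.0
    for _ in range(max_terms):
        best = None
        for j in range(p):
            if weights[j] != 0:
                continue               # each predictor is used at most once
            for w in (1, -1):
                r = np.corrcoef(score + w * Z[:, j], y)[0, 1]
                # Keep only candidates that strictly improve the metric.
                if r > best_r + 1e-12 and (best is None or r > best[0]):
                    best = (r, j, w)
        if best is None:
            break                      # no remaining predictor helps
        best_r, j, w = best
        weights[j] = w
        score = score + w * Z[:, j]
    return weights, best_r
```

In this sketch, a near-duplicate of an already selected predictor tends to be skipped, because adding it a second time distorts the sum and usually lowers the correlation; this mirrors the collinearity behavior described above without claiming to reproduce it exactly.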