Arnes Jo Inge, Hapfelmeier Alexander, Horsch Alexander, Braaten Tonje
Department of Computer Science, Faculty of Science and Technology, UiT The Arctic University of Norway, Tromsø, Norway.
Institute of AI and Informatics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany.
Front Epidemiol. 2023 Dec 18;3:1283705. doi: 10.3389/fepid.2023.1283705. eCollection 2023.
Non-linear regression modeling is common in epidemiology, both for prediction and for estimating relationships between predictor and response variables. Restricted cubic spline (RCS) regression is one such method and is highly relevant to, for example, Cox proportional hazards regression analysis. RCS regression models non-linear relationships with third-order polynomials joined at knot points. The standard approach places knots at a regular sequence of quantiles between the outer boundaries. With a relatively high number of knots, a regression curve can easily be fitted to the sample, but this invites overfitting: the model fits the given sample well but does not generalize to other samples. A low knot count is therefore preferred. However, the standard knot selection process can lead to underperformance in the sparser regions of the predictor variable, especially with few knots, and to overfitting in the denser regions. We present a simple greedy search algorithm that uses a backward method for knot selection. In simulation experiments, it shows lower prediction error and lower Bayesian information criterion scores than the standard knot selection process. We have implemented the algorithm as part of an open-source R package, knutar.
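The idea of backward knot selection can be illustrated with a minimal Python sketch. This is not the knutar implementation and does not reproduce its API. It assumes an ordinary least-squares fit with Gaussian errors, a truncated-power RCS basis, a starting set of quantile-based knots, and the Bayesian information criterion (BIC) as the score that decides which knot to drop at each step. The function names (rcs_basis, bic, backward_knot_search), the defaults n_start and n_min, and the choice to let any knot be removed, including the outermost ones, are all hypothetical choices made for illustration.

```python
import numpy as np


def rcs_basis(x, knots):
    """Design matrix for a restricted cubic spline: intercept, linear term,
    and k - 2 non-linear terms that are linear beyond the outer knots."""
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return np.maximum(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2  # keeps the columns on a scale similar to x
    gap = t[-1] - t[-2]
    cols = [np.ones_like(x), x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        cols.append(term / scale)
    return np.column_stack(cols)


def bic(x, y, knots):
    """BIC of a least-squares RCS fit, up to an additive constant
    (assumption: Gaussian errors; the variance parameter is not counted)."""
    X = rcs_basis(x, knots)
    y = np.asarray(y, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n = len(y)
    return n * np.log(rss / n) + X.shape[1] * np.log(n)


def backward_knot_search(x, y, n_start=12, n_min=3):
    """Greedy backward search: start with many quantile knots, repeatedly
    drop the knot whose removal gives the lowest BIC, and return the best
    knot set seen along the way."""
    probs = np.linspace(0.05, 0.95, n_start)
    knots = list(np.unique(np.quantile(x, probs)))  # unique guards against ties
    best_knots, best_bic = knots, bic(x, y, knots)
    while len(knots) > n_min:
        candidates = [knots[:i] + knots[i + 1:] for i in range(len(knots))]
        scores = [bic(x, y, c) for c in candidates]
        i = int(np.argmin(scores))
        knots = candidates[i]
        if scores[i] < best_bic:
            best_knots, best_bic = knots, scores[i]
    return best_knots, best_bic


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.exponential(2.0, 500)  # skewed predictor with a sparse right tail
    y = np.log1p(x) + rng.normal(0.0, 0.2, 500)
    selected, score = backward_knot_search(x, y)
    print(len(selected), "knots selected, BIC =", round(score, 1))
```

Because the search keeps the best-scoring knot set seen at any step, the returned BIC is never higher than that of the initial quantile-based placement on the same sample. Whether this carries over to lower prediction error on new samples is an empirical question that this sketch does not address.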