Yang Shuo, Su Huaan, Zhang Nanxiang, Han Yuduan, Ge Yingfeng, Fei Yi, Liu Ying, Hilowle Abdullahi, Xu Peng, Zhang Jinxin
Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, China.
The People's Hospital of Jiangmen, No. 172 Gaodi Li, Pengjiang District, Jiangmen, Guangdong, 529000, China.
BMC Med Res Methodol. 2025 Mar 12;25(1):70. doi: 10.1186/s12874-025-02522-4.
Assuming a linear relationship between continuous predictors and outcomes in clinical prediction models is often inappropriate: truly linear relationships are rare, and the assumption can lead to biased estimates and inaccurate conclusions. Our research group previously addressed the case of a single U-shaped independent variable. Including multiple U-shaped predictors can improve predictive accuracy by capturing nuanced relationships, but it also introduces challenges such as increased complexity and potential overfitting. This study aims to extend our previous results to these more common scenarios, thereby facilitating more comprehensive and practical investigations.
In this study, we propose a novel approach, the Recursive Gradient Scanning Method (RGS), for discretizing multiple continuous variables that exhibit U-shaped relationships with the natural logarithm of the odds ratio (lnOR). The RGS method has two steps: first, it conducts fine screening from the 2.5th to 97.5th percentiles of the lnOR; then, it uses an iterative process that compares AIC values to identify the optimal categorical variables. We conducted a Monte Carlo simulation study to evaluate the performance of the RGS method, considering different correlation levels, sample sizes, missing rates, and symmetry levels of the U-shaped relationships. To compare the RGS method with other common approaches (such as the median, Q-Q, and minimum P-value methods), we used a real dataset to assess both the predictive ability (e.g., AUC) and the goodness of fit (e.g., AIC) of logistic regression models with variables discretized at the different cut-points.
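As a rough illustration of the idea of scanning candidate cut-points and comparing AIC, the following minimal sketch handles a single simulated U-shaped predictor. It is not the RGS algorithm itself: it omits the recursive, multi-variable step, and it scans percentiles of the predictor, which is an assumption about how the percentile range is applied. The function names (`categorical_aic`, `scan_cut_points`, `simulate`), the grid size, and the simulation coefficients are hypothetical choices made for this example.

```python
import bisect
import math
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def categorical_aic(x, y, cuts):
    """AIC of a logistic model whose only predictor is x discretized at `cuts`.

    With one categorical predictor the model is saturated, so the fitted
    probability in each level equals that level's observed event rate and
    the log-likelihood has a closed form (no iterative fitting needed).
    """
    cuts = sorted(cuts)
    n_levels = len(cuts) + 1
    events = [0] * n_levels
    totals = [0] * n_levels
    for xi, yi in zip(x, y):
        level = bisect.bisect_right(cuts, xi)
        totals[level] += 1
        events[level] += yi
    if min(totals) == 0:
        return math.inf  # an empty level is not a usable split
    log_lik = 0.0
    for e, n in zip(events, totals):
        for count, p in ((e, e / n), (n - e, 1 - e / n)):
            if count > 0:
                log_lik += count * math.log(p)
    return 2 * n_levels - 2 * log_lik  # intercept + (n_levels - 1) dummies


def scan_cut_points(x, y, n_grid=30):
    """Try every (lower, upper) pair on a 2.5th-97.5th percentile grid."""
    xs = sorted(x)
    grid = [percentile(xs, 2.5 + 95.0 * i / (n_grid - 1)) for i in range(n_grid)]
    best = (math.inf, None, None)
    for i, lower in enumerate(grid):
        for upper in grid[i + 1:]:
            aic = categorical_aic(x, y, [lower, upper])
            if aic < best[0]:
                best = (aic, lower, upper)
    return best


def simulate(n=2000, seed=0):
    """Binary outcome whose log-odds are a symmetric U-shaped function of x."""
    rng = random.Random(seed)
    x = [rng.uniform(-3, 3) for _ in range(n)]
    y = []
    for xi in x:
        logit = -2.0 + 0.8 * xi * xi  # hypothetical coefficients
        y.append(1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0)
    return x, y


x, y = simulate()
best_aic, lower, upper = scan_cut_points(x, y)
median_aic = categorical_aic(x, y, [percentile(sorted(x), 50)])
print(f"scan: cuts=({lower:.2f}, {upper:.2f}), AIC={best_aic:.1f}")
print(f"median split: AIC={median_aic:.1f}")
```

In this toy setting a median split places the cut near the bottom of the U, so both halves have similar event rates and the two-level model fits poorly, whereas a pair of cut-points can separate the low-risk middle from the two high-risk tails.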
Both the simulation and empirical studies consistently demonstrated the effectiveness of the RGS method. In the simulation studies, the RGS method outperformed other common discretization methods in discrimination ability and overall performance of the logistic regression models across the U-shaped scenarios considered (varying correlation levels, sample sizes, missing rates, and symmetry levels). Similarly, the empirical study showed that the optimal cut-points identified by RGS had superior clinical predictive power, as measured by metrics such as AUC, compared with traditional methods.
The simulation and empirical studies demonstrated that the RGS method outperformed other common discretization methods in terms of goodness of fit and predictive ability. In future work, we will focus on addressing challenges related to separation or missing binary responses, and more data will be required to validate the method.