Qu Kexin, Gainey Monique, Kanekar Samika S, Nasrim Sabiha, Nelson Eric J, Garbern Stephanie C, Monjory Mahmuda, Alam Nur H, Levine Adam C, Schmid Christopher H
Department of Biostatistics, Brown University, Providence, Rhode Island, United States of America.
Department of Emergency Medicine, Rhode Island Hospital, Providence, Rhode Island, United States of America.
PLOS Digit Health. 2025 May 6;4(5):e0000820. doi: 10.1371/journal.pdig.0000820. eCollection 2025 May.
Many comparisons of statistical regression and machine learning algorithms for building clinical predictive models use inadequate methods to build the regression models and lack proper independent test sets on which to externally validate the models. Proper comparisons for models of ordinal categorical outcomes do not exist. We set out to compare model discrimination for four regression and machine learning methods in a case study predicting the ordinal outcome of severe, some, or no dehydration among patients with acute diarrhea presenting to a large medical center in Bangladesh, using data from the NIRUDAK study derivation and validation cohorts. Proportional Odds Logistic Regression (POLR), penalized ordinal regression (RIDGE), classification tree (CART), and random forest (RF) models were built to predict dehydration severity and compared using three ordinal discrimination indices: ordinal c-index (ORC), generalized c-index (GC), and average dichotomous c-index (ADC). Performance was evaluated on models developed on the training data, on the same models applied to an external test set, and through internal validation with three bootstrap algorithms to correct for overoptimism. RF had superior discrimination on the original training data set, but its performance was more similar to that of the other three methods after internal validation using the bootstrap. Performance for all models was lower on the prospective test dataset, with particularly large reductions for RF and RIDGE. POLR had the best performance in the test dataset and was also the most efficient, with the smallest final model size. Clinical prediction models for ordinal outcomes, just like those for binary and continuous outcomes, need to be prospectively validated on external test sets when possible, because internal validation may give an overly optimistic picture of model performance.
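To illustrate the kind of discrimination measure and internal validation the abstract describes, the sketch below is a minimal, hypothetical implementation (not the study's actual code): it assumes the ORC is the average of pairwise c-indices over all pairs of outcome categories, and it uses a standard Harrell-style bootstrap optimism correction with a one-feature least-squares score standing in for the fitted models.

```python
import numpy as np
from itertools import combinations

def pairwise_c(score, y, a, b):
    """C-index restricted to subjects in outcome categories a < b: the
    probability that a subject in the higher category b receives a higher
    risk score than one in the lower category a (ties count one half)."""
    sa, sb = score[y == a], score[y == b]
    diff = sb[:, None] - sa[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def ordinal_c_index(score, y):
    """ORC (assumed definition): average pairwise c-index over all
    ordered pairs of outcome categories."""
    cats = np.unique(y)
    return np.mean([pairwise_c(score, y, a, b) for a, b in combinations(cats, 2)])

def bootstrap_optimism_corrected_orc(x, y, n_boot=200, seed=0):
    """Harrell-style optimism correction: refit the scoring rule on each
    bootstrap resample, compute its ORC on the resample (apparent) and on
    the original data (test), and subtract the mean optimism from the
    full-data apparent ORC. The 'model' here is a hypothetical one-feature
    linear score fit by least squares, not any model from the study."""
    rng = np.random.default_rng(seed)
    fit = lambda xs, ys: np.polyfit(xs, ys, 1)  # returns (slope, intercept)
    apparent_full = ordinal_c_index(np.polyval(fit(x, y), x), y)
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample with replacement
        coef = fit(x[idx], y[idx])
        apparent = ordinal_c_index(np.polyval(coef, x[idx]), y[idx])
        test = ordinal_c_index(np.polyval(coef, x), y)
        optimism.append(apparent - test)
    return apparent_full - np.mean(optimism)
```

A perfectly ordered score yields an ORC of 1.0, a perfectly reversed score 0.0, and the optimism-corrected estimate is typically somewhat below the apparent (training-data) value, mirroring the shrinkage the study observed after internal validation.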
Regression methods can perform as well as more automated machine learning methods if constructed with attention to potential nonlinear associations. Because regression models are often more interpretable clinically, their use should be encouraged.
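The point about attention to nonlinear associations can be sketched with synthetic data (these variables are illustrative, not the NIRUDAK predictors): a regression restricted to a linear term misses a U-shaped relationship entirely, while simply adding a squared term recovers it, without resorting to a more automated machine learning method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = x**2 + rng.normal(0, 0.3, 200)   # true association is U-shaped

def lstsq_r2(design, y):
    """Fit ordinary least squares and return in-sample R^2."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1 - resid.var() / y.var()

linear = np.column_stack([np.ones_like(x), x])            # intercept + x only
expanded = np.column_stack([np.ones_like(x), x, x**2])    # adds a squared term

r2_linear = lstsq_r2(linear, y)      # near zero: slope averages out the U shape
r2_expanded = lstsq_r2(expanded, y)  # high: the quadratic term captures it
```

The same idea carries over to ordinal regression models such as POLR, where polynomial or spline terms for continuous predictors serve the role of the squared term here.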