EPIUnit - Instituto de Saúde Pública, Universidade do Porto, Rua das Taipas, nº 135, 4050-600, Porto, Portugal.
Laboratório para a Investigação Integrativa e Translacional em Saúde Populacional (ITR), Porto, Portugal.
Sci Rep. 2022 Jun 22;12(1):10587. doi: 10.1038/s41598-022-13946-z.
The timely identification of cohort participants at higher risk of attrition is important for early intervention and efficient use of research resources. Machine learning may have advantages over conventional approaches in improving discrimination, because it can analyse complex interactions among predictors. We developed predictive models of attrition using a conventional regression model and several machine learning methods. A total of 542 very preterm (< 32 gestational weeks) infants born in Portugal and enrolled in the Effective Perinatal Intensive Care in Europe (EPICE) cohort were included. We tested one model with a fixed number of predictors (Baseline) and a second with a dynamic number of variables added from each follow-up (Incremental). Eight classification methods were applied: AdaBoost, Artificial Neural Networks, Functional Trees, J48, J48Consolidated, K-Nearest Neighbours, Random Forest and Logistic Regression. Performance was compared using AUC-PR (area under the precision-recall curve), Accuracy, Sensitivity and F-measure. Attrition rates at the four follow-ups were 16%, 25%, 13% and 17%, respectively. Both models demonstrated good predictive performance, with AUC-PR ranging from 69 to 94.1 in the Baseline model and from 72.5 to 97.1 in the Incremental model. Of all the methods, Random Forest presented the best performance at every follow-up [AUC-PR: 94.1 (2.0); AUC-PR: 91.2 (1.2); AUC-PR: 97.1 (1.0); AUC-PR: 96.5 (1.7)]. Logistic Regression performed well below Random Forest. The top-ranked predictors were the same for both models at all follow-ups: birthweight, gestational age, maternal age, and length of hospital stay. Random Forest presented the highest capacity for prediction and provided interpretable predictors. Researchers involved in cohorts can benefit from our robust models to prepare for and prevent loss to follow-up by directing efforts toward individuals at higher risk.
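As an illustration of the four performance measures named above, the following is a minimal sketch in plain Python. It is not the study's code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example, and AUC-PR is approximated here with the step-wise "average precision" estimator, which may differ from the estimator the authors used. Labels use 1 for a participant lost to follow-up and 0 for a retained participant.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 = attrition."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall: share of true attrition cases that were flagged."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1: harmonic mean of precision and sensitivity."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    Ranks cases by descending score and averages the precision observed
    at each true positive. Tied scores are not treated specially.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive cases")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision at this recall step
    return total / n_pos


# Toy data (hypothetical): predicted attrition risk for eight participants.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"AUC-PR (average precision): {average_precision(y_true, scores):.3f}")
print(f"Accuracy:    {accuracy(y_true, y_pred):.3f}")
print(f"Sensitivity: {sensitivity(y_true, y_pred):.3f}")
print(f"F-measure:   {f_measure(y_true, y_pred):.3f}")
```

AUC-PR is computed from the ranked risk scores, whereas Accuracy, Sensitivity and F-measure depend on the chosen threshold. Because attrition is the minority class at every follow-up (13% to 25%), a precision-recall summary is generally more informative than Accuracy alone, since a classifier that predicts "retained" for everyone would still score high on Accuracy.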