Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
Netherlands Institute for Health Services Research (Nivel), Utrecht, The Netherlands.
BMJ Open. 2022 Aug 30;12(8):e060458. doi: 10.1136/bmjopen-2021-060458.
OBJECTIVES: Heart failure (HF) is a common health problem with high mortality and morbidity. Detecting potential cases earlier may allow earlier intervention, which may slow progression in some patients. Ideally, screening of everyone in an age group would reuse data that have already been recorded, such as general practitioner (GP) data. Furthermore, it is essential to evaluate the number of people who need to be screened to find one patient using true incidence rates, as this indicates how well the results generalise to the true population. Therefore, we aim to create a machine learning model for the prediction of HF using GP data and to evaluate the number needed to screen with true incidence rates.
DESIGN, SETTINGS AND PARTICIPANTS: GP data from 8543 patients (covering the period from 2 years to 1 year before diagnosis) and controls aged 70+ years were obtained retrospectively for 01 January 2012 to 31 December 2019 from the Nivel Primary Care Database. Codes for chronic illnesses, complaints, diagnostics and medication were extracted. The data were split into a training set and a test set. Datasets were created describing demographics, the presence of codes (non-sequential) and codes that follow one another (sequential). Logistic regression, random forest and XGBoost models were trained. The predicted outcome was the presence of HF after 1 year. The case:control ratio in the test set matched true incidence rates (1:45).
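To make the distinction between the non-sequential and sequential datasets concrete, the following is a minimal Python sketch, not the study's actual pipeline. It assumes a patient history is an ordered list of code strings (the codes shown are placeholders), that "sequential" can be approximated by counting pairs of consecutive codes, and that the 1:45 test-set ratio is reached by random subsampling. The abstract does not specify any of these details, and the function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def presence_features(codes: Sequence[str]) -> Dict[str, int]:
    """Non-sequential view: 1 for every code that occurs at least once."""
    return {code: 1 for code in set(codes)}


def sequence_features(codes: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Sequential view (assumed form): count each ordered pair of consecutive codes."""
    counts: Dict[Tuple[str, str], int] = {}
    for first, second in zip(codes, codes[1:]):
        counts[(first, second)] = counts.get((first, second), 0) + 1
    return counts


def sample_to_ratio(
    cases: Sequence[str],
    controls: Sequence[str],
    controls_per_case: int = 45,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly subsample so the result holds `controls_per_case` controls per case."""
    rng = random.Random(seed)
    # Keep as many cases as the available controls can support at the target ratio.
    n_cases = min(len(cases), len(controls) // controls_per_case)
    kept_cases = rng.sample(list(cases), n_cases)
    kept_controls = rng.sample(list(controls), n_cases * controls_per_case)
    return kept_cases, kept_controls


# Placeholder history: the order of the codes matters only for the sequential view.
history = ["K86", "R02", "K86", "R02"]
print(presence_features(history))
print(sequence_features(history))
```

In this sketch both views are derived from the same history, so the sequential features can only add information about ordering on top of what the presence features already encode.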
RESULTS: Demographics alone yielded moderate performance (area under the curve (AUC) 0.692, CI 0.677 to 0.706). Adding non-sequential information, combined with a logistic regression model, performed best and significantly improved performance (AUC 0.772, CI 0.759 to 0.785, p<0.001). Further adding sequential information did not alter performance significantly (AUC 0.767, CI 0.754 to 0.780, p=0.07). The number needed to screen, expressed as false positives per true positive, dropped from 14.11 to 5.99.
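The false-positives-per-true-positive figure reported above depends on the case:control ratio of the test set as well as on the model. The following Python sketch illustrates the arithmetic under stated assumptions: the function names are hypothetical, the example labels are invented for illustration, and the decision threshold used in the study is not given in the abstract. At a ratio of 45 controls per case, the expected value is 45 × (1 − specificity) / sensitivity.

```python
from typing import Sequence


def false_positives_per_true_positive(
    y_true: Sequence[int], y_flagged: Sequence[int]
) -> float:
    """Count flagged controls (false positives) per flagged case (true positive)."""
    if len(y_true) != len(y_flagged):
        raise ValueError("y_true and y_flagged must have the same length")
    tp = sum(1 for t, f in zip(y_true, y_flagged) if t == 1 and f == 1)
    fp = sum(1 for t, f in zip(y_true, y_flagged) if t == 0 and f == 1)
    if tp == 0:
        raise ValueError("no true positives among the flagged patients")
    return fp / tp


def expected_fp_per_tp(
    sensitivity: float, specificity: float, controls_per_case: float = 45.0
) -> float:
    """Expected false positives per true positive at a given operating point."""
    return controls_per_case * (1.0 - specificity) / sensitivity


# Invented example at a 1:45 ratio: 2 cases and 90 controls.
# The model flags 1 of the cases and 9 of the controls.
y_true = [1, 1] + [0] * 90
y_flagged = [1, 0] + [1] * 9 + [0] * 81
print(false_positives_per_true_positive(y_true, y_flagged))  # 9 FP / 1 TP = 9.0
```

Because the ratio enters as a multiplier, the same model evaluated on a balanced 1:1 test set would show a figure 45 times smaller, which is why evaluating at the true incidence rate matters for this metric.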
CONCLUSION: This study created a model able to identify patients with impending HF a year before diagnosis.