Department of Radiology, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi, Taiwan.
Department of Neurology, Tainan Sin Lau Hospital, Tainan, Taiwan.
Front Public Health. 2022 Sep 29;10:1009164. doi: 10.3389/fpubh.2022.1009164. eCollection 2022.
Identifying patients at high risk of stroke-associated pneumonia (SAP) may permit targeting potential interventions to reduce its incidence. We aimed to evaluate machine learning (ML) and natural language processing techniques, applied to structured data and unstructured clinical text, for predicting SAP, and to compare their performance with that of conventional risk scores.
We used data linking a hospital stroke registry with a deidentified research database that included electronic health records and administrative claims data. Natural language processing was applied to extract textual features from clinical notes. The random forest algorithm was used to build ML models. The predictive performance of the ML models was compared with that of the ADS, ISAN, PNA, and ACDD scores using the area under the receiver operating characteristic curve (AUC).
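The AUC used for this comparison can be read as the probability that a randomly chosen patient who developed SAP receives a higher predicted risk than a randomly chosen patient who did not. The following is a minimal sketch of that calculation in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the study's data or code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win,
    which matches the Mann-Whitney U formulation of the AUC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = developed SAP, 0 = did not.
labels = [1, 0, 1, 0, 0, 1]
model_a = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3]  # e.g. predicted risk from an ML model
model_b = [0.6, 0.5, 0.4, 0.7, 0.1, 0.8]  # e.g. a rescaled risk score

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of patients, which is acceptable for illustration; a rank-based implementation would be preferable for a cohort of several thousand patients.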
Among 5,913 acute stroke patients hospitalized between Oct 2010 and Sep 2021, 450 (7.6%) developed SAP within the first 7 days after stroke onset. The ML model based on both textual features and structured variables had the highest AUC [0.840, 95% confidence interval (CI) 0.806-0.875], significantly higher than those of the ML model based on structured variables alone (0.828, 95% CI 0.793-0.863, p = 0.040), ACDD (0.807, 95% CI 0.766-0.849, p = 0.041), ADS (0.803, 95% CI 0.762-0.845, p = 0.013), ISAN (0.795, 95% CI 0.752-0.837, p = 0.009), and PNA (0.778, 95% CI 0.735-0.822, p < 0.001). All models demonstrated adequate calibration except for the ADS score.
The ML model based on both textual features and structured variables performed better than conventional risk scores in predicting SAP. The workflow used to generate ML prediction models can be disseminated for local adaptation by individual healthcare organizations.