Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA.
Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, Ohio, USA.
J Am Med Inform Assoc. 2021 Sep 18;28(10):2116-2127. doi: 10.1093/jamia/ocab116.
Substance use screening in adolescence is unstandardized and often documented in clinical notes, rather than in structured electronic health records (EHRs). The objective of this study was to integrate logic rules with state-of-the-art natural language processing (NLP) and machine learning technologies to detect substance use information from both structured and unstructured EHR data.
Pediatric patients (10-20 years of age) with any encounter between July 1, 2012, and October 31, 2017, were included (n = 3890 patients; 19 478 encounters). EHR data were extracted at each encounter, manually reviewed for substance use (alcohol, tobacco, marijuana, opiate, any use), and coded as lifetime use, current use, or family use. Logic rules mapped structured EHR indicators to screening results. A knowledge-based NLP system and a deep learning model detected substance use information from unstructured clinical narratives. System performance was evaluated using positive predictive value, sensitivity, negative predictive value, specificity, and area under the receiver-operating characteristic curve (AUC).
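The five evaluation measures named above can be computed directly from labeled predictions. The following is a minimal sketch, not the study's evaluation code: the function names (`confusion_counts`, `screening_metrics`, `auc`) and the 0/1 label encoding are assumptions made for illustration, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
from math import nan


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def screening_metrics(y_true, y_pred):
    """PPV, sensitivity, NPV, and specificity from binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "specificity": _ratio(tn, tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return nan
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the decision threshold applied to a model's scores, whereas AUC summarizes ranking quality across all thresholds, which is why the measures are reported together.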
The dataset included 17 235 structured indicators and 27 141 clinical narratives. Manual review of clinical narratives captured 94.0% of positive screening results, while structured EHR data captured 22.0%. Logic rules detected screening results from structured data with a sensitivity of 1.0 and a specificity of 0.99. The knowledge-based system detected substance use information from clinical narratives with an AUC of 0.86, a sensitivity of 0.79, and a specificity of 0.88. The deep learning model improved AUC and sensitivity (0.88 and 0.81, respectively), with a lower specificity of 0.85. Finally, integrating predictions from structured and unstructured data achieved high detection capacity across all cases (AUC, 0.96; sensitivity, 0.85; specificity, 0.87).
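The abstract does not state how predictions from the two data sources were integrated. As a purely illustrative sketch, one simple possibility is to treat a rule-derived result as a score of 0 or 1 and take the maximum of the available scores, so that a positive signal from either source flags the encounter. The function name `combine_scores` and the max rule are assumptions, not the authors' method.

```python
from typing import Optional


def combine_scores(
    structured_positive: Optional[bool],
    narrative_score: Optional[float],
) -> Optional[float]:
    """Combine a rule-based result with a narrative model score.

    Hypothetical integration rule (an assumption, not from the study):
    a structured result is mapped to 1.0 or 0.0, and the combined score
    is the maximum of whichever scores are available. None means that
    source had no data for the encounter.
    """
    available = []
    if structured_positive is not None:
        available.append(1.0 if structured_positive else 0.0)
    if narrative_score is not None:
        available.append(narrative_score)
    return max(available) if available else None
```

For example, `combine_scores(False, 0.7)` yields 0.7 under this rule, so a negative structured indicator does not suppress a positive signal found in the clinical narrative.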
It is feasible to detect substance use screening and results among pediatric patients using logic rules, NLP, and machine learning technologies.