Department of Biomedical Informatics, Columbia University, New York, NY.
Division of Infectious Diseases, Department of Medicine, Columbia University, New York, NY.
J Acquir Immune Defic Syndr. 2018 Feb 1;77(2):160-166. doi: 10.1097/QAI.0000000000001580.
Universal HIV screening programs are costly, labor intensive, and often fail to identify high-risk individuals. Automated risk assessment methods that leverage longitudinal electronic health records (EHRs) could catalyze targeted screening programs. Although social and behavioral determinants of health are typically captured in narrative documentation, previous analyses have considered only structured EHR fields. We examined whether natural language processing (NLP) would improve predictive models of HIV diagnosis.
The study cohort comprised 181 HIV-positive individuals who received care at New York Presbyterian Hospital before a confirmatory HIV diagnosis and 543 HIV-negative controls selected by propensity score matching. EHR data recorded before HIV diagnosis, including demographics, laboratory tests, diagnosis codes, and unstructured notes, were extracted for modeling. Three predictive models were developed using machine-learning algorithms: (1) a baseline model with only structured EHR data, (2) baseline plus NLP topics, and (3) baseline plus NLP clinical keywords.
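As an illustration of the third model's input, the sketch below appends keyword counts from free-text notes to a structured feature vector. It is a minimal example under stated assumptions, not the authors' pipeline: the function names, the tokenization, and the two-element structured vector are hypothetical, and the keyword list contains only the four terms named in the abstract's results.

```python
import re
from collections import Counter

# Hypothetical keyword list: only the four terms reported in the abstract.
KEYWORDS = ("msm", "unprotected", "hiv", "methamphetamine")


def keyword_counts(note_text, keywords=KEYWORDS):
    """Count exact-token matches of each keyword in one clinical note."""
    # Lowercase and split on non-alphanumeric characters (an assumed tokenizer).
    tokens = re.findall(r"[a-z0-9]+", note_text.lower())
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


def build_features(structured, notes, keywords=KEYWORDS):
    """Return structured features followed by per-keyword totals over all notes."""
    totals = [0] * len(keywords)
    for note in notes:
        for i, c in enumerate(keyword_counts(note, keywords)):
            totals[i] += c
    return list(structured) + totals


# Hypothetical patient: two structured values (e.g. age, a diagnosis-code flag).
features = build_features(
    [34, 1],
    ["Pt reports unprotected sex; MSM.",
     "HIV test ordered. Denies methamphetamine use."],
)
print(features)
```

Plain token counting ignores negation ("denies methamphetamine use" still counts as a mention), which is one reason more advanced extraction techniques may be needed in practice.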
Performance varied across the predictive models, with F measures of 0.59 for the baseline model, 0.63 for the baseline + NLP topic model, and 0.74 for the baseline + NLP keyword model. The baseline + NLP keyword model yielded the highest precision by incorporating keywords such as "msm," "unprotected," "hiv," and "methamphetamine," together with structured EHR data indicative of additional HIV risk factors.
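For readers unfamiliar with the metric, the F measure combines precision and recall into a single score. The sketch below computes it from confusion-matrix counts; the abstract does not state which weighting was used, so the balanced F1 (beta = 1) default here is an assumption, and the counts in the usage line are made-up illustrative values, not study data.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, and false negatives.

    beta=1.0 gives the balanced F1 score (an assumed default; the abstract
    does not specify the weighting).
    """
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only (not from the study).
print(round(f_measure(tp=8, fp=2, fn=2), 2))
```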
NLP improved the predictive performance of automated HIV risk assessment by extracting terms in clinical text indicative of high-risk behavior. Future studies should explore more advanced techniques for extracting social and behavioral determinants from clinical text.