Information Technologies & Services Department, Weill Cornell Medicine, New York, New York, USA.
Department of Medicine, Weill Cornell Medicine, New York, New York, USA.
J Am Med Inform Assoc. 2019 Aug 1;26(8-9):722-729. doi: 10.1093/jamia/ocz040.
We aimed to address deficiencies in structured electronic health record (EHR) data for race and ethnicity by identifying black and Hispanic patients from unstructured clinical notes and assessing differences between patients with or without structured race/ethnicity data.
Using EHR notes for 16 665 patients with encounters at a primary care practice, we developed rule-based natural language processing (NLP) algorithms to classify patients as black/Hispanic. We evaluated performance of the method against an annotated gold standard, compared race and ethnicity between NLP-derived and structured EHR data, and compared characteristics of patients identified as black or Hispanic using only NLP vs patients identified as such only in structured EHR data.
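A rule-based classifier of the kind described above can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not list the study's actual rules, so the patterns, the `PATTERNS` table, and the function names `classify_notes` and `nlp_only` are all hypothetical.

```python
import re

# Hypothetical patterns for illustration; the study's real rules are not
# given in the abstract. "black" is matched only next to a person noun so
# that phrases such as "black stool" are not picked up.
PATTERNS = {
    "black": re.compile(
        r"\b(african[- ]american|black (?:male|female|man|woman|patient))\b",
        re.IGNORECASE,
    ),
    "hispanic": re.compile(r"\b(hispanic|latino|latina)\b", re.IGNORECASE),
}


def classify_notes(notes):
    """Return the set of labels whose pattern matches in at least one note."""
    labels = set()
    for text in notes:
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                labels.add(label)
    return labels


def nlp_only(nlp_labels, structured_labels):
    """Labels found in the notes but absent from the structured EHR fields."""
    return set(nlp_labels) - set(structured_labels)
```

A production rule set would presumably also need negation and family-history handling (for example, "mother is Hispanic"), which this sketch omits.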
For the sample of 16 665 patients, NLP identified 948 additional patients as black, a 26% increase, and 665 additional patients as Hispanic, a 20% increase. Compared with the patients identified as black or Hispanic in structured EHR data, patients identified as black or Hispanic via NLP only were older, more likely to be male, less likely to have commercial insurance, and more likely to have higher comorbidity.
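The percentage increases above imply approximate structured-data baselines, which the abstract does not report directly. The back-of-the-envelope sketch below recovers them under the assumption that each percentage is the number of additional patients divided by the structured-data count; because the percentages are rounded, the results are rough estimates, not figures from the study. The function name `implied_baseline` is hypothetical.

```python
def implied_baseline(additional, percent_increase):
    """Estimate the baseline count from an absolute and a relative increase.

    Assumes percent_increase = 100 * additional / baseline, so
    baseline = 100 * additional / percent_increase.
    """
    return additional * 100 / percent_increase


# Rough estimates only; the reported percentages are rounded.
black_baseline = implied_baseline(948, 26)     # roughly 3600 to 3700 patients
hispanic_baseline = implied_baseline(665, 20)  # roughly 3300 patients
```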
Structured EHR data for race and ethnicity are subject to data quality issues. Supplementing structured EHR race data with NLP-derived race and ethnicity may allow researchers to better assess the demographic makeup of populations and draw more accurate conclusions about intergroup differences in health outcomes.
Black or Hispanic patients who are not documented as such in structured EHR race/ethnicity fields differ significantly from those who are. Relatively simple NLP can help address this limitation.