Bear Don't Walk Oliver J, Pichon Adrienne, Reyes Nieva Harry, Sun Tony, Li Jaan, Joseph Josh, Kinberg Sivan, Richter Lauren R, Crusco Salvatore, Kulas Kyle, Ahmed Shaan A, Snyder Daniel, Rahbari Ashkon, Ranard Benjamin L, Juneja Pallavi, Demner-Fushman Dina, Elhadad Noémie
University of Washington, Seattle, Washington, USA.
Columbia University Irving Medical Center, New York, New York, USA.
Sci Data. 2024 Dec 5;11(1):1332. doi: 10.1038/s41597-024-04183-2.
Observational health research often relies on accurate and complete patient race and ethnicity (RE) information for tasks such as characterizing cohorts, assessing quality and performance metrics of hospitals and health systems, and identifying health disparities. Although the electronic health record contains accessible, structured patient-level RE data, these data are often missing, inaccurate, or lacking in granular detail. Natural language processing models can be trained to identify RE in clinical text, which can supplement missing RE data in clinical data repositories. Here we describe the Contextualized Race and Ethnicity Annotations for Clinical Text (C-REACT) Dataset, which comprises 12,000 patients and 17,281 sentences from their clinical notes in the MIMIC-III dataset. Two sets of reference standard RE annotations on these sentences are made available, along with annotation guidelines. The first set comprises highly granular information related to RE, such as preferred language and country of origin, while the second set contains RE labels annotated by physicians. This dataset can support health systems' ability to use RE data to serve health equity goals.