Cook Lily, Espinoza Juan, Weiskopf Nicole G, Mathews Nisha, Dorr David A, Gonzales Kelly L, Wilcox Adam, Madlock-Brown Charisse
Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR, United States.
Department of Pediatrics, Children's Hospital Los Angeles, Los Angeles, CA, United States.
JMIR Med Inform. 2022 Sep 6;10(9):e39235. doi: 10.2196/39235.
The adverse impact of COVID-19 on marginalized and under-resourced communities of color has highlighted the need for accurate, comprehensive race and ethnicity data. However, a significant technical challenge related to integrating race and ethnicity data in large, consolidated databases is the lack of consistency in how data about race and ethnicity are collected and structured by health care organizations.
This study aims to evaluate and describe variations in how health care systems collect and report information about the race and ethnicity of their patients and to assess how well these data are integrated when aggregated into a large clinical database.
At the time of our analysis, the National COVID Cohort Collaborative (N3C) Data Enclave contained records from 6.5 million patients contributed by 56 health care institutions. We quantified the variability in the harmonized race and ethnicity data in the N3C Data Enclave by analyzing their conformance to health care standards for such data. We conducted a descriptive analysis comparing the harmonized data available for research purposes in the database with the original source data contributed by health care institutions. To make the comparison, we tabulated the original source codes, enumerating how many patients had been reported with each encoded value and how many distinct ways each category was reported. The nonconforming data were also cross-tabulated by 3 factors: patient ethnicity, the number of data partners using each code, and the data models that used those particular encodings. For the nonconforming data, we used an inductive approach to sort the source encodings into categories. For example, values such as "Declined" were grouped with "Refused," and "Multiple Race" was grouped with "Two or more races" and "Multiracial."
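The tabulation and grouping steps described above can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the mapping table, the `tabulate` function, the `"Uncategorized"` fallback, and the sample records are all hypothetical, and the sketch assumes one race value per patient. Only the example groupings ("Declined" with "Refused"; "Multiple Race" with "Two or more races" and "Multiracial") come from the text.

```python
from collections import Counter, defaultdict

# Hypothetical lookup from normalized source encodings to inductive categories.
# Only these example groupings are taken from the abstract.
CATEGORY_BY_SOURCE_VALUE = {
    "declined": "Refused",
    "refused": "Refused",
    "multiple race": "Multiracial",
    "two or more races": "Multiracial",
    "multiracial": "Multiracial",
}


def tabulate(records):
    """Tabulate (patient_id, source_value) pairs.

    Returns three mappings:
      - patients per original source value
      - patients per grouped category
      - number of distinct source values seen per category
    Assumes each patient appears once (an assumption of this sketch).
    """
    patients_per_value = Counter()
    variants_per_category = defaultdict(set)

    for _patient_id, source_value in records:
        patients_per_value[source_value] += 1
        key = source_value.strip().lower()
        category = CATEGORY_BY_SOURCE_VALUE.get(key, "Uncategorized")
        variants_per_category[category].add(source_value)

    patients_per_category = Counter()
    for category, variants in variants_per_category.items():
        patients_per_category[category] = sum(patients_per_value[v] for v in variants)

    distinct_values_per_category = {
        category: len(variants) for category, variants in variants_per_category.items()
    }
    return patients_per_value, patients_per_category, distinct_values_per_category


# Made-up sample records, for illustration only.
sample_records = [
    (1, "Declined"),
    (2, "Refused"),
    (3, "Refused"),
    (4, "Multiple Race"),
    (5, "Two or more races"),
    (6, "Multiracial"),
    (7, "Unknown code"),
]

per_value, per_category, distinct_per_category = tabulate(sample_records)
```

With this sample, the "Refused" category covers 3 patients reported in 2 distinct ways. That is the kind of per-category count the comparison between source and harmonized data relies on.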
"No matching concept" was the second largest harmonized concept used by the N3C to describe the race of patients in its database. In addition, 20.7% of the race data did not conform to the standard; the largest category of nonconforming data was missing data. Hispanic or Latino patients were overrepresented in the nonconforming racial data, and data from American Indian or Alaska Native patients were obscured. Although only a small proportion of the source data had not been mapped to the correct concepts (0.6%), Black or African American and Hispanic/Latino patients were overrepresented in this category.
Differences in how race and ethnicity data are conceptualized and encoded by health care institutions can affect the quality of the data in aggregated clinical databases. The impact of data quality issues in the N3C Data Enclave was not equal across all races and ethnicities, which has the potential to introduce bias in analyses and conclusions drawn from these data. Transparency about how data have been transformed can help users make accurate analyses and inferences and eventually better guide clinical care and public policy.