Buchanan Zeruiah V, Hopkins Scarlett E, Boyer Bert B, Fohner Alison E
Department of Epidemiology, University of Washington, Seattle, WA, United States.
Robert Wood Johnson Health Policy Scholars Program, Johns Hopkins University, Baltimore, MD, United States.
Int J Epidemiol. 2025 Feb 16;54(2). doi: 10.1093/ije/dyaf013.
Large biomedical datasets, including electronic health records (EHRs), are a significant source of epidemiologic data. To prepare an EHR for analysis, there are several data-cleaning approaches; here, we focus on data filtering. Common data-filtering methods employ rules that rely on data from socially constructed dominant populations but are inappropriate for marginalized populations, leading to the loss of valuable data and neglect of underrepresented communities. We propose a novel method based on a phenomenological framework that is more equitable and inclusive, leading to culturally responsive research and discoveries.
EHRs from the Yukon-Kuskokwim Health Corporation (YKHC) containing 1 262 035 records from 12 402 unique individuals from 2002 to 2012 were cleaned by using both the proposed phenomenological (individual) and the common (cohort) data-filtering approaches. Within the phenomenological framework, we (i) excluded values that were undeniably biologically impossible for any population, (ii) excluded values that fell more than three standard deviations from each individual's own mean, and (iii) used two imputation methods, applied at the individual level, for stable quantitative and qualitative values when data were missing.
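Steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `filter_individual`, the `(person_id, value)` record layout, the plausibility bounds, and the handling of individuals with fewer than two values are all assumptions made for illustration. Step (iii) is omitted because the abstract does not specify the imputation methods.

```python
from collections import defaultdict
from statistics import mean, stdev


def filter_individual(records, plausible_min, plausible_max, k=3.0):
    """Apply steps (i) and (ii) to (person_id, value) pairs.

    plausible_min and plausible_max are hypothetical bounds for values
    that are biologically possible in any population; they are not
    taken from the paper.
    """
    # Step (i): drop values that are impossible for any population.
    by_person = defaultdict(list)
    for person_id, value in records:
        if plausible_min <= value <= plausible_max:
            by_person[person_id].append(value)

    kept = []
    for person_id, values in by_person.items():
        # Assumption: with fewer than two values a standard deviation
        # cannot be computed, so the values are retained as-is.
        if len(values) < 2:
            kept.extend((person_id, v) for v in values)
            continue

        # Step (ii): compare each value with this person's own mean and
        # sample standard deviation, not with the cohort's.
        person_mean = mean(values)
        person_sd = stdev(values)
        for v in values:
            if person_sd == 0 or abs(v - person_mean) <= k * person_sd:
                kept.append((person_id, v))
    return kept
```

Calling `filter_individual(records, 0, 500)` on a list of `(person_id, value)` pairs returns the retained pairs. Because the mean and standard deviation are computed per person, an individual whose values are consistently far from the cohort average is kept, while a single aberrant reading within that person's own history is dropped. Note that with a sample standard deviation, a value can only exceed three standard deviations once a person has at least 11 observations, so sparse records pass step (ii) unchanged in this sketch.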
Compared with common data-filtering practices, the phenomenological approach retained more observations, more participants, and a wider range of outcomes, allowing a truer representation of the priority population. In sensitivity analyses comparing results from the raw data, the common approach, and the phenomenological approach, we found that the phenomenological approach did not compromise the integrity of the results.
The phenomenological approach to filtering big data presents an opportunity to better advocate for marginalized communities even when using large datasets that require automated rules for data filtering. Our method may empower researchers who are partnering with communities to embrace large datasets without compromising their commitment to community benefit and respect.