Protocol for improving equity in quantitative big data cleaning: lessons from longitudinal analysis of electronic health records from underrepresented and marginalized communities.

Authors

Buchanan Zeruiah V, Hopkins Scarlett E, Boyer Bert B, Fohner Alison E

Affiliations

Department of Epidemiology, University of Washington, Seattle, WA, United States.

Robert Wood Johnson Health Policy Scholars Program, Johns Hopkins University, Baltimore, MD, United States.

Publication information

Int J Epidemiol. 2025 Feb 16;54(2). doi: 10.1093/ije/dyaf013.

DOI:10.1093/ije/dyaf013

PMID:40037558

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12225664/

Abstract

BACKGROUND

Large biomedical datasets, including electronic health records (EHRs), are a significant source of epidemiologic data. Several data-cleaning approaches exist for preparing an EHR for analysis; here, we focus on data filtering. Common data-filtering methods employ rules derived from data on socially constructed dominant populations. These rules are inappropriate for marginalized populations and lead to the loss of valuable data and the neglect of underrepresented communities. We propose a novel method based on a phenomenological framework that is more equitable and inclusive, leading to culturally responsive research and discoveries.

METHODS

EHRs from the Yukon-Kuskokwim Health Corporation (YKHC) containing 1 262 035 records from 12 402 unique individuals from 2002 to 2012 were cleaned by using both the proposed phenomenological (individual) and the common (cohort) data-filtering approaches. Within the phenomenological framework, we (i) excluded values that were undeniably biologically impossible for any population, (ii) excluded values that fell more than three standard deviations from the mean value for each individual person, and (iii) used two forms of imputation for stable quantitative and qualitative values at the individual level when data were missing.
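The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the record layout `(person_id, measure, value)`, and the plausibility limits are hypothetical, and because the abstract does not say which two imputation methods were used, the individual-level median (quantitative) and mode (qualitative) stand in for them here.

```python
from collections import Counter, defaultdict
from statistics import mean, median, stdev

# Hypothetical plausibility limits, for illustration only (not from the paper).
PLAUSIBLE_BOUNDS = {"weight_kg": (0.2, 650.0), "height_cm": (20.0, 280.0)}


def drop_impossible(records, bounds=PLAUSIBLE_BOUNDS):
    """Step (i): keep only values inside limits assumed valid for any person."""
    return [
        (pid, measure, value)
        for pid, measure, value in records
        if bounds[measure][0] <= value <= bounds[measure][1]
    ]


def drop_individual_outliers(records, n_sd=3.0):
    """Step (ii): drop values more than n_sd SDs from that person's own mean."""
    grouped = defaultdict(list)
    for pid, measure, value in records:
        grouped[(pid, measure)].append(value)

    # Per-person, per-measure mean and sample SD (needs at least two values).
    stats = {
        key: (mean(values), stdev(values))
        for key, values in grouped.items()
        if len(values) >= 2
    }

    kept = []
    for pid, measure, value in records:
        if (pid, measure) not in stats:
            kept.append((pid, measure, value))  # SD undefined; keep the value
            continue
        mu, sd = stats[(pid, measure)]
        if abs(value - mu) <= n_sd * sd:
            kept.append((pid, measure, value))
    return kept


def impute_stable(values, kind):
    """Step (iii): fill None from the same individual's observed values.

    Assumed stand-ins: median for "quantitative", mode for "qualitative".
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from; leave gaps as they are
    if kind == "quantitative":
        fill = median(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

One property of this sketch is worth noting: when the sample standard deviation is computed from values that include the candidate outlier, a single value can be at most (n - 1)/sqrt(n) standard deviations from the mean of n values, so the three-SD rule as written here cannot exclude anything for a person with fewer than 11 measurements of a given type. Whether the published protocol handles sparse individuals this way is not stated in the abstract.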

RESULTS

Compared with common data-filtering practices, the phenomenological approach retained more observations, more participants, and a wider range of outcomes, allowing a truer representation of the priority population. In sensitivity analyses comparing results from the raw data, the common approach, and the phenomenological approach, we found that the phenomenological approach did not compromise the integrity of the results.

CONCLUSION

The phenomenological approach to filtering big data presents an opportunity to better advocate for marginalized communities even when using large datasets that require automated rules for data filtering. Our method may empower researchers who are partnering with communities to embrace large datasets without compromising their commitment to community benefit and respect.
