Effective Privacy Protection Strategies for Pregnancy and Gestation Information From Electronic Medical Records: Retrospective Study in a National Health Care Data Network in China.

Affiliations

Digital Health China Technologies Co, Ltd, Beijing, China.

Department of Nephrology, Nanfang Hospital, Southern Medical University, Guangzhou, China.

Publication information

J Med Internet Res. 2024 Aug 20;26:e46455. doi: 10.2196/46455.

DOI:10.2196/46455

PMID:39163593

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11372317/

Abstract

BACKGROUND

Pregnancy and gestation information is routinely recorded in electronic medical record (EMR) systems across China in various data sets. Combining data on the number of pregnancies and gestations can imply the occurrence of abortions and other pregnancy-related issues, which matters both for clinical decision-making and for personal privacy protection. However, where this information appears within an EMR varies because IT structures are inconsistent across EMR systems. No large-scale quantitative evaluation of the potential exposure of this sensitive information has previously been performed, even though Chinese laws and regulations emphasize the protection of personal information as a priority.

OBJECTIVE

This study aims to perform the first nationwide quantitative analysis of the identification sites and exposure frequency of sensitive pregnancy and gestation information. The goal is to propose strategies for effective information extraction and privacy protection related to women's health.

METHODS

This study was conducted in a national health care data network. Rule-based protocols for extracting pregnancy and gestation information were developed by a committee of experts. A total of 6 different sub-data sets of EMRs were used as schemas for data analysis and strategy proposal. The identification sites and frequencies of identification in different sub-data sets were calculated. Manual quality inspections of the extraction process were performed by 2 independent groups of reviewers on 1000 randomly selected records. Based on these statistics, strategies for effective information extraction and privacy protection were proposed.
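The abstract does not publish the committee's extraction rules, so the following is only a minimal sketch of what a rule-based protocol of this kind might look like. It assumes a single hypothetical rule (the obstetric shorthand "G2P1", meaning gravida 2, para 1), and the names `GP_PATTERN`, `extract_gp`, `redact_gp`, and `identification_frequency` are illustrative, not taken from the study. It shows the three steps described above: extraction from free text, masking for privacy protection, and counting how often each sub-data set identifies a patient.

```python
import re

# Hypothetical rule (an assumption, not the study's protocol): match
# shorthand such as "G2P1" or "g3 p1" that is not embedded in a longer token.
GP_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])G\s*(\d{1,2})\s*P\s*(\d{1,2})(?![0-9])",
    re.IGNORECASE,
)


def extract_gp(text):
    """Return every (gravidity, parity) pair found in a free-text field."""
    return [(int(g), int(p)) for g, p in GP_PATTERN.findall(text)]


def redact_gp(text, mask="[REDACTED]"):
    """Replace every matched gravidity/parity expression with a mask."""
    return GP_PATTERN.sub(mask, text)


def identification_frequency(records):
    """Share of all patients identified at least once in each sub-data set.

    records: iterable of (patient_id, sub_data_set, text) tuples.
    """
    patients = set()
    hits = {}
    for patient_id, sub_data_set, text in records:
        patients.add(patient_id)
        if extract_gp(text):
            hits.setdefault(sub_data_set, set()).add(patient_id)
    if not patients:
        return {}
    return {name: len(ids) / len(patients) for name, ids in hits.items()}


# Made-up example records, for illustration only.
records = [
    ("p1", "admission", "History: G2P1."),
    ("p1", "nursing", "no data"),
    ("p2", "nursing", "g3 p1 noted"),
    ("p3", "admission", "none"),
]
frequencies = identification_frequency(records)
```

A real protocol would need many more rules (structured fields, Chinese-language phrasing, marriage and childbearing history sections); the point here is only that the same pattern can drive both extraction and masking, so every site that is found can also be protected.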

RESULTS

The data network covered hospitalized patients from 19 hospitals in 10 provinces of China, encompassing 15,245,055 patients over an 11-year period (January 1, 2010, to December 12, 2020). Among women aged 14-50 years, 70% were randomly selected from each hospital, yielding a total of 1,110,053 patients. Of these, 688,268 female patients with sensitive reproductive information were identified. The frequency of identification varied by site, with the marriage history in admission medical records being the most frequent at 63.24%. Notably, more than 50% of female patients were identified with pregnancy and gestation history in nursing records, which are not generally considered a sub-data set rich in reproductive information. During manual curation and review, 1000 cases were randomly selected, and the precision and recall of the information extraction method both exceeded 99.5%. The privacy-protection strategies were designed with clear technical directions.
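As a worked check on the figures above, the sketch below computes the share of sampled patients with identified information and shows how precision and recall are defined. The patient counts come from the abstract; the confusion counts (`tp`, `fp`, `fn`) are hypothetical values chosen only to illustrate the formulas, since the abstract reports no such breakdown.

```python
def precision(tp, fp):
    """Fraction of extracted items that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true items that were extracted."""
    return tp / (tp + fn)


# Counts reported in the abstract.
sampled_patients = 1_110_053
identified_patients = 688_268
identified_share = identified_patients / sampled_patients  # about 0.62

# Hypothetical confusion counts, NOT reported by the study.
tp, fp, fn = 620, 2, 3
p = precision(tp, fp)
r = recall(tp, fn)
```

With these illustrative counts both metrics land just above 0.995, which shows how few errors a 1000-record review can tolerate before either metric drops below the reported threshold.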

CONCLUSIONS

Significant amounts of critical information related to women's health are recorded in Chinese routine EMR systems and are distributed in various parts of the records with different frequencies. This requires a comprehensive protocol for extracting and protecting the information, which has been demonstrated to be technically feasible. Implementing a data-based strategy will enhance the protection of women's privacy and improve the accessibility of health care services.
