Gong Mengchun, Wang Shuang, Wang Lezi, Liu Chao, Wang Jianyang, Guo Qiang, Zheng Hao, Xie Kang, Wang Chenghong, Hui Zhouguang
Digital China Health Technologies Corporation Limited, Beijing, China.
Shanghai Putuo People's Hospital, Tongji University, Shanghai, China.
JMIR Med Inform. 2020 Feb 5;8(2):e13046. doi: 10.2196/13046.
Protecting patient privacy is a concern worldwide. Many existing studies have demonstrated the potential privacy risks associated with sharing biomedical data. Owing to the increasing need for data sharing and analysis, health care data privacy is drawing more attention. However, to better protect biomedical data privacy, it is essential to first assess the privacy risk.
In China, there is no clear regulation for health systems to deidentify data. It is also not known whether a mechanism such as the Health Insurance Portability and Accountability Act (HIPAA) safe harbor policy will achieve sufficient protection. This pilot study used patient data from Chinese hospitals to understand and quantify the privacy risks of Chinese patients.
We used g-distinct analysis to evaluate the reidentification risks with regard to the HIPAA safe harbor approach when applied to Chinese patients' data. More specifically, we estimated the risks based on the HIPAA safe harbor and limited dataset policies by assuming an attacker has background knowledge of the patient from the public domain.
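The g-distinct measure can be illustrated with a minimal sketch, assuming a record is g-distinct when its combination of quasi-identifiers is shared by at most g records in the dataset. The function name `g_distinct_fraction` and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter

def g_distinct_fraction(records, g):
    """Return the fraction of records whose quasi-identifier combination
    is shared by at most g records (including the record itself)."""
    counts = Counter(records)
    # Each group of size n <= g contributes all n of its members.
    at_risk = sum(n for n in counts.values() if n <= g)
    return at_risk / len(records)

# Hypothetical toy records: (date of birth, gender, surrogate ZIP code).
records = [
    ("1980-01-01", "F", "100001"),
    ("1980-01-01", "F", "100001"),
    ("1975-06-15", "M", "200002"),
    ("1990-12-31", "F", "300003"),
]

# g = 1 corresponds to unique identifiability (1-distinct).
unique_share = g_distinct_fraction(records, 1)
```

In this toy example, two of the four records have a combination that no other record shares, so half of the records are 1-distinct.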
The experiments were conducted on data from 0.83 million patients (with data fields for date of birth, gender, and surrogate ZIP codes generated from home addresses) across 33 provincial-level administrative divisions in China. Under the Limited Dataset policy, 19.58% (163,262/833,235) of the population was uniquely identifiable under the g-distinct metric (ie, 1-distinct). In contrast, the Safe Harbor policy significantly reduced privacy risk: only 0.072% (601/833,235) of individuals were uniquely identifiable, and the majority of the population was 3000-indistinguishable (ie, expected to share common attributes with 3000 or fewer people).
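The gap between the two policies comes from how coarsely the quasi-identifiers are released. The sketch below is a simplified illustration, not the study's pipeline: it assumes the Limited Dataset view keeps the full date of birth and full ZIP code, while the Safe Harbor view keeps only the year of birth and the first three ZIP digits. The real Safe Harbor rules include further conditions (for example, aggregating ages over 89 and suppressing sparsely populated 3-digit ZIP areas) that are omitted here, and the 6-digit codes and all function names are hypothetical.

```python
from collections import Counter

def limited_dataset_view(record):
    """Keep full date of birth, gender, and full ZIP code (simplified)."""
    dob, gender, zip_code = record
    return (dob, gender, zip_code)

def safe_harbor_view(record):
    """Keep only year of birth, gender, and the first 3 ZIP digits (simplified)."""
    dob, gender, zip_code = record
    return (dob[:4], gender, zip_code[:3])

def unique_fraction(records, view):
    """Fraction of records that are unique under the given released view."""
    counts = Counter(view(r) for r in records)
    return sum(1 for n in counts.values() if n == 1) / len(records)

# Hypothetical toy records: (date of birth, gender, surrogate ZIP code).
records = [
    ("1980-01-01", "F", "100001"),
    ("1980-03-09", "F", "100045"),
    ("1980-07-20", "F", "100101"),
    ("1975-06-15", "M", "200002"),
]

limited_risk = unique_fraction(records, limited_dataset_view)
safe_harbor_risk = unique_fraction(records, safe_harbor_view)
```

Under the finer-grained view every toy record is unique, whereas generalizing dates and ZIP codes merges three of them into one group and leaves only one record unique.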
Through experiments based on real-world patient data, this work illustrates that the results of g-distinct analysis of Chinese patient privacy risk are similar to those from a previous US study, in which data from different organizations or regions might be vulnerable to different reidentification risks under different policies. This work provides a reference for Chinese health care entities estimating patients' privacy risk during data sharing and lays a foundation for future privacy risk studies of Chinese patients' data.