Johnson Kipp W, De Freitas Jessica K, Glicksberg Benjamin S, Bobe Jason R, Dudley Joel T
Institute for Next Generation Healthcare, Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai, 770 Lexington Ave 15th Fl., New York, NY 10065, USA
*Authors contributed equally.
Pac Symp Biocomput. 2019;24:415-426.
Anonymized electronic health records (EHR) are often used for biomedical research. One persistent concern with this type of research is the risk of re-identification of patients from their purportedly anonymized data. Here, we use the EHR of 731,850 de-identified patients to demonstrate that the average patient is unique from all others 98.4% of the time simply by examining which laboratory tests have been ordered for them. By the time a patient has visited the hospital on two separate days, they are unique in 72.3% of cases. We further present a computational study to determine how accurately the records from a single day of care can be used to re-identify a patient from a set that also contains 99 other patients. We show that, given a single visit's laboratory orders (even without result values) for a patient, we can re-identify the patient at least 25% of the time. Furthermore, we can place this patient among the top 10 most similar patients 47% of the time. Finally, we present a proof-of-concept technique using a variational autoencoder to encode laboratory results into a lower-dimensional latent space. We demonstrate that releasing latent-space-encoded laboratory orders significantly improves privacy compared to releasing raw laboratory orders (<5% re-identification), while preserving the information contained within the laboratory orders (AUC of >0.9 for recreating encoded values). Our findings have potential consequences for the public release of anonymized laboratory tests to the biomedical research community. We note that our findings do not imply that laboratory tests alone are personally identifiable. In the attack scenario presented here, re-identification would require a threat actor to possess, from the start, an external source of laboratory values that are linked to personal identifiers.
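The two measurements described in the abstract, the share of patients whose laboratory orders are unique and the rank at which a patient is recovered from a candidate pool, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy records, and the use of Jaccard similarity over sets of ordered tests are all assumptions made here for illustration, since the abstract does not state which similarity measure was used.

```python
from collections import Counter


def fraction_unique(orders_by_patient):
    """Fraction of patients whose set of ordered tests matches no other patient's."""
    signatures = {pid: frozenset(tests) for pid, tests in orders_by_patient.items()}
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


def jaccard(a, b):
    """Jaccard similarity of two collections of test names (assumed metric)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_of_true_patient(query_orders, true_id, candidates):
    """Rank (1 = most similar) of true_id among candidates, given one visit's orders."""
    scored = sorted(
        candidates.items(),
        key=lambda item: jaccard(query_orders, item[1]),
        reverse=True,
    )
    ranked_ids = [pid for pid, _ in scored]
    return ranked_ids.index(true_id) + 1


# Hypothetical toy data: patient ID -> set of laboratory tests ever ordered.
records = {
    "p1": {"CBC", "BMP"},
    "p2": {"CBC", "BMP"},
    "p3": {"CBC", "HbA1c", "Lipid"},
    "p4": {"TSH"},
}

unique_share = fraction_unique(records)  # p3 and p4 are unique
rank = rank_of_true_patient({"CBC", "HbA1c"}, "p3", records)
```

In this framing, a re-identification "hit" corresponds to a rank of 1 and a "top 10" result to a rank of at most 10. In the study's setting the candidate pool would hold 100 patients (the true patient plus 99 others) rather than the four shown here.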