Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91058, Erlangen, Germany.
Sci Rep. 2022 Sep 1;12(1):14851. doi: 10.1038/s41598-022-19045-3.
With the rise and ever-increasing potential of deep learning techniques in recent years, publicly available medical datasets have become a key factor in enabling reproducible development of diagnostic algorithms in the medical domain. Medical data contains sensitive patient-related information and is therefore usually anonymized before publication by removing patient identifiers, e.g., patient names. To the best of our knowledge, we are the first to show that a well-trained deep learning system is able to recover the patient identity from chest X-ray data. We demonstrate this using the publicly available large-scale ChestX-ray14 dataset, a collection of 112,120 frontal-view chest X-ray images from 30,805 unique patients. Our verification system is able to identify whether two frontal chest X-ray images are from the same person with an AUC of 0.9940 and a classification accuracy of 95.55%. We further highlight that the proposed system is able to reveal the same person even ten or more years after the initial scan. When pursuing a retrieval approach, we observe an mAP@R of 0.9748 and a precision@1 of 0.9963. Furthermore, we achieve an AUC of up to 0.9870 and a precision@1 of up to 0.9444 when evaluating our trained networks on external datasets such as CheXpert and the COVID-19 Image Data Collection. Given this high identification rate, a potential attacker may leak patient-related information and additionally cross-reference images to obtain more information. Thus, there is a great risk of sensitive content falling into unauthorized hands or being disseminated against the will of the patients concerned. Numerous chest X-ray datasets have been published to advance research, especially during the COVID-19 pandemic. Such data may therefore be vulnerable to attacks by deep learning-based re-identification algorithms.
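To make the retrieval figures above concrete, the following is a minimal sketch of how precision@1 and mAP@R could be computed from image embeddings. It is not the authors' implementation. The function name `retrieval_metrics`, the use of cosine similarity, and the input layout (one embedding and one patient ID per image) are assumptions made for illustration. mAP@R is assumed here to follow its common metric-learning definition: for each query, R is the number of other images of the same patient, and precision is averaged over the correct hits among the R nearest neighbours.

```python
import numpy as np


def retrieval_metrics(embeddings, patient_ids):
    """Return (precision@1, mAP@R), averaged over queries that have a match.

    Hypothetical helper: embeddings is an (N, D) array-like, patient_ids has
    length N. Every image is used once as the query against all others.
    """
    emb = np.asarray(embeddings, dtype=float)
    ids = np.asarray(patient_ids)

    # Cosine similarity between L2-normalised embeddings (an assumption).
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # a query must not retrieve itself

    p_at_1, ap_at_r = [], []
    for q in range(len(ids)):
        same = ids == ids[q]
        same[q] = False
        r = int(same.sum())  # R: number of other images of this patient
        if r == 0:
            continue  # no possible match for this query, so skip it

        ranked = np.argsort(-sim[q])[:r]  # the R nearest neighbours
        hits = same[ranked].astype(float)
        precision_at_i = np.cumsum(hits) / np.arange(1, r + 1)

        p_at_1.append(hits[0])
        # Precision counts only at ranks where the retrieved image is correct.
        ap_at_r.append(float((precision_at_i * hits).sum() / r))

    return float(np.mean(p_at_1)), float(np.mean(ap_at_r))
```

With this sketch, a precision@1 close to 1 means that the nearest neighbour of almost every query image belongs to the same patient, which is the re-identification risk the abstract describes.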