Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA.
Privacy Analytics Inc, Ottawa, Ontario, Canada.
J Am Med Inform Assoc. 2019 Dec 1;26(12):1536-1544. doi: 10.1093/jamia/ocz114.
Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus.
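The HIPS mechanism described above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' system: the surrogate pools, span format, and example note are all invented for illustration. The key behavior is that only *detected* spans are replaced; anything the tagger misses stays in the text as a leak that blends in with the surrogates.

```python
import random

# Hypothetical sketch of HIPS resynthesis: each *detected* PII span is
# replaced by a random surrogate from the same category. PII the tagger
# misses (a "leak") is left verbatim, but a reader cannot distinguish
# real values from fakes, so the leak "hides in plain sight".
SURROGATES = {  # assumed surrogate pools, for illustration only
    "NAME": ["Alex Rivera", "Jordan Lee", "Sam Patel"],
    "PHONE": ["555-014-2231", "555-903-7718"],
}

def hips_resynthesize(text, detected_spans, rng=random.Random(0)):
    """Replace each detected (start, end, category) span with a surrogate.

    `detected_spans` is whatever the machine-learned tagger found;
    spans it missed remain in the output as leaked PII.
    """
    out, cursor = [], 0
    for start, end, category in sorted(detected_spans):
        out.append(text[cursor:start])
        out.append(rng.choice(SURROGATES[category]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

note = "Seen by Dr. Maria Gonzalez on 2016-05-09; call 555-867-5309."
# The tagger detected the name and phone but *missed* the date (a leak).
spans = [(12, 26, "NAME"), (47, 59, "PHONE")]
print(hips_resynthesize(note, spans))
```

In the output, the name and phone are replaced by surrogates while the missed date survives unchanged, which is exactly the leakage the attack in this study tries to expose.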
We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker then obtained the corpus and performed a parrot attack intended to expose leaked PII. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy.
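The attack procedure above can be sketched as follows. This is a toy stand-in, assuming a gazetteer-plus-pattern tagger in place of the machine-learned model, and all strings and patterns are invented. The core logic matches the abstract: because the attacker's tagger mimics the defender's, PII-like content that the attacker's model fails to tag is likely content the defender's model also missed, i.e., a leak.

```python
import re

def train_toy_tagger(annotated_pii):
    """'Train' by memorizing annotated surface forms plus simple patterns
    (a hypothetical stand-in for training a machine-learned PII tagger)."""
    gazetteer = set(annotated_pii)
    patterns = [re.compile(r"\d{4}-\d{2}-\d{2}"),   # assumed date pattern
                re.compile(r"\d{3}-\d{3}-\d{4}")]   # assumed phone pattern
    def tag(candidate):
        return candidate in gazetteer or any(p.fullmatch(candidate) for p in patterns)
    return tag

# Half 1: attacker's manual annotation of all PII-like strings
# (surrogates and leaks look alike, so both get annotated).
training_pii = ["Alex Rivera", "2018-03-14", "555-014-2231"]
tagger = train_toy_tagger(training_pii)

# Half 2: PII-like candidates found by manual review; those the mimic
# tagger MISSES are hypothesized to be leaked real PII.
candidates = ["Jordan Lee", "2019-07-02", "MRN 4471982"]
suspected_leaks = [c for c in candidates if not tagger(c)]
print(suspected_leaks)
```

Note that "Jordan Lee" here is a surrogate the toy tagger happens to miss, so it is wrongly flagged; this is the false-alarm mode quantified in the results below.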
The attacker correctly identified 211 (68%) of the 310 actual PII leaks in the corpus, but also wrongly flagged 191 resynthesized PII instances as leaks. One-third of the actual leaks remained undetected.
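The reported counts imply the following detection rate and, as a derived quantity not stated in the abstract, the attacker's precision:

```python
# Recomputing the attack metrics from the counts reported above.
true_leaks_found = 211      # actual leaks the attacker correctly flagged
total_actual_leaks = 310    # all leaked PII instances in the corpus
false_alarms = 191          # resynthesized surrogates wrongly flagged

detection_rate = true_leaks_found / total_actual_leaks            # recall
precision = true_leaks_found / (true_leaks_found + false_alarms)  # derived
missed = total_actual_leaks - true_leaks_found

print(f"detection rate: {detection_rate:.0%}")  # 68%, as reported
print(f"precision:      {precision:.1%}")       # ~52.5% of flags are real
print(f"missed leaks:   {missed}")              # 99, about one-third
```

The ~52% precision means a manual reviewer following the attacker's flags would inspect nearly one false alarm for every true leak exposed.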
A malicious parrot attack against clinical text deidentified by machine-learned HIPS resynthesis can attenuate, but not eliminate, the protective effect of HIPS deidentification.