The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight.

Author affiliations

Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA.

Privacy Analytics Inc, Ottawa, Ontario, Canada.

Publication information

J Am Med Inform Assoc. 2019 Dec 1;26(12):1536-1544. doi: 10.1093/jamia/ocz114.

Abstract

OBJECTIVE

Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus.
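To make the HIPS idea concrete, the following is a minimal sketch of surrogate resynthesis, not the authors' implementation. The surrogate pools, the span format `(start, end, pii_type)`, and the function name `hips_resynthesize` are all assumptions introduced for illustration.

```python
import random

# Hypothetical surrogate pools; a real system would draw on much larger, realistic lists.
SURROGATES = {
    "NAME": ["Alice Nguyen", "Robert Ortiz", "Maria Chen"],
    "DATE": ["03/14/2011", "11/02/2012", "07/28/2010"],
}

def hips_resynthesize(text, detected_spans, rng):
    """Replace each detected (start, end, pii_type) span with a random surrogate."""
    out, cursor = [], 0
    for start, end, pii_type in sorted(detected_spans):
        out.append(text[cursor:start])                   # keep text before the span
        out.append(rng.choice(SURROGATES[pii_type]))     # swap in a surrogate
        cursor = end
    out.append(text[cursor:])                            # keep the tail unchanged
    return "".join(out)

note = "John Smith was seen on 01/05/2012 with Jane Doe."
# Toy tagger output: it finds the first name and the date but misses "Jane Doe".
spans = [(0, 10, "NAME"), (23, 33, "DATE")]
released = hips_resynthesize(note, spans, random.Random(0))
```

In the released text, the missed name "Jane Doe" sits alongside a surrogate name and a surrogate date, so a reader cannot tell from the text alone which identifiers are real.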

MATERIALS AND METHODS

We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained the corpus and performed a parrot attack intended to expose the leaked PII in it. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy.
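The attack steps described above can be sketched as follows. This is a toy illustration under stated assumptions, not the study's method: the "tagger" here only memorizes the word preceding each annotated identifier, standing in for a real machine-learned model, and all function names and example notes are hypothetical.

```python
def train_context_tagger(annotated_notes):
    """Learn which preceding words signal PII in the attacker's annotated half.

    annotated_notes: list of (tokens, pii_indices) pairs.
    Toy stand-in for a machine-learned PII tagger.
    """
    cues = set()
    for tokens, pii_indices in annotated_notes:
        for i in pii_indices:
            if i > 0:
                cues.add(tokens[i - 1].lower())
    return cues

def tag(tokens, cues):
    """Return the token indices that the mimic tagger labels as PII."""
    return {i for i in range(1, len(tokens)) if tokens[i - 1].lower() in cues}

def candidate_leaks(tokens, pii_like_indices, cues):
    """PII-like tokens (found by manual review) that the mimic tagger did NOT tag."""
    tagged = tag(tokens, cues)
    return [tokens[i] for i in sorted(pii_like_indices) if i not in tagged]

# Attacker's manually annotated half of the released corpus (hypothetical notes).
train = [
    ("Seen by Dr. Ortiz today".split(), {3}),
    ("Patient Chen reports pain".split(), {1}),
]
cues = train_context_tagger(train)

# Remaining half: "Nguyen" follows a learned cue; "Alvarez" appears in an unusual context.
held_out = "Dr. Nguyen spoke with neighbor Alvarez".split()
suspects = candidate_leaks(held_out, {1, 5}, cues)
```

The reasoning is that a tagger trained on the released notes tends to find what the defender's tagger found, so identifiers it misses are the ones the attacker suspects were also missed, and therefore leaked, by the defender.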

RESULTS

The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected.
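The reported counts can be turned into rates with a few lines of arithmetic. The detection rate below matches the 68% stated above; the "precision" figure is a derived quantity under the assumption that the 211 correct and 191 incorrect hypotheses together make up all of the attacker's leak hypotheses, and it is not necessarily the accuracy measure used in the paper.

```python
true_leaks = 310       # actual PII leaks in the corpus
detected_leaks = 211   # leaks the attacker correctly flagged
false_alarms = 191     # resynthesized surrogates wrongly flagged as leaks

# Share of actual leaks that the attacker found.
detection_rate = detected_leaks / true_leaks

# Share of the attacker's flagged items that were actual leaks (assumed definition).
precision = detected_leaks / (detected_leaks + false_alarms)

# Leaks that remained hidden among the surrogates.
missed = true_leaks - detected_leaks

print(f"detection rate: {detection_rate:.1%}")
print(f"precision:      {precision:.1%}")
print(f"missed leaks:   {missed}")
```

Under that assumption, roughly half of the items the attacker flagged were surrogates rather than real identifiers.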

DISCUSSION AND CONCLUSION

A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification.
