DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization

Author information

Morris John X, Campion Thomas R, Nutheti Sri Laasya, Peng Yifan, Raj Akhil, Zabih Ramin, Cole Curtis L

Affiliation

Cornell Tech, New York, NY.

Publication information

AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:355-364. eCollection 2025.

PMID: 40502277

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12150728/

Abstract

Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets, on which tools achieve near-perfect accuracy and which may not reflect the full variability or complexity of real-world clinical text. Annotating such datasets is also resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach that uses a large language model (LLM) to reidentify the patient corresponding to a redacted clinical note, and we evaluated its performance with a novel De-Identification/Re-Identification (DIRI) method. We demonstrate our method on medical data from Weill Cornell Medicine anonymized with three deidentification tools: the rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes. Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.
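The evaluation idea in the abstract (redact a note, let an adversary guess the patient, count how often the guess is right) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract's adversary is an LLM, whereas the `overlap_adversary` below is a simple token-overlap stand-in, `mask_names` is a toy deidentifier rather than Philter, BiLSTM-CRF, or ClinicalBERT, and the setup in which the adversary picks among candidate patient profiles is an assumption. All names, notes, and profiles are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

TOKEN = re.compile(r"[a-z0-9]+")


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(TOKEN.findall(text.lower()))


def mask_names(note: str, names: List[str]) -> str:
    """Toy deidentifier (assumption): mask only the listed patient names."""
    for name in names:
        note = re.sub(re.escape(name), "[MASK]", note, flags=re.IGNORECASE)
    return note


def overlap_adversary(redacted_note: str, profiles: Dict[str, str]) -> str:
    """Toy adversary (stand-in for an LLM): choose the candidate profile
    that shares the most tokens with the redacted note."""
    note_tokens = tokens(redacted_note)
    return max(
        sorted(profiles),  # sorted so that ties resolve deterministically
        key=lambda pid: len(note_tokens & tokens(profiles[pid])),
    )


def reidentification_rate(
    notes: List[Tuple[str, str]],
    profiles: Dict[str, str],
    deidentify: Callable[[str], str],
    adversary: Callable[[str, Dict[str, str]], str],
) -> float:
    """Fraction of redacted notes the adversary links to the true patient."""
    hits = sum(adversary(deidentify(note), profiles) == pid for pid, note in notes)
    return hits / len(notes)


# Hypothetical data: direct identifiers (names) plus quasi-identifiers.
profiles = {
    "p1": "Jane Doe, 54, teacher, type 2 diabetes, lives in Queens",
    "p2": "John Roe, 71, retired pilot, atrial fibrillation, lives in Albany",
}
notes = [
    ("p1", "Jane Doe is a teacher from Queens with type 2 diabetes."),
    ("p2", "John Roe, a retired pilot, presents with atrial fibrillation."),
]
names = ["Jane Doe", "John Roe"]

redacted_first = mask_names(notes[0][1], names)
rate = reidentification_rate(
    notes, profiles, lambda note: mask_names(note, names), overlap_adversary
)
print(redacted_first)
print(f"reidentification rate: {rate:.0%}")
```

In this toy data every name is masked, yet both notes are still linked to the right profile through occupation, location, and diagnosis. That is the kind of residual leakage a reidentification-based evaluation is meant to surface, and swapping in a stronger deidentifier or adversary only requires passing different callables.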

