DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization.

Authors

Morris John X, Campion Thomas R, Nutheti Sri Laasya, Peng Yifan, Raj Akhil, Zabih Ramin, Cole Curtis L

Affiliation

Cornell Tech, New York, NY.

Publication information

AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:355-364. eCollection 2025.

Abstract

Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets, on which tools achieve near-perfect accuracy and which may not reflect the full variability or complexity of real-world clinical text. Annotating such datasets is also resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach that uses a large language model (LLM) to reidentify the patient corresponding to a redacted clinical note, and we evaluated its performance with a novel De-Identification/Re-Identification (DIRI) method. We demonstrate our method on medical data from Weill Cornell Medicine anonymized with three deidentification tools: the rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified personally identifiable information (PII), our tool still reidentified 9% of clinical notes. Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.
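
As a rough illustration of how a reidentification rate such as the 9% figure above could be computed, the following Python sketch matches each redacted note against a set of candidate patient profiles and counts how often the top-scoring candidate is the true patient. This is not the authors' implementation: the scorer is a toy token-overlap function standing in for the LLM, and every name and data value here (`token_overlap`, `reidentification_rate`, `profiles`, `notes`) is a hypothetical assumption made for the example.

```python
from typing import Callable, Dict


def token_overlap(note: str, profile: str) -> float:
    """Toy stand-in for an LLM scorer: Jaccard overlap of lowercase tokens."""
    a, b = set(note.lower().split()), set(profile.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reidentification_rate(
    redacted_notes: Dict[str, str],
    profiles: Dict[str, str],
    score: Callable[[str, str], float] = token_overlap,
) -> float:
    """Fraction of redacted notes whose top-scoring profile is the true patient.

    redacted_notes maps the true patient ID to that patient's redacted note;
    profiles maps every candidate patient ID to known information about them.
    """
    if not redacted_notes:
        return 0.0
    hits = 0
    for true_id, note in redacted_notes.items():
        # The adversary guesses the candidate whose profile best matches the note.
        guess = max(profiles, key=lambda pid: score(note, profiles[pid]))
        hits += guess == true_id
    return hits / len(redacted_notes)


# Hypothetical data: names and dates are masked, but other details survive.
profiles = {
    "p1": "65 year old retired teacher with diabetes living in Queens",
    "p2": "30 year old marathon runner with asthma living in Brooklyn",
}
notes = {
    "p1": "[NAME] is a retired teacher with diabetes seen on [DATE]",
    "p2": "[NAME] is a retired patient reporting knee pain on [DATE]",
}

rate = reidentification_rate(notes, profiles)
print(f"Reidentification rate: {rate:.0%}")
```

In this toy data the first note is matched to the right patient through details that masking left behind (occupation and diagnosis), while the second is not. A lower rate indicates that less identifying signal remains after deidentification, which is the sense in which such a score could guide iterative improvement of a deidentification tool.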


