Morris John X, Campion Thomas R, Nutheti Sri Laasya, Peng Yifan, Raj Akhil, Zabih Ramin, Cole Curtis L
Cornell Tech, New York, NY.
AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:355-364. eCollection 2025.
Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets, on which tools achieve near-perfect accuracy and which may not reflect the full variability or complexity of real-world clinical text. Annotating such datasets is also resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach that uses a large language model (LLM) to reidentify the patient corresponding to a redacted clinical note, and we evaluated its performance with a novel De-Identification/Re-Identification (DIRI) method. We demonstrate our method on medical data from Weill Cornell Medicine anonymized with three deidentification tools: the rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes. Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.