Text de-identification for privacy protection: a study of its impact on clinical text information content.

Authors

Meystre Stéphane M, Ferrández Óscar, Friedlin F Jeffrey, South Brett R, Shen Shuying, Samore Matthew H

Affiliations

Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States; VA Health Care System, Salt Lake City, UT, United States.

Nuance Communications Inc., Burlington, MA, United States.

Publication information

J Biomed Inform. 2014 Aug;50:142-50. doi: 10.1016/j.jbi.2014.01.011. Epub 2014 Feb 3.

DOI:10.1016/j.jbi.2014.01.011

PMID:24502938

Abstract

As more electronic clinical information becomes accessible for secondary uses such as clinical research, approaches that enable faster and more collaborative research while protecting patient privacy and confidentiality are becoming more important. Clinical text de-identification offers such advantages but is typically a tedious manual process. Automated Natural Language Processing (NLP) methods can ease this process, but their impact on subsequent uses of the automatically de-identified clinical narratives has barely been investigated. In the context of a larger project to develop and investigate automated text de-identification for Veterans Health Administration (VHA) clinical notes, we studied the impact of automated text de-identification on clinical information in a stepwise manner. Our approach started with a high-level assessment of clinical note informativeness and formatting, and ended with a detailed study of the overlap of select clinical information types and Protected Health Information (PHI). To investigate the informativeness (i.e., document type information, select clinical data types, and interpretation or conclusion) of VHA clinical notes, we used five different existing text de-identification systems. The informativeness was only minimally altered by these systems, while formatting was modified by only one system. To examine the impact of de-identification on clinical information extraction, we compared counts of SNOMED-CT concepts found by an open source information extraction application in the original (i.e., not de-identified) version of a corpus of VHA clinical notes, and in the same corpus after de-identification. Only about 1.2-3% fewer SNOMED-CT concepts were found in de-identified versions of our corpus, and many of these concepts were PHI that was erroneously identified as clinical information.
To study this impact in more detail and assess how generalizable our findings were, we examined the overlap between select clinical information annotated in the 2010 i2b2 NLP challenge corpus and automatic PHI annotations from our best-of-breed VHA clinical text de-identification system (nicknamed 'BoB'). Overall, only 0.81% of the clinical information exactly overlapped with PHI, and 1.78% partly overlapped. We conclude that the impact of automated text de-identification on clinical information is small, but not negligible, and that improved disambiguation of clinical acronyms and eponyms could significantly reduce this impact.
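
The exact and partial overlap figures reported above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the two categories precisely, so the sketch assumes annotations are half-open character spans, that "exact" means identical offsets, and that "partial" means any other intersection. The names `classify_overlap` and `overlap_rates` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: an annotation is a (start, end) character span, end exclusive.
Span = Tuple[int, int]


def classify_overlap(clinical: Span, phi_spans: List[Span]) -> str:
    """Return 'exact', 'partial', or 'none' for one clinical annotation."""
    result = "none"
    for phi in phi_spans:
        if phi == clinical:
            # Identical offsets: the whole clinical concept was tagged as PHI.
            return "exact"
        # Two half-open spans intersect when each starts before the other ends.
        if phi[0] < clinical[1] and clinical[0] < phi[1]:
            result = "partial"
    return result


def overlap_rates(clinical_spans: List[Span], phi_spans: List[Span]) -> Dict[str, float]:
    """Percentage of clinical annotations in each overlap category."""
    counts = {"exact": 0, "partial": 0, "none": 0}
    for span in clinical_spans:
        counts[classify_overlap(span, phi_spans)] += 1
    total = len(clinical_spans)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}


# Made-up offsets for illustration only; they are not data from the study.
clinical = [(0, 5), (10, 20), (30, 40), (50, 60)]
phi = [(0, 5), (15, 25)]
print(overlap_rates(clinical, phi))
```

Under these assumptions, a clinical concept that a de-identification system tags in full as PHI counts as an exact overlap, while a concept sharing only some characters with a PHI span counts as a partial overlap.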
