

Automatic de-identification of textual documents in the electronic health record: a review of recent research.

Author affiliation

Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

Publication information

BMC Med Res Methodol. 2010 Aug 2;10:70. doi: 10.1186/1471-2288-10-70.

Abstract

BACKGROUND

In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Institutional Review Board to use data for research purposes, but these requirements can be waived if the data are de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often performed manually and requires significant resources. Aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here.

METHODS

This review focuses on recently published research (after 1995) and includes relevant publications retrieved through bibliographic queries in PubMed, conference proceedings, and the ACM Digital Library, as well as relevant publications referenced in papers already included.

RESULTS

The literature search returned more than 200 publications. The majority were excluded because they focused only on de-identification of structured data rather than narrative text, focused on image de-identification, or described manual de-identification. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems targeted only one or two specific clinical document types and were based mainly on two groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries.
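To make the pattern matching and dictionary approach concrete, the following is a minimal Python sketch, not a reconstruction of any reviewed system. The name dictionary, the regular expressions, the placeholder tags, and the function name `deidentify` are all assumptions chosen for illustration, and they cover only a few of the 18 PHI types.

```python
import re

# Hypothetical, tiny surname dictionary; real systems rely on much larger lists.
NAME_DICTIONARY = {"smith", "johnson", "garcia"}

# Illustrative patterns for a few PHI types (assumed formats, not exhaustive).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def deidentify(text):
    """Replace dictionary names and pattern-matched PHI with placeholder tags."""

    def replace_name(match):
        word = match.group(0)
        # Dictionary lookup is case-insensitive; unknown words are left alone.
        return "[NAME]" if word.lower() in NAME_DICTIONARY else word

    # Dictionary pass first, so placeholder tags are never looked up as words.
    text = re.sub(r"[A-Za-z]+", replace_name, text)

    # Pattern pass: each regular expression handles one PHI type.
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Mr. Smith seen on 03/14/2009, call 801-555-1234."
print(deidentify(note))
```

A name that is absent from the dictionary (for example a surname such as "Okafor") passes through this sketch unchanged, which illustrates why dictionary-only methods are hard to generalize.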

CONCLUSIONS

In general, methods based on dictionaries perform better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.
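One common way to quantify "sufficient performance" and "over-scrubbing" is token-level recall and precision: low recall means PHI was missed, and low precision means non-PHI text was removed. The sketch below assumes PHI annotations are given as sets of token indices; the function name `precision_recall` and the example data are hypothetical and are not taken from the reviewed systems.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over sets of token indices marked as PHI."""
    true_positives = len(gold & predicted)
    # Precision: share of removed tokens that really were PHI.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true PHI tokens that were removed.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: the system flags four tokens, two of them correctly.
gold = {1, 5, 7}
predicted = {1, 5, 9, 10}
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up example, tokens 9 and 10 are over-scrubbed (removed although they are not PHI), and token 7 is PHI that was missed.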


