Morrison Frances P, Sengupta Soumitra, Hripcsak George
Columbia University Department of Biomedical Informatics.
AMIA Annu Symp Proc. 2009 Nov 14;2009:447-51.
Effective de-identification methods are needed to support the reuse of electronic health record data for research and other purposes. We investigated running two text-processing systems in tandem as a strategy for de-identifying clinical notes. We ran 100 outpatient notes through deid.pl, from MIT's PhysioToolkit, followed by MedLEE, and manually compared the output with the original notes to determine how much protected health information (PHI) was retained. Pipelining resulted in an overall error rate of 2%, with 2 personal names retained in the output: one initial and one name that is also a common English term used in medicine. All retained PHI was transformed into standardized medical concepts, making re-identification less likely. Pipelining with deid.pl improved MedLEE's performance in excluding PHI from output and may be a useful strategy for de-identifying clinical data while providing computer-readable output.
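The pipelining idea can be illustrated with a minimal Python sketch. It does not call deid.pl or MedLEE: `scrub_names` and `extract_concepts` are hypothetical stand-ins for a scrubbing pass and a concept-extraction pass, and the name list, lexicon, and mask token are invented for illustration. The sketch only shows how a name that doubles as a medical term can survive concept extraction alone but be removed when a scrubber runs first.

```python
import re
from functools import partial
from typing import Callable, Iterable

Stage = Callable[[str], str]

def scrub_names(text: str, known_names: Iterable[str]) -> str:
    """Hypothetical stand-in for a de-identification pass: mask listed names."""
    for name in known_names:
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "[**NAME**]", text, flags=re.IGNORECASE)
    return text

def extract_concepts(text: str, lexicon: dict[str, str]) -> str:
    """Hypothetical stand-in for a concept extractor: keep only lexicon terms."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return " ".join(lexicon[t] for t in tokens if t in lexicon)

def run_pipeline(text: str, stages: list[Stage]) -> str:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        text = stage(text)
    return text

def retained_phi(output: str, phi_terms: Iterable[str]) -> list[str]:
    """List the PHI terms that still appear as whole words in the output."""
    return [
        term for term in phi_terms
        if re.search(rf"\b{re.escape(term)}\b", output, flags=re.IGNORECASE)
    ]

# Illustrative data (assumed, not from the study): a surname that is also a finding.
note = "Ms. Rash reports cough."
names = ["Rash"]
lexicon = {"rash": "finding:rash", "cough": "finding:cough"}

extract = partial(extract_concepts, lexicon=lexicon)
scrub = partial(scrub_names, known_names=names)

extractor_only = run_pipeline(note, [extract])
pipelined = run_pipeline(note, [scrub, extract])

print(retained_phi(extractor_only, names))
print(retained_phi(pipelined, names))
```

In this toy case the extractor alone maps the surname to a concept, while the two-stage pipeline drops it before extraction. A real evaluation would compare output against manually annotated notes, as the study did.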