Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.
J Am Med Inform Assoc. 2010 Mar-Apr;17(2):159-68. doi: 10.1136/jamia.2009.002212.
De-identified medical records are critical to biomedical research. Text de-identification software exists, and some tools include "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness of de-identification software and to examine the possible bias that resynthesis introduces into its measured performance.
We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files: laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records.
We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule.
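The four measures named above can be sketched as follows. This is a minimal illustration, not the study's scoring code: it assumes scoring is done per token with a binary identifier/non-identifier label, and the function name `phi_detection_metrics` is hypothetical. The abstract does not state the scoring unit the authors used.

```python
from typing import Dict, Sequence


def phi_detection_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Compute precision, recall, F-measure and accuracy for binary labels.

    Assumption: one label per token, True meaning "this token is part of a
    protected health identifier". The unit of scoring is illustrative only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(gold) if len(gold) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Toy input: 3 identifier tokens out of 10; one is missed, one token is falsely flagged.
gold = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]
metrics = phi_detection_metrics(gold, predicted)
```

In this toy input, accuracy is 0.8 while precision, recall and F-measure are each 2/3. Because accuracy also counts correctly ignored non-identifier tokens, it tends to stay high when identifiers are rare, which is one reason accuracy and F-measure can diverge.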
The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Training and testing on the real records yielded 0.990 accuracy and 0.960 F-measure. Results improved when the tool was trained and tested on resynthesized records (0.998 accuracy, 0.980 F-measure) but deteriorated moderately when it was trained on real records and tested on resynthesized records (0.989 accuracy, 0.862 F-measure). Results declined substantially when the tool was trained on resynthesized records and tested on real records (0.942 accuracy, 0.728 F-measure).
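The four train/test conditions form a two-by-two design, and the cost of a mismatch is easiest to see by holding the test set fixed. The sketch below tabulates only the figures reported above; the names `RESULTS` and `f_measure_drop` are illustrative and not part of the study.

```python
# Reported results, keyed by (training set, test set).
RESULTS = {
    ("real", "real"): {"accuracy": 0.990, "f_measure": 0.960},
    ("resynthesized", "resynthesized"): {"accuracy": 0.998, "f_measure": 0.980},
    ("real", "resynthesized"): {"accuracy": 0.989, "f_measure": 0.862},
    ("resynthesized", "real"): {"accuracy": 0.942, "f_measure": 0.728},
}


def f_measure_drop(test_set: str) -> float:
    """F-measure lost on a given test set when the training set does not match it."""
    other = "resynthesized" if test_set == "real" else "real"
    matched = RESULTS[(test_set, test_set)]["f_measure"]
    mismatched = RESULTS[(other, test_set)]["f_measure"]
    return round(matched - mismatched, 3)


drop_on_real = f_measure_drop("real")                    # 0.960 - 0.728
drop_on_resynthesized = f_measure_drop("resynthesized")  # 0.980 - 0.862
```

On these figures the F-measure falls by 0.232 when real records are scored by a model trained on resynthesized records, against 0.118 in the opposite direction, while accuracy moves far less in both cases.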
The de-identification tool achieves high accuracy when training and test sets are homogeneous (ie, both real or resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance particularly when training on resynthesized data and testing on real data.
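To make the regularizing effect concrete, the toy sketch below replaces annotated identifier spans with fixed surrogate values. It is a deliberately simplified assumption and not the toolkit's resynthesis algorithm: the surrogate table, the function name `resynthesize` and the sample sentence are all invented for illustration.

```python
from typing import List, Tuple

# Hypothetical surrogate values, one per identifier type (illustrative only).
SURROGATE_BY_TYPE = {"NAME": "John Smith", "DATE": "01/01/2000"}


def resynthesize(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, phi_type) span with a surrogate of that type.

    Assumption: spans are non-overlapping character offsets into `text`.
    """
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        pieces.append(text[cursor:start])          # untouched text before the span
        pieces.append(SURROGATE_BY_TYPE[phi_type])  # synthetic replacement
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented sample sentence with an irregular name and an informal date format.
text = "Seen by Dr. O'Neil-Vargas on March 4th, 2009."
name, date = "O'Neil-Vargas", "March 4th, 2009"
spans = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(date), text.index(date) + len(date), "DATE"),
]
resynthesized = resynthesize(text, spans)
```

Here the hyphenated, apostrophe-bearing name and the informal date both collapse into a single tidy pattern ("Seen by Dr. John Smith on 01/01/2000."). A model trained only on such regular surrogates may never see the irregular forms that occur in real records, which is consistent with the loss of performance described above.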