Heider Paul M, Obeid Jihad S, Meystre Stéphane M
Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC.
AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:241-250. eCollection 2020.
A growing quantity of health data is being stored in Electronic Health Records (EHR). The free-text sections of the clinical notes in these records contain patient and treatment information that is important for research, but they also contain Personally Identifiable Information (PII), which cannot be freely shared within the research community without compromising patient confidentiality and privacy rights. Significant work has gone into automated approaches to text de-identification, the process of removing or redacting PII. However, few studies have examined the performance of existing de-identification pipelines in a controlled comparative analysis. In this study, we use publicly available corpora to analyze speed and accuracy differences between three de-identification systems that can be run off-the-shelf: Amazon Comprehend Medical PHId, Clinacuity's CliniDeID, and the National Library of Medicine's Scrubber. No single system dominated across all the compared metrics. NLM Scrubber was the fastest, while CliniDeID generally had the highest accuracy.