Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Clinical and Translational Research Building 2004 Mowry Road, PO Box 100177, Gainesville, Florida, USA.
BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):232. doi: 10.1186/s12911-019-0935-4.
De-identification is a critical technology for facilitating the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for the de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often use training and test data collected from the same institution, and few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.
We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies for customizing the deep learning models using UF notes and resources.
Word embeddings pre-trained on a general English corpus achieved better performance than embeddings trained on de-identified clinical text or biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features further improved de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.
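The strict and relaxed F1 scores above reflect two span-matching criteria commonly used in de-identification evaluation: strict requires exact offsets and entity type, while relaxed accepts any overlap with a gold span of the same type. A minimal sketch of that distinction, assuming a simple `(start, end, type)` span format (this is an illustration, not the paper's exact scorer):

```python
# Toy strict vs. relaxed F1 evaluation over PHI spans.
# Spans are (start, end, type) tuples; matching rules are illustrative assumptions.

def f1(tp, n_pred, n_gold):
    # Standard precision/recall/F1 from true-positive counts.
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(gold, pred):
    # Strict: predicted span must match gold offsets and type exactly.
    strict_tp = sum(1 for s in pred if s in gold)
    # Relaxed: predicted span must overlap a gold span of the same type.
    relax_tp = sum(
        1 for (ps, pe, pt) in pred
        if any(pt == gt and ps < ge and gs < pe for (gs, ge, gt) in gold)
    )
    return f1(strict_tp, len(pred), len(gold)), f1(relax_tp, len(pred), len(gold))

gold = [(0, 5, "NAME"), (10, 14, "DATE")]
pred = [(0, 5, "NAME"), (9, 14, "DATE")]   # second span is off by one character
strict, relaxed = evaluate(gold, pred)      # strict = 0.5, relaxed = 1.0
```

The off-by-one DATE prediction counts under the relaxed criterion but not the strict one, which is why relaxed scores in the abstract are consistently higher than strict scores.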
It is necessary to customize de-identification models using local clinical text and other resources when applying them in cross-institute settings. Fine-tuning is a potential solution that reuses pre-trained parameters and reduces the training time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
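The fine-tuning strategy described above amounts to initializing a model with parameters learned on the source-institution corpus and then continuing training on a small local sample, rather than training from scratch. A deliberately tiny sketch of that idea, using a one-parameter logistic model and synthetic data (the model, data, and hyperparameters are illustrative assumptions, not the paper's network):

```python
# Toy illustration of fine-tuning: reuse parameters trained on a "source"
# corpus, then continue training on a small "target" sample.
import math
import random

def train(w, data, epochs, lr=0.1):
    # Plain stochastic gradient descent on log-loss for a 1-feature
    # logistic model p(y=1|x) = sigmoid(w * x).
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-w * x))
            w += lr * (y - p) * x
    return w

random.seed(0)
# Large labeled "source-institution" set (stands in for the i2b2 corpus).
source = [(x, 1 if x > 0 else 0) for x in (random.uniform(-2, 2) for _ in range(200))]
# Small labeled "local" sample (stands in for a handful of UF notes).
target = [(x, 1 if x > 0 else 0) for x in (random.uniform(-2, 2) for _ in range(20))]

w_pretrained = train(0.0, source, epochs=5)          # train on source corpus
w_finetuned = train(w_pretrained, target, epochs=2)  # continue on local sample
```

The key point is the second call: it starts from `w_pretrained` instead of zero, so only a few passes over the small local sample are needed, mirroring the reduced training time the conclusion attributes to fine-tuning.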