Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Clinical and Translational Research Building 2004 Mowry Road, PO Box 100177, Gainesville, Florida, USA.
BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):232. doi: 10.1186/s12911-019-0935-4.
De-identification is a critical technology for facilitating the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for the de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often use training and test data collected from the same institution, and few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.
We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies for customizing the deep learning models using UF notes and resources.
Word embeddings pre-trained on a general English corpus achieved better performance than embeddings trained on de-identified clinical text or biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features further improved de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.
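The strict and relaxed F1 scores above reflect two span-matching criteria commonly used in de-identification evaluation: strict requires exact offsets and entity type, while relaxed accepts any overlap with a gold span of the same type. A minimal sketch of that distinction, assuming a simple `(start, end, type)` span format (this is an illustration, not the paper's exact scorer):

```python
# Toy strict vs. relaxed F1 evaluation over PHI spans.
# Spans are (start, end, type) tuples; matching rules are illustrative assumptions.

def f1(tp, n_pred, n_gold):
    # Standard precision/recall/F1 from true-positive counts.
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(gold, pred):
    # Strict: predicted span must match gold offsets and type exactly.
    strict_tp = sum(1 for s in pred if s in gold)
    # Relaxed: predicted span must overlap a gold span of the same type.
    relax_tp = sum(
        1 for (ps, pe, pt) in pred
        if any(pt == gt and ps < ge and gs < pe for (gs, ge, gt) in gold)
    )
    return f1(strict_tp, len(pred), len(gold)), f1(relax_tp, len(pred), len(gold))

gold = [(0, 5, "NAME"), (10, 14, "DATE")]
pred = [(0, 5, "NAME"), (9, 14, "DATE")]   # second span is off by one character
strict, relaxed = evaluate(gold, pred)      # strict = 0.5, relaxed = 1.0
```

The off-by-one DATE prediction counts under the relaxed criterion but not the strict one, which is why relaxed scores in the abstract are consistently higher than strict scores.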
It is necessary to customize de-identification models using local clinical text and other resources when applying them in cross-institute settings. Fine-tuning is a potential solution that reuses pre-trained parameters and reduces the training time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
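The fine-tuning strategy described above amounts to initializing a model with parameters learned on the source-institution corpus and then continuing training on a small local sample, rather than training from scratch. A deliberately tiny sketch of that idea, using a one-parameter logistic model and synthetic data (the model, data, and hyperparameters are illustrative assumptions, not the paper's network):

```python
# Toy illustration of fine-tuning: reuse parameters trained on a "source"
# corpus, then continue training on a small "target" sample.
import math
import random

def train(w, data, epochs, lr=0.1):
    # Plain stochastic gradient descent on log-loss for a 1-feature
    # logistic model p(y=1|x) = sigmoid(w * x).
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-w * x))
            w += lr * (y - p) * x
    return w

random.seed(0)
# Large labeled "source-institution" set (stands in for the i2b2 corpus).
source = [(x, 1 if x > 0 else 0) for x in (random.uniform(-2, 2) for _ in range(200))]
# Small labeled "local" sample (stands in for a handful of UF notes).
target = [(x, 1 if x > 0 else 0) for x in (random.uniform(-2, 2) for _ in range(20))]

w_pretrained = train(0.0, source, epochs=5)          # train on source corpus
w_finetuned = train(w_pretrained, target, epochs=2)  # continue on local sample
```

The key point is the second call: it starts from `w_pretrained` instead of zero, so only a few passes over the small local sample are needed, mirroring the reduced training time the conclusion attributes to fine-tuning.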