De-identification of clinical notes with pseudo-labeling using regular expression rules and pre-trained BERT.

Author information

An Jiyong, Kim Jiyun, Sunwoo Leonard, Baek Hyunyoung, Yoo Sooyoung, Lee Seunggeun

Affiliations

Graduate School of Data Science, Seoul National University, Seoul, South Korea.

Department of Radiology, Seoul National University Bundang Hospital, Seongnam, South Korea.

Publication information

BMC Med Inform Decis Mak. 2025 Feb 17;25(1):82. doi: 10.1186/s12911-025-02913-z.

DOI:10.1186/s12911-025-02913-z

PMID:39962485

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11831849/

Abstract

BACKGROUND

De-identification of clinical notes is essential for using the rich information in unstructured text data in medical research. However, only limited work has been done on removing personal information from clinical notes in Korea.

METHODS

Our study used a comprehensive dataset stored in the Note table of the OMOP Common Data Model at Seoul National University Bundang Hospital. The dataset includes 11,181,617 radiology reports and 9,282,477 notes from various other departments (non-radiology reports). From this, 0.1% of the reports (11,182) were randomly selected for training and validation. We used two de-identification strategies to improve performance with limited annotated data. First, a rule-based approach was used to construct regular expressions from the 1,112 notes annotated by domain experts. Second, using the regular expressions as a labeler, we applied a semi-supervised approach to fine-tune a pre-trained Korean BERT model on pseudo-labeled notes.
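
The idea of using regular expressions as a labeler can be illustrated with a minimal sketch. The patterns, label names, and whitespace tokenization below are assumptions made for illustration; the abstract does not list the study's actual rules or tokenizer. Each token that overlaps a regex match receives a BIO tag, and the resulting (token, tag) pairs are the kind of pseudo-labels that could then be used to fine-tune a token classification model.

```python
import re

# Hypothetical patterns for illustration only; not the rules used in the study.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}[-./]\d{1,2}[-./]\d{1,2}\b"),
    "PHONE": re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b"),
}


def pseudo_label(text):
    """Return (token, tag) pairs with BIO tags derived from regex matches."""
    # Collect the character spans matched by any pattern.
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))

    labeled = []
    # Whitespace tokenization is a simplification of real subword tokenization.
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged if it overlaps a matched span.
            if tok.start() < end and tok.end() > start:
                prefix = "B" if tok.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        labeled.append((tok.group(), tag))
    return labeled


if __name__ == "__main__":
    for token, tag in pseudo_label("CT on 2021-03-05 call 031-787-1234"):
        print(token, tag)
```

In a pipeline of this kind, unlabeled notes would be passed through `pseudo_label` and the tagged output used as training data, so the model can learn from context beyond the literal patterns.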

RESULTS

Validation was conducted on 342 radiology and 12 non-radiology notes labeled at the token level. Our rule-based approach achieved 97.2% precision, 93.7% recall, and 96.2% F1 score on radiology department notes. For the machine learning approach, KoBERT-NER fine-tuned on 32,000 automatically pseudo-labeled notes achieved 96.5% precision, 97.6% recall, and 97.1% F1 score.
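
For readers unfamiliar with token-level scoring, the sketch below shows one common way to compute precision, recall, and F1 from aligned gold and predicted tag sequences. It is an assumption about the general technique, not the study's scoring script; the abstract does not state how the scores were aggregated, so results computed this way need not match the reported figures. The function name `token_prf` and the convention that a mislabeled token counts as both a false positive and a false negative are choices made here for illustration.

```python
def token_prf(gold, pred, outside="O"):
    """Compute token-level precision, recall, and F1 for aligned tag lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # True positive: predicted an identifier tag that matches the gold tag.
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and g == p)
    # False positive: predicted an identifier tag that differs from gold.
    fp = sum(1 for g, p in zip(gold, pred) if p != outside and g != p)
    # False negative: gold identifier tag that was missed or mislabeled.
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and g != p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["O", "B-DATE", "O", "B-PHONE", "O"]
    pred = ["O", "B-DATE", "B-DATE", "O", "O"]
    print(token_prf(gold, pred))
```

In de-identification, recall is usually the more safety-critical number, since a missed token leaves personal information in the text.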

CONCLUSION

Our results show that combining a rule-based approach with machine learning in a semi-supervised way can improve de-identification performance.
