An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

Authors

Wang Peng, Li Yong, Yang Liang, Li Simin, Li Linfeng, Zhao Zehan, Long Shaopei, Wang Fei, Wang Hongqian, Li Ying, Wang Chengliang

Affiliations

College of Computer Science, Chongqing University, Chongqing, China.

School of Computer Science, South China Normal University, Guangzhou, China.

Publication information

JMIR Med Inform. 2022 Aug 30;10(8):e38154. doi: 10.2196/38154.

DOI:10.2196/38154
PMID:36040774
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9472063/
Abstract

BACKGROUND

With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct usage of this data may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Existing rule-based, machine learning-based, or deep learning-based methods have been proposed to solve this problem. However, these methods still face the difficulties of insufficient Chinese electronic health record data and the complex features of the Chinese language.
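To make the named entity recognition framing concrete, the following is a minimal sketch of how per-character BIO tags could be turned into a deidentified string. The label names (`NAME`, `DATE`), the placeholder format, and the example sentence are illustrative assumptions, not the paper's annotation scheme or data.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-character BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag that matches the open entity simply extends it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # A B- tag opens a new entity; stray I- tags and O are ignored.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

def redact(text: str, tags: List[str]) -> str:
    """Replace every tagged mention with a bracketed type placeholder."""
    out, cursor = [], 0
    for start, end, label in bio_to_spans(tags):
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Made-up example sentence with hypothetical NAME and DATE labels.
example_text = "患者张三于2020年入院"
example_tags = ["O", "O", "B-NAME", "I-NAME", "O",
                "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "O", "O"]
print(redact(example_text, example_tags))
```

In this framing, the quality of deidentification depends entirely on the tagger that produces the BIO sequence; the redaction step itself is mechanical.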

OBJECTIVE

This paper proposes a method that mitigates overfitting and the lack of training data in deep neural networks, enabling the deidentification of protected health information in Chinese electronic health records.

METHODS

We propose a new model that combines TinyBERT (a compact variant of BERT, bidirectional encoder representations from transformers) as the text feature extraction module with a conditional random field as the prediction module for deidentifying protected health information in Chinese electronic health records. In addition, we propose a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy to compensate for the scarcity of Chinese electronic health record data.
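The abstract does not spell out how the mention-replacement strategy works, so the following is only a generic sketch of the idea: each annotated mention is swapped for a randomly chosen surrogate of the same type, and the span offsets are recomputed so the augmented sentence stays correctly labeled. The function name `replace_mentions`, the surrogate lexicon, and the label names are hypothetical; the sentence generation strategy is not covered here.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)

def replace_mentions(
    text: str,
    spans: List[Span],
    lexicon: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, List[Span]]:
    """Swap each labeled mention for a same-type surrogate.

    Returns the augmented text and spans re-aligned to it.
    Assumes the input spans do not overlap.
    """
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0   # position in the original text
    length = 0   # length of the augmented text built so far
    for start, end, label in sorted(spans):
        candidates = lexicon.get(label)
        # Keep the original mention when no surrogate of that type exists.
        surrogate = rng.choice(candidates) if candidates else text[start:end]
        prefix = text[cursor:start]
        pieces.append(prefix)
        length += len(prefix)
        pieces.append(surrogate)
        new_spans.append((length, length + len(surrogate), label))
        length += len(surrogate)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans

# Hypothetical surrogate lexicon and made-up sentence.
demo_lexicon = {"NAME": ["欧阳明"], "DATE": ["2019年3月"]}
demo_text = "患者张三于2020年入院"
demo_spans = [(2, 4, "NAME"), (5, 10, "DATE")]
aug_text, aug_spans = replace_mentions(
    demo_text, demo_spans, demo_lexicon, random.Random(0)
)
print(aug_text, aug_spans)
```

Because surrogates can differ in length from the mentions they replace, recomputing the offsets is the step that keeps the augmented example usable as training data for a sequence tagger.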

RESULTS

We compare our method with 5 baseline methods that utilize different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method had better performance (microprecision: 98.7%, microrecall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods.
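For readers unfamiliar with micro-averaged metrics, here is a small sketch of how they can be computed by pooling entity counts across all documents and types. Exact-span, exact-label matching is an assumption made for illustration; the abstract does not state the matching criterion. The last lines check that the reported F1 score is consistent with the reported precision and recall via F1 = 2PR / (P + R).

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, int, str]  # (doc_id, start, end_exclusive, label)

def micro_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over pooled entities.

    An entity counts as correct only if document, span, and label all match
    (an assumed criterion, chosen here for simplicity).
    """
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Toy, made-up annotations: 2 of 4 predictions match 2 of 3 gold entities.
toy_gold = [(0, 2, 4, "NAME"), (0, 5, 10, "DATE"), (1, 0, 3, "NAME")]
toy_pred = [(0, 2, 4, "NAME"), (0, 5, 9, "DATE"), (1, 0, 3, "NAME"), (1, 7, 9, "ID")]
print(micro_prf(toy_gold, toy_pred))

# Consistency check on the figures reported in the abstract.
reported_f1 = round(f1_from(0.987, 0.9913) * 100, 2)
print(reported_f1)
```

The harmonic mean of the reported precision (98.7%) and recall (99.13%) rounds to 98.91%, which agrees with the micro-F1 score given above.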

CONCLUSIONS

On our augmented data set, the proposed model retained TinyBERT's efficiency advantage over the baseline methods while improving performance on the task of Chinese protected health information deidentification.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/035d/9472063/4813cc54430b/medinform_v10i8e38154_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/035d/9472063/e89b0f66ac85/medinform_v10i8e38154_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/035d/9472063/b45f2635295a/medinform_v10i8e38154_fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/035d/9472063/239ea160d0a2/medinform_v10i8e38154_fig4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/035d/9472063/391b511e0066/medinform_v10i8e38154_fig5.jpg


