De-identification of clinical notes via recurrent neural network and conditional random field.

Affiliation

Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.

Publication information

J Biomed Inform. 2017 Nov;75S:S34-S42. doi: 10.1016/j.jbi.2017.05.023. Epub 2017 Jun 1.

DOI:10.1016/j.jbi.2017.05.023

PMID:28579533

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5705329/

Abstract

De-identification, that is, detecting and removing identifying information such as protected health information (PHI) from clinical data, is a critical step in enabling data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track (track 1) on de-identifying electronic medical records (EMRs). The challenge organizers provide 1000 annotated mental health records for this track, 600 of which are used as a training set and 400 as a test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of the recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on a conditional random field (CRF), and a rule-based subsystem. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria, respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria, respectively, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of the proposed method.
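The three-stage flow described in the abstract (predict with several subsystems, combine the machine learning outputs, then merge in the rule-based output) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the ensemble classifier, so simple majority voting is used here as a stand-in, and letting accepted machine learning spans take priority over overlapping rule-based spans is likewise an assumption. The names `Span`, `vote`, `merge_with_rules` and `min_votes`, as well as the offsets and labels in the example, are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    """One predicted PHI instance (hypothetical representation)."""
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # PHI category, e.g. "DATE"


def vote(predictions: list[set[Span]], min_votes: int = 2) -> set[Span]:
    """Keep spans proposed by at least `min_votes` subsystems.

    Assumption: majority voting stands in for the learned ensemble
    classifier, whose details the abstract does not give.
    """
    counts = Counter(span for spans in predictions for span in spans)
    return {span for span, n in counts.items() if n >= min_votes}


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one position."""
    return a.start < b.end and b.start < a.end


def merge_with_rules(ml_spans: set[Span], rule_spans: set[Span]) -> list[Span]:
    """Add rule-based spans that do not overlap an already accepted span.

    Assumption: on conflict, the span accepted earlier wins.
    """
    merged = set(ml_spans)
    for span in sorted(rule_spans):
        if not any(overlaps(span, kept) for kept in merged):
            merged.add(span)
    return sorted(merged)


# Made-up outputs of the three machine learning subsystems and the rules.
bilstm = {Span(0, 10, "DATE"), Span(15, 25, "DOCTOR")}
bilstm_feat = {Span(0, 10, "DATE"), Span(15, 25, "DOCTOR")}
crf = {Span(0, 10, "DATE"), Span(30, 38, "HOSPITAL")}
rules = {Span(40, 52, "PHONE"), Span(5, 10, "DATE")}

final = merge_with_rules(vote([bilstm, bilstm_feat, crf]), rules)
for span in final:
    print(span)
```

In this example the `HOSPITAL` span is dropped because only one subsystem proposed it, and the rule-based `DATE` span at offsets 5 to 10 is dropped because it overlaps a span the vote already accepted. A learned combiner could weigh subsystems differently per PHI category, which plain voting cannot do.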

Similar articles

Automatic de-identification of electronic medical records using token-level and character-level conditional random fields.

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S47-S52. doi: 10.1016/j.jbi.2015.06.009. Epub 2015 Jun 26.

Entity recognition from clinical texts via recurrent neural network.

BMC Med Inform Decis Mak. 2017 Jul 5;17(Suppl 2):67. doi: 10.1186/s12911-017-0468-7.

De-identification of Clinical Text via Bi-LSTM-CRF with Neural Language Models.

AMIA Annu Symp Proc. 2020 Mar 4;2019:857-863. eCollection 2019.

De-identification of medical records using conditional random fields and long short-term memory networks.

J Biomed Inform. 2017 Nov;75S:S43-S53. doi: 10.1016/j.jbi.2017.10.003. Epub 2017 Oct 13.

The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge.

J Biomed Inform. 2017 Nov;75S:S54-S61. doi: 10.1016/j.jbi.2017.05.001. Epub 2017 May 3.

De-identification of clinical free text using natural language processing: A systematic review of current approaches.

Artif Intell Med. 2024 May;151:102845. doi: 10.1016/j.artmed.2024.102845. Epub 2024 Mar 20.

De-identifying free text of Japanese electronic health records.

J Biomed Semantics. 2020 Sep 21;11(1):11. doi: 10.1186/s13326-020-00227-9.

Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research.

J Biomed Inform. 2014 Aug;50:173-183. doi: 10.1016/j.jbi.2014.01.014. Epub 2014 Feb 17.

Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1.

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S11-S19. doi: 10.1016/j.jbi.2015.06.007. Epub 2015 Jul 28.

Cited by

A taxonomy for advancing systematic error analysis in multi-site electronic health record-based clinical concept extraction.

J Am Med Inform Assoc. 2024 Jun 20;31(7):1493-1502. doi: 10.1093/jamia/ocae101.

Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse.

Methods Inf Med. 2024 May;63(1-02):21-34. doi: 10.1055/s-0044-1778693. Epub 2024 Mar 5.

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models.

BMC Med Inform Decis Mak. 2024 Feb 16;24(1):54. doi: 10.1186/s12911-024-02422-5.

De-identifying Norwegian Clinical Text using Resources from Swedish and Danish.

AMIA Annu Symp Proc. 2024 Jan 11;2023:456-464. eCollection 2023.

OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study.

J Med Internet Res. 2023 Dec 6;25:e48145. doi: 10.2196/48145.

Man vs the machine in the struggle for effective text anonymisation in the age of large language models.

Sci Rep. 2023 Sep 25;13(1):16026. doi: 10.1038/s41598-023-42977-3.

A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records.

Comput Struct Biotechnol J. 2023 Aug 22;22:32-40. doi: 10.1016/j.csbj.2023.08.018. eCollection 2023.

Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study.

Interact J Med Res. 2023 Aug 25;12:e46322. doi: 10.2196/46322.

Clinical concept and relation extraction using prompt-based machine reading comprehension.

J Am Med Inform Assoc. 2023 Aug 18;30(9):1486-1493. doi: 10.1093/jamia/ocad107.

Enhanced neurologic concept recognition using a named entity recognition model based on transformers.

Front Digit Health. 2022 Dec 8;4:1065581. doi: 10.3389/fdgth.2022.1065581. eCollection 2022.
