De-identification of clinical notes via recurrent neural network and conditional random field.

Affiliation

Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.

Publication information

J Biomed Inform. 2017 Nov;75S:S34-S42. doi: 10.1016/j.jbi.2017.05.023. Epub 2017 Jun 1.

Abstract

De-identification, the task of identifying sensitive information in data, such as protected health information (PHI) in clinical data, is a critical step in enabling data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track (track 1) on de-identifying electronic medical records (EMRs). The challenge organizers provide 1000 annotated mental health records for this track, 600 of which are used as the training set and 400 as the test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of the recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random field (CRF), and a rule-based subsystem. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria respectively, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of our proposed method.
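The abstract reports micro F1-scores under three criteria without defining them. The sketch below illustrates one common reading, which is an assumption here and not taken from the abstract: "strict" counts a prediction as correct only when span boundaries and PHI type both match, "token" compares type-labelled tokens, and "binary token" compares tokens as PHI versus non-PHI, ignoring type. The `Span` type, the helper names, and the toy annotations are hypothetical; this is not the challenge's official scorer.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A PHI annotation over token indices (hypothetical representation)."""
    doc_id: str
    start: int   # index of first token
    end: int     # index one past the last token
    label: str   # PHI type, e.g. "NAME" or "DATE"


def micro_f1(gold: Set[Tuple], pred: Set[Tuple]) -> float:
    """Micro F1: counts are pooled over all documents and PHI types."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_tokens(spans: Iterable[Span], keep_label: bool) -> Set[Tuple]:
    """Expand spans into per-token items; drop the PHI type if asked."""
    return {
        (s.doc_id, i, s.label if keep_label else "PHI")
        for s in spans
        for i in range(s.start, s.end)
    }


def strict_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    # Assumed "strict" criterion: exact span boundaries and exact type.
    return micro_f1(set(gold), set(pred))


def token_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    # Assumed "token" criterion: each token must carry the right type.
    return micro_f1(to_tokens(gold, True), to_tokens(pred, True))


def binary_token_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    # Assumed "binary token" criterion: PHI vs. non-PHI only.
    return micro_f1(to_tokens(gold, False), to_tokens(pred, False))


# Toy data, invented for illustration only.
gold = [Span("d1", 0, 2, "NAME"), Span("d1", 5, 6, "DATE")]
pred = [
    Span("d1", 0, 2, "NAME"),   # exact match
    Span("d1", 5, 6, "AGE"),    # right tokens, wrong type
    Span("d1", 8, 9, "CITY"),   # spurious prediction
]

if __name__ == "__main__":
    print(f"strict:       {strict_f1(gold, pred):.4f}")
    print(f"token:        {token_f1(gold, pred):.4f}")
    print(f"binary token: {binary_token_f1(gold, pred):.4f}")
```

Under these assumed definitions, a prediction with the right tokens but the wrong PHI type is penalized by the strict and token criteria yet still counts under the binary token criterion, which is consistent with the binary token scores being the highest of the three in the reported results.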
