Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.
J Biomed Inform. 2017 Nov;75S:S34-S42. doi: 10.1016/j.jbi.2017.05.023. Epub 2017 Jun 1.
De-identification, that is, detecting identifying information in data, such as protected health information (PHI) present in clinical data, is a critical step in enabling data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track on de-identifying electronic medical records (EMRs) (i.e., track 1). The challenge organizers provide 1000 annotated mental health records for this track, 600 of which are used as a training set and 400 as a test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of the recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random fields (CRF), and a rule-based subsystem. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria respectively, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of our proposed method.
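The combine-then-merge flow described above can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the abstract: PHI instances are represented as (start, end, type) character spans, a simple majority vote stands in for the learned ensemble classifier, rule-based spans are added only when they do not overlap an ensemble span, and the score is an exact-match micro F1 over pooled spans. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

# Assumed representation: (start offset, end offset, PHI type).
Span = Tuple[int, int, str]


def vote(predictions: List[Set[Span]], min_votes: int = 2) -> Set[Span]:
    """Keep spans predicted by at least `min_votes` subsystems.

    Majority voting is a stand-in here; the abstract describes a learned
    ensemble classifier, whose details are not given.
    """
    counts = Counter(span for spans in predictions for span in spans)
    return {span for span, n in counts.items() if n >= min_votes}


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open character ranges share at least one offset."""
    return a[0] < b[1] and b[0] < a[1]


def merge_rules(ensemble: Set[Span], rules: Set[Span]) -> Set[Span]:
    """Add rule-based spans that do not overlap any span already kept.

    Assumed merge policy: ensemble output takes priority on conflicts.
    """
    merged = set(ensemble)
    for span in sorted(rules):
        if not any(overlaps(span, kept) for kept in merged):
            merged.add(span)
    return merged


def micro_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1 over pooled spans (boundaries and type must agree)."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy predictions from the three machine learning-based subsystems.
lstm = {(0, 8, "NAME"), (20, 30, "DATE")}
lstm_feat = {(0, 8, "NAME"), (40, 45, "AGE")}
crf = {(0, 8, "NAME"), (20, 30, "DATE"), (50, 55, "CITY")}
rules = {(20, 30, "DATE"), (22, 28, "DATE"), (60, 72, "PHONE")}

ensemble = vote([lstm, lstm_feat, crf])
final = merge_rules(ensemble, rules)
gold = {(0, 8, "NAME"), (20, 30, "DATE"), (40, 45, "AGE"), (60, 72, "PHONE")}
score = micro_f1(gold, final)
```

In this toy case the vote keeps the NAME and DATE spans, the rule-based PHONE span is added because it overlaps nothing, and the nested rule-based DATE span is dropped. A learned combiner would replace `vote` without changing the rest of the flow.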