Xiong Ying, Chen Shuai, Chen Qingcai, Yan Jun, Tang Buzhou
Harbin Institute of Technology, Shenzhen, China.
Peng Cheng Laboratory, Shenzhen, China.
JMIR Med Inform. 2020 Dec 29;8(12):e23357. doi: 10.2196/23357.
The widespread adoption of electronic health records (EHRs) has improved the quality of health care. However, EHRs have also introduced problems, such as the growing use of copy-and-paste and templates, which results in EHRs with low-quality content. To minimize data redundancy across documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity between clinical text snippets.
In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results.
We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules that encode clinical text snippet pairs at different levels: (1) a character-level representation module based on a convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) a sentence-level representation module that adopts the pretrained language model bidirectional encoder representations from transformers (BERT) to encode clinical text snippet pairs; and (3) an entity-level representation module to model clinical entity information in clinical text snippets. For entity-level representation, we compare 2 methods: one encodes entities using the entity-type label sequence corresponding to the text snippet (called entity I), whereas the other encodes entities using their representations in MeSH, a knowledge graph in the medical domain (called entity II).
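The fusion of the 3 representation levels can be sketched as follows. This is not the authors' implementation: the vector dimensions, the random stand-in encoders, and the single linear regression head are all illustrative assumptions; in the actual model the vectors would come from a character-level CNN, a BERT encoder, and an entity encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_pair(char_dim=64, bert_dim=768, entity_dim=32):
    """Stand-ins for the three encoders described in the abstract.

    Random vectors merely illustrate the shapes being fused; the
    dimensions are assumptions, not values from the paper.
    """
    char_repr = rng.standard_normal(char_dim)      # character-level CNN output
    bert_repr = rng.standard_normal(bert_dim)      # BERT sentence-pair embedding
    entity_repr = rng.standard_normal(entity_dim)  # entity-type / MeSH embedding
    return np.concatenate([char_repr, bert_repr, entity_repr])

def similarity_score(pair_repr, w, b=0.0):
    """A hypothetical linear head mapping the fused vector to one score."""
    return float(pair_repr @ w + b)

fused = encode_pair()
w = rng.standard_normal(fused.shape[0]) / fused.shape[0]
print(similarity_score(fused, w))
```

The point of the sketch is only that the three module outputs are combined into a single vector before a similarity score is produced; the real combination and regression layers may differ.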
We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge to evaluate model performance. The model using only BERT to encode text snippet pairs achieved a Pearson correlation coefficient (PCC) of 0.848. When character-level representation and entity-level representation were added individually, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation were added, the PCC further increased to 0.861 (entity I) and 0.868 (entity II).
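The evaluation metric, the PCC between predicted and gold similarity scores, can be computed directly from its definition. The scores below are made-up illustration values, not challenge data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [0.0, 1.5, 2.5, 4.0, 5.0]  # hypothetical gold similarity scores
pred = [0.2, 1.4, 2.8, 3.5, 4.9]  # hypothetical system predictions
print(round(pearson(gold, pred), 3))
```

A PCC of 1.0 indicates perfectly linearly correlated predictions; values such as 0.848 to 0.868 reported above indicate a strong, but imperfect, linear relationship with the gold scores.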
Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.