Yang Xi, He Xing, Zhang Hansi, Ma Yinghan, Bian Jiang, Wu Yonghui
Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, United States.
JMIR Med Inform. 2020 Nov 23;8(11):e19735. doi: 10.2196/19735.
Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS.
This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS.
In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models.
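One of the ensemble strategies described above, combining the continuous similarity scores produced by several fine-tuned transformer models, can be sketched by simple score averaging. This is a minimal illustration, not the authors' actual pipeline; the model names and score values are hypothetical, and STS scores are assumed to lie on the task's 0-5 similarity scale.

```python
import numpy as np

# Hypothetical similarity predictions (0-5 scale) from three fine-tuned
# transformer models for the same three sentence pairs. These values are
# illustrative only, not outputs from the study's systems.
predictions = {
    "bert":    np.array([4.1, 2.0, 0.5]),
    "xlnet":   np.array([4.3, 1.8, 0.7]),
    "roberta": np.array([4.2, 2.2, 0.6]),
}

# Unweighted mean across models: one basic way to ensemble regressors.
ensemble = np.mean(list(predictions.values()), axis=0)
print(ensemble)  # element-wise mean score per sentence pair
```

More elaborate ensembles (e.g., weighting models by validation performance, or training a meta-regressor on the individual scores) follow the same shape: each base model contributes a score vector, and the combiner maps those vectors to a final prediction.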
Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010).
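The Pearson correlation figures quoted above measure the linear agreement between a system's predicted similarity scores and the human-annotated gold scores. A minimal pure-Python computation of the metric, with illustrative (not actual) gold and predicted scores on the 0-5 STS scale:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical gold annotations and system predictions (0-5 scale).
gold = [5.0, 3.5, 1.0, 0.0]
pred = [4.8, 3.0, 1.5, 0.2]
r = pearson(gold, pred)
print(r)  # close to 1.0 when predictions track the gold scores
```

A coefficient of 1.0 indicates perfect linear agreement, so the gap between the challenge's best system (0.9010) and the post-challenge RoBERTa model (0.9065) reflects a small but real improvement in how closely predictions track human judgments.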
This study demonstrated the effectiveness of transformer-based models for measuring semantic similarity in clinical text. Our models can support clinical applications such as clinical text deduplication and summarization.