Adapting Bidirectional Encoder Representations from Transformers (BERT) to Assess Clinical Semantic Textual Similarity: Algorithm Development and Validation Study.

Author Information

Kades Klaus, Sellner Jan, Koehler Gregor, Full Peter M, Lai T Y Emmy, Kleesiek Jens, Maier-Hein Klaus H

Affiliations

German Cancer Research Center (DKFZ), Heidelberg, Germany.

Partner Site Heidelberg, German Cancer Consortium (DKTK), Heidelberg, Germany.

Publication Information

JMIR Med Inform. 2021 Feb 3;9(2):e22795. doi: 10.2196/22795.

Abstract

BACKGROUND

Natural Language Understanding enables automatic extraction of relevant information from clinical text data, which are acquired every day in hospitals. In 2018, the language model Bidirectional Encoder Representations from Transformers (BERT) was introduced, generating new state-of-the-art results on several downstream tasks. The National NLP Clinical Challenges (n2c2) is an initiative that strives to tackle such downstream tasks on domain-specific clinical data. In this paper, we present the results of our participation in the 2019 n2c2 and related work completed thereafter.

OBJECTIVE

The objective of this study was to optimally leverage BERT for the task of assessing the semantic textual similarity of clinical text data.

METHODS

We used BERT as an initial baseline and analyzed its results. Starting from this baseline, we developed 3 different approaches: (1) we added additional, handcrafted sentence similarity features to the classifier token of BERT and combined the results with more features in multiple regression estimators; (2) we incorporated a built-in ensembling method, M-Heads, into BERT by duplicating the regression head and applying an adapted training strategy so that the heads focus on different input patterns of the medical sentences (see the sketch after this paragraph); and (3) we developed a graph-based similarity approach for medications, which allows similarities to be extrapolated across known entities from the training set. The approaches were evaluated with the Pearson correlation coefficient between the predicted scores and the ground truth on the official training and test datasets.
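
For illustration, the M-Heads approach (2) can be sketched in a few lines of PyTorch. This is a minimal sketch under our own assumptions, not the authors' implementation: the class name MHeadsBertRegressor and the default of 5 heads are invented, and the adapted per-head training strategy is omitted.

```python
# Minimal sketch of the M-Heads idea (not the authors' code): one BERT
# encoder whose [CLS] representation feeds several duplicated regression
# heads. Class and parameter names are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MHeadsBertRegressor(nn.Module):  # hypothetical name
    def __init__(self, model_name="bert-base-uncased", num_heads=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Duplicated regression head: one linear layer per head.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(num_heads)]
        )

    def forward(self, **inputs):
        # The [CLS] (classifier) token summarizes the sentence pair.
        cls = self.encoder(**inputs).last_hidden_state[:, 0]
        scores = torch.cat([head(cls) for head in self.heads], dim=1)
        # Average the heads at inference; the paper's adapted training
        # strategy for specializing the heads is omitted here.
        return scores.mean(dim=1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MHeadsBertRegressor()
batch = tokenizer("The patient was given aspirin.",
                  "Aspirin was administered to the patient.",
                  return_tensors="pt")
similarity = model(**batch)  # one similarity score per sentence pair
```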

RESULTS

We improved the performance of BERT on the test dataset from a Pearson correlation coefficient of 0.859 to 0.883 using a combination of the M-Heads method and the graph-based similarity approach. We also show differences between the training and test datasets and how the two datasets influenced the results.
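
For reference, the Pearson correlation coefficient reported above is straightforward to compute with scipy; the scores below are placeholders, assuming the 0-5 similarity scale commonly used in clinical STS tasks.

```python
# Illustrative only: Pearson correlation between predictions and gold
# scores, as used to evaluate the approaches (values are placeholders).
from scipy.stats import pearsonr

y_true = [0.0, 1.5, 3.2, 4.0, 5.0]  # placeholder ground-truth scores
y_pred = [0.3, 1.2, 3.5, 3.8, 4.6]  # placeholder model predictions
r, p_value = pearsonr(y_true, y_pred)
print(f"Pearson r = {r:.3f}")
```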

CONCLUSIONS

We found that using a graph-based similarity approach has the potential to extrapolate domain-specific knowledge to unseen sentences. We observed that deceptive results can easily be obtained on the test dataset, especially when the distribution of the data samples differs between the training and test datasets.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db9b/7889424/4e1c0139d89f/medinform_v9i2e22795_fig1.jpg
