Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models.

Authors

Yang Xi, He Xing, Zhang Hansi, Ma Yinghan, Bian Jiang, Wu Yonghui

Affiliation

Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, United States.

Publication

JMIR Med Inform. 2020 Nov 23;8(11):e19735. doi: 10.2196/19735.

DOI: 10.2196/19735
PMID: 33226350
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7721552/

Abstract

BACKGROUND

Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized and curated in the general English domain; however, such resources are limited in the biomedical domain. In 2019, the National NLP Clinical Challenges (n2c2) challenge developed a comprehensive clinical STS dataset and organized a community effort to solicit state-of-the-art solutions for clinical STS.

OBJECTIVE

This study presents our transformer-based clinical STS models developed during this challenge as well as new models we explored after the challenge. This project is part of the 2019 n2c2/Open Health NLP shared task on clinical STS.

METHODS

In this study, we explored 3 transformer-based models for clinical STS: Bidirectional Encoder Representations from Transformers (BERT), XLNet, and Robustly optimized BERT approach (RoBERTa). We examined transformer models pretrained using both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine different transformer models.
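
The abstract does not say which ensemble methods were tried. As one simple illustration of combining several models, the sketch below averages the similarity scores that each model predicts for the same sentence pairs. The function name, the example scores, and the 0 to 5 score range are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean


def average_ensemble(predictions_per_model, low=0.0, high=5.0):
    """Average per-pair similarity scores across models, then clip to [low, high].

    predictions_per_model: list of score lists, one list per model, all
    covering the same sentence pairs in the same order.
    The [0, 5] default range is an assumption for this sketch.
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_pairs = len(predictions_per_model[0])
    if any(len(scores) != n_pairs for scores in predictions_per_model):
        raise ValueError("all models must score the same sentence pairs")
    # zip(*...) regroups the scores so each tuple holds one pair's scores.
    averaged = [mean(pair_scores) for pair_scores in zip(*predictions_per_model)]
    return [min(high, max(low, score)) for score in averaged]


# Hypothetical scores from three fine-tuned models for four sentence pairs.
bert_scores = [4.0, 1.0, 3.0, 0.0]
xlnet_scores = [5.0, 2.0, 3.0, 1.0]
roberta_scores = [4.5, 1.5, 3.0, 0.5]

ensemble_scores = average_ensemble([bert_scores, xlnet_scores, roberta_scores])
print(ensemble_scores)
```

Plain averaging treats every model equally. A weighted average or a small regressor trained on the model outputs are other common options, and either may differ from what the authors used.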

RESULTS

Our best submission based on the XLNet model achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we further explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010).
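
The results above are reported as Pearson correlations between predicted and gold-standard similarity scores. The following minimal sketch shows how that metric is computed. The function name and the example scores are hypothetical, and this is not the authors' evaluation script; a library routine such as `scipy.stats.pearsonr` computes the same quantity.

```python
import math


def pearson_correlation(gold, predicted):
    """Return the Pearson correlation coefficient between two score lists."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_gold = sum(gold) / len(gold)
    mean_pred = sum(predicted) / len(predicted)
    # Sum of products of deviations (unnormalized covariance).
    covariance = sum(
        (g - mean_gold) * (p - mean_pred) for g, p in zip(gold, predicted)
    )
    # Sums of squared deviations (unnormalized variances).
    var_gold = sum((g - mean_gold) ** 2 for g in gold)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_gold == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_gold * var_pred)


# Hypothetical gold annotations and model predictions for five sentence pairs.
gold_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted_predictions = [0.5, 1.5, 2.5, 3.5, 4.5]   # constant offset from gold
reversed_predictions = [4.0, 3.0, 2.0, 1.0, 0.0]  # opposite ordering

r_shifted = pearson_correlation(gold_scores, shifted_predictions)
r_reversed = pearson_correlation(gold_scores, reversed_predictions)
print(r_shifted, r_reversed)
```

Because the coefficient measures linear association only, predictions that differ from the gold scores by a constant offset still correlate perfectly, while a reversed ranking gives a correlation of -1.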

CONCLUSIONS

This study demonstrated the efficiency of utilizing transformer-based models to measure semantic similarity for clinical text. Our models can be applied to clinical applications such as clinical text deduplication and summarization.
