Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning.

Authors

Mahajan Diwakar, Poddar Ananya, Liang Jennifer J, Lin Yen-Ting, Prager John M, Suryanarayanan Parthasarathy, Raghavan Preethi, Tsou Ching-Huei

Affiliations

IBM Research, Yorktown Heights, NY, United States.

National Taiwan University, Taipei, Taiwan.

Publication information

JMIR Med Inform. 2020 Nov 27;8(11):e22508. doi: 10.2196/22508.

DOI:10.2196/22508
PMID:33245284
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7732709/
Abstract

BACKGROUND

Although electronic health records (EHRs) have been widely adopted in health care, effective use of EHR data is often limited because of redundant information in clinical notes introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2)/Open Health Natural Language Processing Consortium (OHNLP) clinical semantic textual similarity (ClinicalSTS) shared task.

OBJECTIVE

This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models.

METHODS

The ClinicalSTS data set consists of 1642 pairs of deidentified clinical text snippets annotated in a continuous scale of 0-5, indicating degrees of semantic similarity. We developed an iterative intermediate training approach using multi-task learning (IIT-MTL), a multi-task training approach that employs iterative data set selection. We applied this process to bidirectional encoder representations from transformers on clinical text mining (ClinicalBERT), a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. We incrementally ensembled the output from applying IIT-MTL on ClinicalBERT with the output of other language models (bidirectional encoder representations from transformers for biomedical text mining [BioBERT], multi-task deep neural networks [MT-DNN], and robustly optimized BERT approach [RoBERTa]) and handcrafted features using regression-based learning algorithms. On the basis of these experiments, we adopted the top-performing configurations as our official submissions.
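The regression-based ensembling step described above can be illustrated with a minimal sketch. It assumes ordinary least squares as the regression learner and uses only per-model similarity scores as inputs; the abstract does not specify which regression algorithms were used, and the function names `fit_linear_stacker` and `predict` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def fit_linear_stacker(model_scores: Sequence[Sequence[float]],
                       gold: Sequence[float]) -> List[float]:
    """Fit least-squares weights (bias first) that map per-model scores to gold scores.

    Assumption: plain linear regression stands in for the unspecified
    "regression-based learning algorithms" in the abstract.
    """
    rows = [[1.0, *scores] for scores in model_scores]  # prepend a bias term
    k = len(rows[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, gold))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("singular system: model scores are collinear")
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Combine one pair's per-model scores and clamp to the 0-5 annotation scale."""
    raw = weights[0] + sum(w * s for w, s in zip(weights[1:], scores))
    return min(5.0, max(0.0, raw))


# Toy data (invented for illustration): two models' scores per sentence pair.
train_scores = [(1.0, 2.0), (3.0, 1.0), (4.0, 4.0), (0.0, 5.0), (2.0, 3.0)]
train_gold = [1.5, 2.0, 4.0, 2.5, 2.5]
weights = fit_linear_stacker(train_scores, train_gold)
```

In practice the stacker would be fit on held-out predictions rather than on the same pairs used to train the underlying models, and handcrafted features could be appended as extra columns alongside the model scores.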

RESULTS

Our system ranked first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN, and handcrafted medication features.
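For reference, the Pearson correlation coefficient used to score systems measures the linear agreement between predicted and gold similarity scores. A minimal sketch of the computation follows; the function name `pearson` and the example values are illustrative only and are not drawn from the challenge data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted scores xs and gold scores ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: predictions that track the gold scores closely but not exactly.
gold = [0.0, 1.5, 3.0, 4.0, 5.0]
predicted = [0.4, 1.2, 3.3, 3.8, 4.9]
score = pearson(predicted, gold)
```

Because the coefficient is invariant to shifting and positive rescaling of the predictions, it rewards getting the ordering and relative spacing of pairs right rather than matching the 0-5 values exactly.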

CONCLUSIONS

This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited data set. This contribution opens new avenues of exploration for optimized data set selection to generate more robust and universal contextual representations of text in the clinical domain.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/082f/7732709/f6cce9af3fc9/medinform_v8i11e22508_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/082f/7732709/17f81196741b/medinform_v8i11e22508_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/082f/7732709/24d10cf07097/medinform_v8i11e22508_fig3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/082f/7732709/36267a3a0425/medinform_v8i11e22508_fig4.jpg
