

Multifaceted Natural Language Processing Task-Based Evaluation of Bidirectional Encoder Representations From Transformers Models for Bilingual (Korean and English) Clinical Notes: Algorithm Development and Validation.

Affiliations

Interdisciplinary Program for Bioengineering, Seoul National University, Seoul, Republic of Korea.

Seoul National University Medical Research Center, Seoul, Republic of Korea.

Publication

JMIR Med Inform. 2024 Oct 30;12:e52897. doi: 10.2196/52897.

DOI:10.2196/52897
PMID:39475725
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11539635/
Abstract

BACKGROUND

The bidirectional encoder representations from transformers (BERT) model has attracted considerable attention in clinical applications, such as patient classification and disease prediction. However, current studies have typically progressed to application development without a thorough assessment of the model's comprehension of clinical context. Furthermore, limited comparative studies have been conducted on BERT models using medical documents from non-English-speaking countries. Therefore, the applicability of BERT models trained on English clinical notes to non-English contexts is yet to be confirmed. To address these gaps in the literature, this study focused on identifying the most effective BERT model for non-English clinical notes.

OBJECTIVE

In this study, we evaluated the contextual understanding abilities of various BERT models applied to mixed Korean and English clinical notes. The objective of this study was to identify the BERT model that excels in understanding the context of such documents.

METHODS

Using data from 164,460 patients in a South Korean tertiary hospital, we pretrained BERT-base, BERT for Biomedical Text Mining (BioBERT), Korean BERT (KoBERT), and Multilingual BERT (M-BERT) to improve their contextual comprehension capabilities and subsequently compared their performances in 7 fine-tuning tasks.
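The four pretrained variants differ chiefly in their subword vocabularies, which determines how many Korean words survive tokenization rather than collapsing to [UNK]. A toy sketch of BERT-style greedy WordPiece splitting illustrates the mechanism; the vocabulary and words below are hypothetical, not taken from any of the models in the study.

```python
# Toy greedy WordPiece-style tokenizer (hypothetical vocabulary), illustrating
# why a model whose vocabulary lacks Korean subwords maps Korean words to [UNK].

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:  # no subword matches: the whole word becomes [UNK]
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical English-centric vocabulary containing no Korean pieces.
en_vocab = {"pain", "##ful", "chest", "[UNK]"}
print(wordpiece_tokenize("painful", en_vocab))  # ['pain', '##ful']
print(wordpiece_tokenize("통증", en_vocab))      # ['[UNK]']
```

A multilingual vocabulary (as in M-BERT) would include Korean subword pieces, so the second word would split into real pieces instead of [UNK], which is consistent with the reading-comprehension result reported below.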

RESULTS

The model performance varied based on the task and token usage. First, BERT-base and BioBERT excelled in tasks using classification ([CLS]) token embeddings, such as document classification. BioBERT achieved the highest F1-score of 89.32. Both BERT-base and BioBERT demonstrated their effectiveness in document pattern recognition, even with limited Korean tokens in the dictionary. Second, M-BERT exhibited a superior performance in reading comprehension tasks, achieving an F1-score of 93.77. Better results were obtained when fewer words were replaced with unknown ([UNK]) tokens. Third, M-BERT excelled in the knowledge inference task in which correct disease names were inferred from 63 candidate disease names in a document with disease names replaced with [MASK] tokens. M-BERT achieved the highest hit@10 score of 95.41.
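The hit@10 score used in the knowledge-inference task can be sketched as follows: for each [MASK]ed disease mention, the model ranks the 63 candidate names, and a prediction counts as a hit if the true name appears in the top 10. The rankings below are made up for illustration; this is not the authors' evaluation code.

```python
# Minimal sketch of the hit@k metric (hypothetical data, not the study's).

def hit_at_k(ranked_candidates, true_label, k=10):
    """1 if the true label appears among the top-k ranked candidates, else 0."""
    return int(true_label in ranked_candidates[:k])

def mean_hit_at_k(examples, k=10):
    """Average hit@k over (ranking, true_label) pairs, as a percentage."""
    hits = [hit_at_k(ranking, label, k) for ranking, label in examples]
    return 100.0 * sum(hits) / len(hits)

# Two toy examples with made-up candidate rankings over disease names.
examples = [
    (["pneumonia", "sepsis", "asthma"], "sepsis"),  # hit: true label ranked 2nd
    (["asthma", "sepsis", "pneumonia"], "copd"),    # miss: true label not ranked
]
print(mean_hit_at_k(examples, k=10))  # 50.0
```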

CONCLUSIONS

This study highlighted the effectiveness of various BERT models in a multilingual clinical domain. The findings can be used as a reference in clinical and language-based applications.


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55df/11539635/b0e781ef090c/medinform-v12-e52897-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55df/11539635/822e1b9ee88a/medinform-v12-e52897-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55df/11539635/5a2a2bf6ff81/medinform-v12-e52897-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55df/11539635/451bf8171b72/medinform-v12-e52897-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55df/11539635/1907dd98cba0/medinform-v12-e52897-g005.jpg

Similar Articles

1. Multifaceted Natural Language Processing Task-Based Evaluation of Bidirectional Encoder Representations From Transformers Models for Bilingual (Korean and English) Clinical Notes: Algorithm Development and Validation. JMIR Med Inform. 2024 Oct 30;12:e52897. doi: 10.2196/52897.
2. Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study. JMIR Med Inform. 2019 Sep 12;7(3):e14830. doi: 10.2196/14830.
3. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
4. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.
5. Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study. JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
6. A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study. J Med Internet Res. 2023 Oct 4;25:e49944. doi: 10.2196/49944.
7. Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT. Artif Intell Med. 2024 Jul;153:102889. doi: 10.1016/j.artmed.2024.102889. Epub 2024 May 5.
8. Modified Bidirectional Encoder Representations From Transformers Extractive Summarization Model for Hospital Information Systems Based on Character-Level Tokens (AlphaBERT): Development and Performance Evaluation. JMIR Med Inform. 2020 Apr 29;8(4):e17787. doi: 10.2196/17787.
9. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol. 2022;2496:221-235. doi: 10.1007/978-1-0716-2305-3_12.
10. When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification. BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.

Cited By

1. How electronic health literacy influences physical activity behaviour among university students: A moderated mediation model. PLoS One. 2025 Aug 29;20(8):e0330637. doi: 10.1371/journal.pone.0330637. eCollection 2025.

References

1. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. 2023 May 21;15(5):e39305. doi: 10.7759/cureus.39305. eCollection 2023 May.
2. Extracting social determinants of health events with transformer-based multitask, multilabel named entity recognition. J Am Med Inform Assoc. 2023 Jul 19;30(8):1379-1388. doi: 10.1093/jamia/ocad046.
3. The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc. 2023 Jul 19;30(8):1367-1378. doi: 10.1093/jamia/ocad012.
4. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med. 2022 Dec 21;5(1):186. doi: 10.1038/s41746-022-00730-6.
5. Protected Health Information Recognition by Fine-Tuning a Pre-training Transformer Model. Healthc Inform Res. 2022 Jan;28(1):16-24. doi: 10.4258/hir.2022.28.1.16. Epub 2022 Jan 31.
6. Natural language inference for curation of structured clinical registries from unstructured text. J Am Med Inform Assoc. 2021 Dec 28;29(1):97-108. doi: 10.1093/jamia/ocab243.
7. Biomedical and clinical English model packages for the Stanza Python NLP library. J Am Med Inform Assoc. 2021 Aug 13;28(9):1892-1899. doi: 10.1093/jamia/ocab090.
8. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021 May 20;4(1):86. doi: 10.1038/s41746-021-00455-y.
9. Clinical concept extraction using transformers. J Am Med Inform Assoc. 2020 Dec 9;27(12):1935-1942. doi: 10.1093/jamia/ocaa189.
10. Question-driven summarization of answers to consumer health questions. Sci Data. 2020 Oct 2;7(1):322. doi: 10.1038/s41597-020-00667-z.