

Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT.

Affiliations

Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Nara, Japan.

Publication Information

Methods Inf Med. 2021 Jun;60(S 01):e56-e64. doi: 10.1055/s-0041-1731390. Epub 2021 Jul 8.

DOI: 10.1055/s-0041-1731390
PMID: 34237783
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8294940/
Abstract

BACKGROUND

Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain; however, few resources exist for STS tasks in the clinical domain or in languages other than English, such as Japanese.

OBJECTIVE

The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a Japanese dataset that is publicly available.

MATERIALS

We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database. The EMR dataset was created from Japanese electronic medical records.

METHODS

We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular approach for transfer learning and has been proven to be effective in achieving high accuracy for small datasets. We implemented two Japanese pretrained BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts while the clinical Japanese BERT is pretrained on Japanese clinical texts.
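The similarity-scoring step described above can be sketched in a few lines. This is a minimal illustration of mean pooling and cosine scoring, assuming token embeddings have already been produced by a pretrained Japanese BERT (e.g., the last hidden state of a general or clinical Japanese model); the small mock arrays below stand in for the model's outputs, not the paper's actual pipeline:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over real tokens, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock per-token embeddings for two sentences (in practice these would be
# BERT's last hidden states); the third token of sentence A is padding.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
mask_a = np.array([1, 1, 0])
emb_b = np.array([[1.0, 1.0]])
mask_b = np.array([1])

score = cosine_similarity(mean_pool(emb_a, mask_a), mean_pool(emb_b, mask_b))
print(score)  # → 1.0 (the pooled vectors point in the same direction)
```

The fine-tuned models in the paper instead regress a similarity score directly from BERT; the pooling-plus-cosine sketch is just the simplest way to turn contextual token embeddings into a sentence-level score.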

RESULTS

The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT, achieving a high correlation with human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical domain datasets. This may be because the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT.
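The correlations with human scores reported above are standard Pearson coefficients between model predictions and human ratings. A self-contained sketch, using small hypothetical score lists rather than the paper's data:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical model similarity scores vs. human ratings for four sentence pairs.
model_scores = [0.1, 0.4, 0.7, 0.9]
human_scores = [0.0, 1.5, 3.0, 5.0]
r = pearson(model_scores, human_scores)
```

A value near 1.0 (as with 0.904 on the CR dataset) means the model's ranking of pair similarity closely tracks human judgment, even if the raw scales differ.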


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/557e/8294940/46bd346d37f1/10-1055-s-0041-1731390-i21010015-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/557e/8294940/e9bdf67a4e59/10-1055-s-0041-1731390-i21010015-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/557e/8294940/7bac74999991/10-1055-s-0041-1731390-i21010015-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/557e/8294940/d5cb1e8f7794/10-1055-s-0041-1731390-i21010015-3.jpg

Similar Articles

1. Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT. Methods Inf Med. 2021 Jun;60(S 01):e56-e64. doi: 10.1055/s-0041-1731390. Epub 2021 Jul 8.
2. Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models. JMIR Med Inform. 2020 Nov 23;8(11):e19735. doi: 10.2196/19735.
3. An Evaluation of Pretrained BERT Models for Comparing Semantic Similarity Across Unstructured Clinical Trial Texts. Stud Health Technol Inform. 2022 Jan 14;289:18-21. doi: 10.3233/SHTI210848.
4. Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT. Artif Intell Med. 2024 Jul;153:102889. doi: 10.1016/j.artmed.2024.102889. Epub 2024 May 5.
5. A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation. J Med Internet Res. 2021 Aug 9;23(8):e28229. doi: 10.2196/28229.
6. Adapting Bidirectional Encoder Representations from Transformers (BERT) to Assess Clinical Semantic Textual Similarity: Algorithm Development and Validation Study. JMIR Med Inform. 2021 Feb 3;9(2):e22795. doi: 10.2196/22795.
7. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview. JMIR Med Inform. 2020 Nov 27;8(11):e23375. doi: 10.2196/23375.
8. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
9. Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison. JMIR Med Inform. 2021 Nov 26;9(11):e23101. doi: 10.2196/23101.
10. Identifying the Question Similarity of Regulatory Documents in the Pharmaceutical Industry by Using the Recognizing Question Entailment System: Evaluation Study. JMIR AI. 2023 Sep 26;2:e43483. doi: 10.2196/43483.

Cited By

1. Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study. JMIR Med Inform. 2025 Jun 10;13:e71687. doi: 10.2196/71687.
2. Clinical Information Retrieval: A Literature Review. J Healthc Inform Res. 2024 Jan 23;8(2):313-352. doi: 10.1007/s41666-024-00159-4. eCollection 2024 Jun.
3. Moving toward a standardized diagnostic statement of pituitary adenoma using an information extraction model: a real-world study based on electronic medical records. BMC Med Inform Decis Mak. 2022 Dec 7;22(1):319. doi: 10.1186/s12911-022-02031-0.

References

1. A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One. 2021 Nov 9;16(11):e0259763. doi: 10.1371/journal.pone.0259763. eCollection 2021.
2. An Ensemble Semantic Textual Similarity Measure Based on Multiple Evidences for Biomedical Documents. Comput Math Methods Med. 2022 Aug 27;2022:8238432. doi: 10.1155/2022/8238432. eCollection 2022.
3. Semantic textual similarity for modern standard and dialectal Arabic using transfer learning. PLoS One. 2022 Aug 11;17(8):e0272991. doi: 10.1371/journal.pone.0272991. eCollection 2022.
4. An Entity Relationship Extraction Model Based on BERT-BLSTM-CRF for Food Safety Domain. Comput Intell Neurosci. 2022 Apr 28;2022:7773259. doi: 10.1155/2022/7773259. eCollection 2022.