Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Nara, Japan.
Methods Inf Med. 2021 Jun;60(S 01):e56-e64. doi: 10.1055/s-0041-1731390. Epub 2021 Jul 8.
Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, few resources exist for STS tasks in the clinical domain or in languages other than English, such as Japanese.
The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a publicly available Japanese dataset.
We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database. The EMR dataset was created from Japanese electronic medical records.
We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between clinical domain texts. BERT is a popular approach for transfer learning and has proven effective at achieving high accuracy on small datasets. We implemented two Japanese pretrained BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while the clinical Japanese BERT is pretrained on Japanese clinical texts.
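A common way to score sentence-pair similarity with a pretrained BERT encoder is to mean-pool its token embeddings into one vector per sentence and compare the vectors with cosine similarity. The sketch below illustrates that scoring step only; the embeddings are random placeholders standing in for actual BERT output, and the pooling/similarity functions are illustrative helpers, not the authors' exact pipeline.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; STS systems often rescale this to the annotation range."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder arrays standing in for BERT output (seq_len x hidden_dim);
# in practice these would come from a Japanese pretrained BERT model.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 768))
emb_b = rng.normal(size=(8, 768))
mask = np.ones(8, dtype=int)

sent_a = mean_pool(emb_a, mask)
sent_b = mean_pool(emb_b, mask)
score = cosine_similarity(sent_a, sent_b)
```

In a fine-tuning setup, the similarity score would instead be produced by a regression head trained against the human annotations, but the pooled-embedding comparison above is the usual zero-shot baseline.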
The BERT models performed well in capturing semantic similarity on our datasets. The general Japanese BERT outperformed the clinical Japanese BERT, achieving a high correlation with human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical domain datasets. This may be because the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT.
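Correlation with human scores of the kind reported above is typically measured with Pearson's r between model similarity scores and annotator ratings. A minimal sketch, using illustrative numbers rather than the study's actual predictions:

```python
import numpy as np

def pearson_correlation(pred: np.ndarray, gold: np.ndarray) -> float:
    """Pearson's r between model similarity scores and human annotations."""
    return float(np.corrcoef(pred, gold)[0, 1])

# Illustrative score pairs only (e.g., on a 0-5 annotation scale);
# not the datasets or model outputs from this study.
human = np.array([0.0, 1.5, 2.5, 4.0, 5.0])
model = np.array([0.2, 1.4, 2.8, 3.9, 4.7])
r = pearson_correlation(model, human)
```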