ALBERT-Based Self-Ensemble Model With Semisupervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study.

Authors

Li Junyi, Zhang Xuejie, Zhou Xiaobing

Affiliation

School of Information Science and Engineering, Yunnan University, Kunming, China.

Publication

JMIR Med Inform. 2021 Jan 22;9(1):e23086. doi: 10.2196/23086.

DOI:10.2196/23086

PMID:33480858

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7864778/

Abstract

BACKGROUND

In recent years, with increases in the amount of information available and the importance of information screening, increased attention has been paid to the calculation of textual semantic similarity. In the field of medicine, electronic medical records and medical research documents have become important data resources for clinical research. Medical textual semantic similarity calculation has become an urgent problem to be solved.

OBJECTIVE

This research aims to solve 2 problems: (1) medical data sets are small, so models cannot learn and understand the text sufficiently, and (2) information is lost during long-distance propagation, so models fail to grasp key information.

METHODS

This paper combines a text data augmentation method and a self-ensemble ALBERT model under semisupervised learning to perform clinical textual semantic similarity calculations.
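The abstract does not spell out how the self-ensemble is built. One common reading of the term, assumed here, is that several checkpoints of the same model each score the test sentence pairs and the scores are averaged. The sketch below shows only that averaging step; the function name `self_ensemble_predict` and the input layout are hypothetical and are not taken from the paper.

```python
def self_ensemble_predict(checkpoint_predictions):
    """Average similarity scores across checkpoints of one model.

    checkpoint_predictions: list of lists, one inner list per checkpoint,
    each holding one predicted score per sentence pair (assumed layout).
    """
    if not checkpoint_predictions:
        raise ValueError("need predictions from at least one checkpoint")
    n_pairs = len(checkpoint_predictions[0])
    if any(len(p) != n_pairs for p in checkpoint_predictions):
        raise ValueError("every checkpoint must score the same sentence pairs")
    # zip(*...) groups the scores that belong to the same sentence pair.
    return [sum(scores) / len(scores) for scores in zip(*checkpoint_predictions)]


if __name__ == "__main__":
    # Two hypothetical checkpoints scoring two sentence pairs.
    print(self_ensemble_predict([[1.0, 4.0], [3.0, 2.0]]))
```

Averaging predictions is only one possible design; averaging model weights across checkpoints is another, and the abstract does not say which the authors used.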

RESULTS

Compared with the methods in the 2019 National Natural Language Processing Clinical Challenges Open Health Natural Language Processing shared task Track on Clinical Semantic Textual Similarity, our method surpasses the best result by 2 percentage points and achieves a Pearson correlation coefficient of 0.92.
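The reported metric is the Pearson correlation coefficient between predicted and gold similarity scores. As a minimal sketch of how such a score is computed, assuming two equal-length lists of numbers (the function name `pearson` and the sample values are illustrative, not data from the study):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0]       # illustrative gold scores
    predicted = [2.0, 4.0, 6.0]  # illustrative predictions
    print(pearson(gold, predicted))
```

The coefficient measures linear agreement only, so predictions on a different scale from the gold scores can still correlate perfectly.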

CONCLUSIONS

When a medical data set is small, data augmentation can increase its size, and improved semisupervised learning can boost the learning efficiency of the model. Additionally, self-ensemble methods improve model performance. Our method performed very well and has great potential to help with related medical problems.

