Li Junyi, Zhang Xuejie, Zhou Xiaobing
School of Information Science and Engineering, Yunnan University, Kunming, China.
JMIR Med Inform. 2021 Jan 22;9(1):e23086. doi: 10.2196/23086.
In recent years, with increases in the amount of information available and the importance of information screening, increased attention has been paid to the calculation of textual semantic similarity. In the field of medicine, electronic medical records and medical research documents have become important data resources for clinical research. Medical textual semantic similarity calculation has become an urgent problem to be solved.
This research aims to solve 2 problems: (1) medical data sets are small, so models cannot learn and understand the data sufficiently, and (2) information is lost during long-distance propagation, so models fail to grasp key information.
This paper combines a text data augmentation method and a self-ensemble ALBERT model under semisupervised learning to perform clinical textual semantic similarity calculations.
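The abstract does not state which self-ensemble variant is used. As a minimal sketch, assuming the common variant that averages the parameters of several checkpoints saved during one training run, the idea can be illustrated as follows. The function name `average_checkpoints` and the representation of parameters as plain lists of floats are hypothetical and chosen only for illustration; they are not the authors' implementation.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts saved during one training run.

    Assumption (not from the paper): each checkpoint is a dict that maps a
    parameter name to a flat list of floats, and all checkpoints share the
    same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        if any(len(v) != len(vectors[0]) for v in vectors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip(*vectors) groups the i-th value of every checkpoint together.
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


# Toy parameters, invented for illustration only.
step_100 = {"w": [1.0, 2.0]}
step_200 = {"w": [3.0, 4.0]}
ensemble = average_checkpoints([step_100, step_200])
```

Averaging checkpoints of a single model keeps inference cost at that of one model, which is one reason this variant is often preferred over ensembling separately trained models.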
Compared with the methods in the 2019 National Natural Language Processing Clinical Challenges Open Health Natural Language Processing shared task Track on Clinical Semantic Textual Similarity, our method surpasses the best result by 2 percentage points and achieves a Pearson correlation coefficient of 0.92.
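For reference, the Pearson correlation coefficient used as the evaluation metric measures the linear agreement between predicted and gold similarity scores. The sketch below computes it from its standard definition; the function name `pearson` and the sample scores are illustrative assumptions, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold and predicted similarity scores, for illustration only.
gold = [0.5, 1.0, 2.5, 4.0, 5.0]
predicted = [0.8, 1.2, 2.0, 3.9, 4.6]
score = pearson(gold, predicted)
```

A value of 1 means the predictions are a perfect increasing linear function of the gold scores; the metric is insensitive to a constant offset or positive rescaling of the predictions.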
When a medical data set is small, data augmentation can increase its size, and improved semisupervised learning can boost the learning efficiency of the model. Additionally, self-ensemble methods improve model performance. Our method performs well and has great potential to help solve related medical problems.