Ge Yao, Guo Yuting, Yang Yuan-Chi, Al-Garadi Mohammed Ali, Sarker Abeed
Department of Biomedical Informatics School of Medicine, Emory University Atlanta, GA.
Proc (IEEE Int Conf Healthc Inform). 2022 Jun;2022:84-89. doi: 10.1109/ichi54592.2022.00024. Epub 2022 Sep 8.
Many research problems involving medical texts have limited amounts of annotated data available (e.g., expressions of rare diseases). Traditional supervised machine learning algorithms, particularly those based on deep neural networks, require large volumes of annotated data, and they underperform when only small amounts of labeled data are available. Few-shot learning (FSL) is a category of machine learning models designed to solve problems for which only small annotated datasets are available. However, no current study compares the performance of FSL models with that of traditional models (e.g., conditional random fields) on medical text at different training set sizes. In this paper, we attempt to fill this gap by comparing multiple FSL models with traditional models on the task of named entity recognition (NER) from medical texts. Using five health-related annotated NER datasets, we benchmarked three traditional NER models based on BERT: BERT-Linear Classifier (BLC), BERT-CRF (BC), and SANER; and three FSL NER models: StructShot & NNShot, Few-Shot Slot Tagging (FS-ST), and ProtoNER. Our benchmarking results show that almost all models, whether traditional or FSL, perform significantly below the state of the art when trained on small amounts of data. In our NER experiments, F-scores were very low with small training sets, typically below 30%. FSL models reported to perform well on non-medical texts substantially underperformed their reported best results on medical texts. Our experiments also suggest that FSL methods tend to perform worse on datasets drawn from noisy sources of medical text, such as social media (which includes misspellings and colloquial expressions), than on less noisy sources such as medical literature.
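For context, NER benchmarks such as those described above typically report entity-level F-scores: a predicted entity counts as correct only if its type and full span exactly match a gold entity. The sketch below (hypothetical helper names; standard BIO tagging scheme assumed) illustrates how such a score is computed:

```python
# Minimal sketch of entity-level F1 for NER over BIO-tagged sequences.
# An entity is a (type, start, end) span; credit requires an exact match.
# Helper names are illustrative, not from any of the benchmarked systems.

def extract_spans(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current entity
        else:  # ill-formed I- tag: close the open span, start a new one
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
    return set(spans)

def ner_f1(gold_tags, pred_tags):
    """Entity-level F1: exact span-and-type matches between gold and prediction."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

This strict exact-match criterion is one reason small-training-set F-scores drop so sharply: a model that finds an entity but misjudges its boundary by one token receives no credit.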
Our experiments demonstrate that the current state-of-the-art FSL systems are not yet suitable for effective NER in medical natural language processing tasks, and further research needs to be carried out to improve their performances. Creation of specialized, standardized datasets replicating real-world scenarios may help to move this category of methods forward.