Few-Shot Learning for Clinical Natural Language Processing Using Siamese Neural Networks: Algorithm Development and Validation Study.
Authors
Oniani David, Chandrasekar Premkumar, Sivarajkumar Sonish, Wang Yanshan
Affiliations
Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States.
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States.
Publication
JMIR AI. 2023 May 4;2:e44293. doi: 10.2196/44293.
BACKGROUND
Natural language processing (NLP) has become an emerging technology in health care that leverages the large amount of free-text data in electronic health records to improve patient care, support clinical decisions, and facilitate clinical and translational science research. Recently, deep learning has achieved state-of-the-art performance in many clinical NLP tasks. However, training deep learning models often requires large annotated data sets, which are normally not publicly available and can be time-consuming to build in clinical domains. Working with smaller annotated data sets is typical in clinical NLP; therefore, ensuring that deep learning models perform well under these conditions is crucial for real-world clinical NLP applications. A widely adopted approach is fine-tuning existing pretrained language models, but this approach falls short when the training data set contains only a few annotated samples. Few-shot learning (FSL) has recently been investigated to tackle this problem. The Siamese neural network (SNN) has been widely used as an FSL approach in computer vision but has not been well studied in NLP, and the literature on its applications in clinical domains is scarce.
OBJECTIVE
The aim of our study is to propose and evaluate SNN-based approaches for few-shot clinical NLP tasks.
METHODS
We propose 2 SNN-based FSL approaches: a pretrained SNN and an SNN with second-order embeddings. We evaluate both approaches on a clinical sentence classification task under 3 few-shot settings: 4-shot, 8-shot, and 16-shot learning. The task is benchmarked using the following 4 pretrained language models: bidirectional encoder representations from transformers (BERT), BERT for biomedical text mining (BioBERT), BioBERT trained on clinical notes (BioClinicalBERT), and generative pretrained transformer 2 (GPT-2). We also compare the SNN-based approaches with a prompt-based GPT-2 approach.
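The core Siamese idea described above, classifying a query by its similarity to a handful of labeled support examples, can be sketched in a few lines. This is a minimal illustration only: it assumes sentence embeddings have already been produced by a pretrained encoder such as BioBERT (the encoding step is not shown), and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def few_shot_classify(query_emb, support_embs, support_labels):
    # Siamese-style few-shot classification: compare the query embedding
    # against each labeled support embedding and return the label of the
    # most similar support example (nearest-neighbor matching).
    sims = [cosine_sim(query_emb, s) for s in support_embs]
    return support_labels[int(np.argmax(sims))]

# Toy 4-shot example with hand-crafted 2-dimensional "embeddings";
# in practice these would come from a pretrained encoder.
support_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = ["positive", "positive", "negative", "negative"]
query_emb = np.array([0.8, 0.2])
print(few_shot_classify(query_emb, support_embs, support_labels))
```

In a trained SNN, the two inputs pass through a shared encoder and the similarity function is learned (e.g., with a contrastive objective) rather than fixed to raw cosine similarity; the matching step over the support set, however, has this same shape.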
RESULTS
In 4-shot sentence classification tasks, GPT-2 had the highest precision (0.63), but its recall (0.38) and F score (0.42) were lower than those of BioBERT-based pretrained SNN (0.45 and 0.46, respectively). In both 8-shot and 16-shot settings, SNN-based approaches outperformed GPT-2 in all 3 metrics of precision, recall, and F score.
CONCLUSIONS
The experimental results verified the effectiveness of the proposed SNN approaches for few-shot clinical NLP tasks.