School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
School of Computer Science and Engineering, Dalian Minzu University, Dalian 116600, China.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad496.
Few-shot learning that can effectively perform named entity recognition in low-resource scenarios has attracted growing attention, but it has not yet been widely studied in the biomedical field. In contrast to high-resource domains, biomedical named entity recognition (BioNER) often has only limited human-labeled data in real-world scenarios, which leads to poor generalization when training on only a few labeled instances. Recent approaches either leverage cross-domain high-resource data or fine-tune a pre-trained masked language model on limited labeled samples to generate new synthetic data; the former is prone to domain shift, and the latter tends to yield low-quality synthetic data. Therefore, in this article, we study a more realistic scenario, i.e. few-shot learning for BioNER.
Leveraging a domain knowledge graph, we propose knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on the similar semantic relations of neighbor nodes. In addition, by introducing question prompts, we cast BioNER as a question-answering task and propose prompt contrastive learning, which improves the robustness of the model by measuring the mutual information between query-answer pairs. Extensive experiments in various few-shot settings show that the proposed framework achieves superior performance. In particular, in a low-resource scenario with only 20 samples, our approach substantially outperforms recent state-of-the-art models on four benchmark datasets, achieving an average improvement of up to 7.1% F1.
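As a rough illustration of the two ideas above, the following Python sketch swaps a labeled entity for knowledge-graph neighbors that share a relation with it, then recasts each sentence as a question-answer pair. It is not the authors' implementation. The toy triples, the function names (`similar_entities`, `generate_instances`, `to_qa`), and the question template are all hypothetical and chosen only for illustration, and the prompt contrastive learning step is not covered.

```python
import random

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("naproxen", "treats", "headache"),
    ("insulin", "treats", "diabetes"),
]


def similar_entities(entity, triples):
    """Return heads that share at least one (relation, tail) pair with `entity`."""
    context = {(r, t) for h, r, t in triples if h == entity}
    return sorted({h for h, r, t in triples if h != entity and (r, t) in context})


def generate_instances(tokens, span, triples, k=2, seed=0):
    """Create up to k new sentences by replacing the entity at tokens[start:end].

    Replacements are sampled from entities with similar relations in the graph.
    Returns a list of (new_tokens, new_span) pairs.
    """
    start, end = span
    entity = " ".join(tokens[start:end]).lower()
    candidates = similar_entities(entity, triples)
    rng = random.Random(seed)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    instances = []
    for candidate in chosen:
        replacement = candidate.split()
        new_tokens = tokens[:start] + replacement + tokens[end:]
        instances.append((new_tokens, (start, start + len(replacement))))
    return instances


def to_qa(tokens, span, entity_type):
    """Cast one labeled sentence as a question-answering example (assumed template)."""
    start, end = span
    return {
        "question": f"Which {entity_type} entities are mentioned in the text?",
        "context": " ".join(tokens),
        "answer": " ".join(tokens[start:end]),
    }


if __name__ == "__main__":
    sentence = ["Aspirin", "relieves", "mild", "headache", "."]
    for new_tokens, new_span in generate_instances(sentence, (0, 1), TRIPLES):
        print(to_qa(new_tokens, new_span, "chemical"))
```

In this sketch, "insulin" is never proposed as a substitute for "aspirin" because the two share no (relation, tail) pair, which is one simple way to keep generated entities semantically close to the original.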
Our source code and data are available at https://github.com/cpmss521/KGPC.