Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan.
National Centre for Text Mining, Computer Science, University of Manchester, 131 Princess Street, M7 1DN, UK.
Bioinformatics. 2022 Jan 12;38(3):872-874. doi: 10.1093/bioinformatics/btab702.
Large-scale pre-trained language models (PLMs) have achieved state-of-the-art (SOTA) performance on a range of biomedical text mining tasks. The power of such PLMs can be combined with the advantages of deep generative models; the OPTIMUS framework is one such combination, integrating variational autoencoders (VAEs) with PLMs. However, these models are trained only on general-domain text, and biomedical counterparts have been missing. In this work, we describe BioVAE, the first large-scale pre-trained latent variable language model for the biomedical domain, trained on large volumes of biomedical text using the OPTIMUS framework. The model achieves SOTA performance on several biomedical text mining tasks when compared to existing publicly available biomedical PLMs. In addition, our model generates more accurate biomedical sentences than the original OPTIMUS model.
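The core mechanism behind VAE-based latent variable language models such as OPTIMUS is a latent bottleneck: an encoder maps a sentence to a Gaussian posterior, a sample is drawn via the reparameterization trick, and a KL term regularizes the posterior toward a standard normal prior. The following is a minimal NumPy sketch of these two ingredients only; it is a conceptual illustration, not BioVAE's implementation, and the toy encoder outputs (`mu`, `logvar`) are made-up values standing in for a real encoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); in a real VAE this
    # keeps the sample differentiable w.r.t. mu and logvar.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Hypothetical "encoder" output for one sentence: a 4-dim latent posterior.
mu = np.array([0.5, -0.2, 0.0, 0.1])
logvar = np.array([-1.0, -0.5, 0.0, -2.0])

z = reparameterize(mu, logvar, rng)     # latent code fed to the decoder
kl = kl_to_standard_normal(mu, logvar)  # KL regularizer in the VAE loss
print(z.shape, float(kl))
```

In the full model, the training objective combines a reconstruction loss from the decoder (a pre-trained language model conditioned on `z`) with this KL term.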
Our source code and pre-trained models are freely available: https://github.com/aistairc/BioVAE.
Supplementary data are available at Bioinformatics online.