National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, 20894, USA.
School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116023, China.
Sci Data. 2019 May 10;6(1):52. doi: 10.1038/s41597-019-0055-0.
Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring both the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds great potential for improving the quality of word representations, as suggested by some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings yield significantly improved performance over the previous state of the art on those challenging tasks.
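To make "subword information" concrete, here is a minimal sketch of how a word can be broken into character n-grams and its vector composed from n-gram vectors. It assumes a fastText-style scheme with `<` and `>` boundary markers and n-gram lengths of 3 to 6; the abstract does not spell out these details, and the function names `char_ngrams` and `compose_vector` and the toy vectors are hypothetical, chosen for illustration only.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in boundary markers.

    The markers '<' and '>' and the default lengths are assumptions
    borrowed from fastText-style subword models.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def compose_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams of `word` (simplified sketch).

    `ngram_vectors` is a hypothetical dict mapping n-gram -> list of floats.
    Unknown n-grams are skipped, so a rare or unseen word can still receive
    a vector from the n-grams it shares with other words.
    """
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            for k in range(dim):
                total[k] += vec[k]
    return total


if __name__ == "__main__":
    print(char_ngrams("gene", 3, 3))
    toy_vectors = {"<ge": [1.0, 0.0], "ne>": [0.0, 2.0]}  # toy values only
    print(compose_vector("gene", toy_vectors, dim=2))
```

Because related biomedical terms often share long substrings, a composition of this kind lets them share parameters, which is the intuition behind using the internal structure of words.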