Ramos-Vargas Rigo E, Román-Godínez Israel, Torres-Ramos Sulema
Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Jalisco, México.
PeerJ Comput Sci. 2021 Feb 18;7:e384. doi: 10.7717/peerj-cs.384. eCollection 2021.
Increased interest in using word embeddings as word representations for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl) or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, on the grounds that they provide better coverage and better semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. Such studies, however, pay little attention to the influence of the word embeddings themselves, which makes it hard to observe their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) on two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (general) and Pyysalo PM + PMC (specific), as the only features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. The first set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of corpus size on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with several gold standards. The fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having lower word coverage and weaker internal semantic relationships than the classic specific word embedding, Pyysalo PM + PMC; among the contextualized word embeddings, the specific ones achieved the best results. We conclude, therefore, that when classic word embeddings are used as features on the BioNER task, the general ones can be considered a good option. When contextualized word embeddings are used, on the other hand, the specific ones are the best option.
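The abstract compares embeddings by word coverage and by internal semantic relationships but does not state how either was computed. The following is a minimal sketch under stated assumptions: coverage is taken to be the fraction of distinct corpus tokens that have a vector in the embedding, and semantic relatedness is taken to be cosine similarity between vectors. The function names, the toy tokens, and the two-dimensional vectors are all hypothetical and are not taken from the study.

```python
import math


def coverage(corpus_tokens, embedding_vocab):
    """Fraction of distinct corpus tokens found in the embedding vocabulary.

    Assumption: coverage is measured over token types, not occurrences.
    """
    types = set(corpus_tokens)
    if not types:
        return 0.0
    return len(types & set(embedding_vocab)) / len(types)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: a tiny tokenized corpus and a tiny embedding table.
toy_corpus = ["aspirin", "inhibits", "cox", "aspirin"]
toy_embedding = {
    "aspirin": [1.0, 0.0],
    "cox": [0.0, 1.0],
    "ibuprofen": [1.0, 0.0],
}

# Two of the three distinct corpus tokens have a vector.
print(coverage(toy_corpus, toy_embedding))
# Similarity between two hypothetical drug vectors.
print(cosine(toy_embedding["aspirin"], toy_embedding["ibuprofen"]))
```

In an evaluation like the third experiment, similarity scores of this kind would typically be correlated against human-rated word pairs from a gold standard; the specific gold standards and correlation measure used in the study are not given in the abstract.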