Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine.

Affiliation

Language Technology Laboratory, DTAL, University of Cambridge, 9 West Road, Cambridge, CB39DB, UK.

Publication information

BMC Bioinformatics. 2018 Feb 5;19(1):33. doi: 10.1186/s12859-018-2039-z.

DOI:10.1186/s12859-018-2039-z

PMID:29402212

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5800055/

Abstract

BACKGROUND

Word representations support a variety of Natural Language Processing (NLP) tasks. The quality of these representations is typically assessed by comparing the distances in the induced vector spaces against human similarity judgements. Whereas comprehensive evaluation resources have recently been developed for the general domain, similar resources for biomedicine currently suffer from a lack of coverage, both in the word types included and in the semantic distinctions they capture. Notably, verbs have been excluded, although they are essential for the interpretation of biomedical language. Further, current resources do not distinguish between semantic similarity and semantic relatedness, although this distinction has been shown to be an important predictor of the usefulness of word representations and of their performance in downstream applications.
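The intrinsic evaluation described above (comparing vector-space distances against human similarity judgements) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes cosine similarity as the vector-space measure and Spearman's rank correlation as the agreement score, neither of which is stated in the abstract, and the word vectors and human ratings are invented toy values rather than entries from Bio-SimVerb or Bio-SimLex.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Correlate model similarities with human ratings over covered pairs."""
    model_scores, human_scores = [], []
    for word1, word2, rating in rated_pairs:
        if word1 in vectors and word2 in vectors:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[word1], vectors[word2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Toy data, invented for illustration only.
toy_vectors = {
    "inhibit": [1.0, 0.0],
    "suppress": [0.9, 0.1],
    "bind": [0.5, 0.5],
    "activate": [0.0, 1.0],
}
toy_pairs = [
    ("inhibit", "suppress", 9.0),
    ("inhibit", "bind", 5.0),
    ("inhibit", "activate", 1.0),
]
toy_score = evaluate(toy_vectors, toy_pairs)
```

In this toy case the model ranks the three pairs in the same order as the invented ratings, so `toy_score` comes out close to 1. A real evaluation would load pretrained vectors and the published pair ratings instead.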

RESULTS

We present two novel comprehensive resources targeting the evaluation of word representations in biomedicine. These resources, Bio-SimVerb and Bio-SimLex, address the problems described above and can be used to evaluate verb and noun representations, respectively. In our experiments, we computed Pearson's correlation between performance on intrinsic and extrinsic tasks using twelve popular state-of-the-art representation models (e.g. word2vec models). The intrinsic-extrinsic correlations obtained with our datasets are notably higher than those obtained with previous intrinsic evaluation benchmarks such as UMNSRS and MayoSRS. In addition, when evaluating representation models on their ability to capture verb and noun semantics individually, we show considerable variation in performance across all models.
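The intrinsic-extrinsic comparison can be illustrated with a short sketch: each representation model contributes one intrinsic score (for example, its correlation with human ratings on a similarity dataset) and one extrinsic score (its performance on a downstream task), and Pearson's correlation is computed across models. The model names and all numbers below are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Hypothetical per-model scores, invented for illustration only.
# intrinsic: agreement with human similarity ratings on an evaluation set
# extrinsic: score on some downstream biomedical NLP task
model_scores = {
    "model_a": {"intrinsic": 0.30, "extrinsic": 0.70},
    "model_b": {"intrinsic": 0.45, "extrinsic": 0.74},
    "model_c": {"intrinsic": 0.50, "extrinsic": 0.73},
    "model_d": {"intrinsic": 0.62, "extrinsic": 0.80},
}

intrinsic = [scores["intrinsic"] for scores in model_scores.values()]
extrinsic = [scores["extrinsic"] for scores in model_scores.values()]

# A high value suggests the intrinsic benchmark predicts downstream quality.
intrinsic_extrinsic_r = pearson(intrinsic, extrinsic)
```

Under this reading, a benchmark is more useful the closer `intrinsic_extrinsic_r` is to 1, since ranking models by the intrinsic score then approximates ranking them by downstream performance.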

CONCLUSION

Bio-SimVerb and Bio-SimLex enable intrinsic evaluation of word representations. This evaluation can serve as a predictor of performance on various downstream tasks in the biomedical domain. The results obtained on Bio-SimVerb and Bio-SimLex with standard word representation models highlight the importance of developing dedicated evaluation resources for biomedical NLP that target particular word classes (e.g. verbs). Such resources are needed to identify the most accurate methods for learning class-specific representations. Bio-SimVerb and Bio-SimLex are publicly available.
