Entity linking for biomedical literature.

Author information

Zheng Jin G, Howsmon Daniel, Zhang Boliang, Hahn Juergen, McGuinness Deborah, Hendler James, Ji Heng

Publication information

BMC Med Inform Decis Mak. 2015;15 Suppl 1(Suppl 1):S4. doi: 10.1186/1472-6947-15-S1-S4. Epub 2015 May 20.

Abstract

BACKGROUND

The Entity Linking (EL) task links entity mentions in an unstructured document to entities in a knowledge base. Although the problem is well studied for news and social media, it has received little attention in the life science domain. One benefit of tackling EL in the life science domain is that it would let scientists build computational models of biological processes more efficiently. However, simply applying a news-trained entity linker produces inadequate results.

METHODS

Because existing supervised approaches require a large amount of manually labeled training data, which is currently unavailable for the life science domain, we propose a novel unsupervised collective inference approach that links entities from unstructured full texts of biomedical literature to 300 ontologies. The approach leverages the rich semantic information and structure in the ontologies for similarity computation and entity ranking.
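The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of unsupervised collective linking, not the authors' method. It assumes a tiny hypothetical knowledge base (`KB`), candidate generation by exact name match, and Jaccard overlap of ontology neighborhoods as the similarity measure; all identifiers and entries are invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy knowledge base: entity id -> known names and related entity ids.
KB: Dict[str, dict] = {
    "gene:CAT": {"names": {"cat", "catalase"},
                 "related": {"chem:H2O2", "proc:oxidative_stress"}},
    "species:cat": {"names": {"cat", "felis catus"},
                    "related": {"taxon:Felidae"}},
    "chem:H2O2": {"names": {"hydrogen peroxide", "h2o2"},
                  "related": {"gene:CAT", "proc:oxidative_stress"}},
    "proc:oxidative_stress": {"names": {"oxidative stress"},
                              "related": {"gene:CAT", "chem:H2O2"}},
}


def candidates(mention: str) -> List[str]:
    """Return ids of all entities whose names match the mention (case-insensitive)."""
    key = mention.lower()
    return sorted(eid for eid, entry in KB.items() if key in entry["names"])


def relatedness(a: str, b: str) -> float:
    """Jaccard overlap of the two entities' neighborhoods (each including itself)."""
    na = KB[a]["related"] | {a}
    nb = KB[b]["related"] | {b}
    return len(na & nb) / len(na | nb)


def link(mentions: List[str]) -> Dict[str, Optional[str]]:
    """Pick, for each mention, the candidate best supported by the other mentions."""
    cands = {m: candidates(m) for m in mentions}
    result: Dict[str, Optional[str]] = {}
    for m, options in cands.items():
        def support(eid: str) -> float:
            # Sum, over every other mention, the best relatedness to any of its candidates.
            return sum(
                max((relatedness(eid, other) for other in cands[o]), default=0.0)
                for o in mentions if o != m
            )
        result[m] = max(options, key=support) if options else None
    return result


if __name__ == "__main__":
    print(link(["CAT", "hydrogen peroxide", "oxidative stress"]))
```

In this toy setting the ambiguous mention "CAT" resolves to the catalase gene rather than the animal, because only the gene shares ontology neighbors with the other mentions in the same document. No labeled training data is involved; the ranking relies entirely on knowledge base structure.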

RESULTS

Without using any manual annotation, our approach significantly outperforms the state-of-the-art supervised EL method (9% absolute gain in linking accuracy). Moreover, that supervised method requires 15,000 manually annotated entity mentions for training. These promising results establish a benchmark for the EL task in the life science domain. We also provide in-depth analysis and discussion of both the challenges and the opportunities in automatic knowledge enrichment for scientific literature.

CONCLUSIONS

In this paper, we propose a novel unsupervised collective inference approach to address the EL problem in a new domain. We show that our unsupervised approach is able to outperform a current state-of-the-art supervised approach that has been trained with a large amount of manually labeled data. Life science presents an underrepresented domain for applying EL techniques. By providing a small benchmark data set and identifying opportunities, we hope to stimulate discussions across natural language processing and bioinformatics and motivate others to develop techniques for this largely untapped domain.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6c6c/4460707/b0330a05d364/1472-6947-15-S1-S4-1.jpg
