Ambiguity and variability of database and software names in bioinformatics.

Author information

Duck Geraint, Kovacevic Aleksandar, Robertson David L, Stevens Robert, Nenadic Goran

Affiliations

School of Computer Science, The University of Manchester, Oxford Road, Manchester, M13 9PL UK.

Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia.

Publication information

J Biomed Semantics. 2015 Jun 29;6:29. doi: 10.1186/s13326-015-0026-0. eCollection 2015.

Abstract

BACKGROUND

There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

RESULTS

Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46%, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63% (with precision of 74%) and 70% (with precision of 83%) for strict and lenient matching, respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
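To make the difference between strict and lenient matching concrete, the following is a minimal sketch of mention-level scoring. It is not the authors' evaluation code. It assumes that mentions are represented as character-offset spans, that strict matching requires identical boundaries, and that lenient matching accepts any overlap; the abstract does not define either criterion, and the names `Span`, `overlaps` and `score` are hypothetical.

```python
from typing import Callable, List, Tuple

# A mention is assumed to be a (start, end) character span, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], pred: List[Span], lenient: bool = False) -> Tuple[float, float, float]:
    """Compute (precision, recall, F-score) for predicted mentions.

    Strict mode counts a prediction as correct only if its boundaries
    equal those of a gold mention; lenient mode (an assumption here)
    counts any overlap with a gold mention as correct.
    """
    match: Callable[[Span, Span], bool] = overlaps if lenient else (lambda a, b: a == b)
    # Predictions that match some gold mention (numerator for precision).
    correct_pred = sum(any(match(p, g) for g in gold) for p in pred)
    # Gold mentions found by some prediction (numerator for recall).
    found_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = correct_pred / len(pred) if pred else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: one exact hit, one boundary error, one spurious prediction.
gold_mentions = [(0, 5), (10, 20), (30, 35)]
predicted_mentions = [(0, 5), (12, 20), (40, 44)]

strict_result = score(gold_mentions, predicted_mentions, lenient=False)
lenient_result = score(gold_mentions, predicted_mentions, lenient=True)
```

In this toy example the boundary error at (12, 20) counts as a miss under strict matching but as a hit under lenient matching, which is the kind of gap that would separate the two sets of figures reported above.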

CONCLUSIONS

Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
