Ambiguity and variability of database and software names in bioinformatics.

Authors

Duck Geraint, Kovacevic Aleksandar, Robertson David L, Stevens Robert, Nenadic Goran

Affiliations

School of Computer Science, The University of Manchester, Oxford Road, Manchester, M13 9PL UK.

Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia.

Publication details

J Biomed Semantics. 2015 Jun 29;6:29. doi: 10.1186/s13326-015-0026-0. eCollection 2015.

DOI:10.1186/s13326-015-0026-0

PMID:26131352

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4485340/

Abstract

BACKGROUND

There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

RESULTS

Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
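The strict and lenient figures above depend on how a predicted mention is matched against a gold annotation. The following is a minimal sketch of such an evaluation, not the authors' code. It assumes that mentions are character-offset spans, that strict matching requires identical boundaries, and that lenient matching accepts any overlap. The function names (`overlaps`, `score`, `implied_recall`) and the example spans are hypothetical.

```python
from typing import List, Tuple

# Assumption: a mention is a (start, end) character span, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], pred: List[Span], lenient: bool = False):
    """Return (precision, recall, F-score) for predicted mention spans.

    Strict mode counts a prediction as correct only if its boundaries are
    identical to a gold span; lenient mode (an assumption here) accepts
    any overlap with a gold span.
    """
    def match(a: Span, b: Span) -> bool:
        return overlaps(a, b) if lenient else a == b

    # Predictions that hit some gold span, and gold spans hit by some prediction.
    pred_hits = sum(any(match(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(match(g, p) for p in pred) for g in gold)

    precision = pred_hits / len(pred) if pred else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def implied_recall(f_score: float, precision: float) -> float:
    """Solve F = 2PR / (P + R) for R, given F and P."""
    return f_score * precision / (2 * precision - f_score)


if __name__ == "__main__":
    # Hypothetical spans for illustration only.
    gold = [(0, 5), (10, 20), (30, 36)]
    pred = [(0, 5), (12, 20), (40, 44)]
    print("strict: ", score(gold, pred))
    print("lenient:", score(gold, pred, lenient=True))
    print("implied recall (strict): ", implied_recall(0.63, 0.74))
    print("implied recall (lenient):", implied_recall(0.70, 0.83))
```

In the toy data, the prediction `(12, 20)` misses the left boundary of the gold span `(10, 20)`, so it counts only under lenient matching. Applying `implied_recall` to the rounded figures reported above suggests a recall of roughly 55 % (strict) and 60 % (lenient). These values are derived from the rounded F-score and precision, not stated in the abstract.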

CONCLUSIONS

Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
