The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.

Authors

Pafilis Evangelos, Frankild Sune P, Fanini Lucia, Faulwetter Sarah, Pavloudi Christina, Vasileiadou Aikaterini, Arvanitidis Christos, Jensen Lars Juhl

Affiliation

Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece.

Publication information

PLoS One. 2013 Jun 18;8(6):e65390. doi: 10.1371/journal.pone.0065390. Print 2013.

DOI:10.1371/journal.pone.0065390

PMID:23823062

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3688812/

Abstract

The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and as accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
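To make the idea of dictionary-based named entity recognition concrete, here is a minimal sketch in Python. It is not the SPECIES algorithm or its implementation, which the abstract does not describe in detail. It only illustrates the general technique: look up token windows in a name dictionary and prefer the longest match. The dictionary `TOY_DICTIONARY`, the function `tag_names`, and the `max_tokens` parameter are hypothetical names introduced for illustration. The identifiers are NCBI Taxonomy IDs for a handful of example organisms.

```python
import re

# Toy dictionary (illustrative only): lowercased name -> NCBI Taxonomy ID.
TOY_DICTIONARY = {
    "escherichia coli": 562,
    "e. coli": 562,
    "homo sapiens": 9606,
    "human": 9606,
    "saccharomyces cerevisiae": 4932,
}

WORD = re.compile(r"[A-Za-z]+")


def tag_names(text, dictionary=TOY_DICTIONARY, max_tokens=3):
    """Return (start, end, matched_text, taxon_id) tuples.

    Greedy left-to-right scan: at each word, try the longest window of up to
    `max_tokens` words first, and skip past a window once it matches.
    """
    spans = [(m.start(), m.end()) for m in WORD.finditer(text)]
    matches = []
    i = 0
    while i < len(spans):
        matched = False
        for n in range(min(max_tokens, len(spans) - i), 0, -1):
            start, end = spans[i][0], spans[i + n - 1][1]
            # Normalize case and whitespace before the dictionary lookup.
            candidate = re.sub(r"\s+", " ", text[start:end]).lower()
            if candidate in dictionary:
                matches.append((start, end, text[start:end], dictionary[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return matches


if __name__ == "__main__":
    sentence = "Escherichia coli and E. coli were isolated from human samples."
    for start, end, name, taxid in tag_names(sentence):
        print(start, end, name, taxid)
```

A production tagger would also need to handle plural forms, ambiguous common names, and dictionaries with millions of entries, none of which this sketch attempts.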
