NetiNeti: discovery of scientific names from text using machine learning methods.

Affiliation

MBLWHOI Library, Marine Biological Laboratory, Woods Hole, MA, USA.

Publication information

BMC Bioinformatics. 2012 Aug 22;13:211. doi: 10.1186/1471-2105-13-211.

DOI:10.1186/1471-2105-13-211
PMID:22913485
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3542245/
Abstract

BACKGROUND

A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information.

RESULTS

We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central's full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages.
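The two-stage design described above (rule-based candidate generation, then classification on structural and contextual features) can be illustrated with a minimal Python sketch. The token rules, the feature names, and the helper functions `tokenize`, `generate_candidates`, and `extract_features` are assumptions made for illustration; they are not taken from NetiNeti, and the probabilistic classifier itself is left out.

```python
import re

# Hypothetical tokenizer: an abbreviated genus such as "E." is kept as one
# token when a lowercase word follows; otherwise words and punctuation split.
TOKEN = re.compile(r"[A-Z]\.(?=\s[a-z])|[A-Za-z]+|[^\sA-Za-z]")


def tokenize(text):
    return TOKEN.findall(text)


def is_genus(token):
    # Assumed rule: capitalized word ("Homo") or abbreviated genus ("H.").
    return re.fullmatch(r"[A-Z][a-z]+|[A-Z]\.", token) is not None


def is_epithet(token):
    # Assumed rule: all-lowercase word of at least three letters.
    return re.fullmatch(r"[a-z]{3,}", token) is not None


def generate_candidates(tokens):
    """Return (start_index, span) pairs shaped like binomials or trinomials."""
    found = []
    for i, token in enumerate(tokens):
        if not is_genus(token):
            continue
        for n in (2, 3):
            span = tuple(tokens[i:i + n])
            if len(span) == n and all(is_epithet(t) for t in span[1:]):
                found.append((i, span))
    return found


def extract_features(tokens, start, span):
    """Structural features of the candidate plus its neighboring tokens."""
    end = start + len(span)
    return {
        "n_words": len(span),
        "genus_abbreviated": span[0].endswith("."),
        "last_suffix": span[-1][-3:],
        "prev_token": tokens[start - 1].lower() if start > 0 else "<s>",
        "next_token": tokens[end].lower() if end < len(tokens) else "</s>",
    }


if __name__ == "__main__":
    tokens = tokenize("We isolated Escherichia coli and E. coli strains.")
    for start, span in generate_candidates(tokens):
        print(" ".join(span), extract_features(tokens, start, span))
```

The rules deliberately over-generate: besides "Escherichia coli" and "E. coli", the sample sentence also yields "We isolated" and "Escherichia coli and". Rejecting such false candidates is the job of the classification stage, which would consume feature dictionaries like the ones printed here.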

CONCLUSIONS

We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
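To compare the precision and recall pairs reported in the Results on a single scale, one option is the F1 score, the harmonic mean of the two. The abstract does not report F1; the short sketch below derives it from the published figures only, and the `f1_score` helper and the labels in `REPORTED` are illustrative names.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as stated in the abstract.
REPORTED = {
    "NetiNeti, biodiversity book": (0.989, 0.705),
    "Dictionary-based approach, biodiversity book": (0.975, 0.543),
    "NetiNeti, PubMed Central articles": (0.985, 0.962),
}

for system, (precision, recall) in REPORTED.items():
    print(f"{system}: F1 = {f1_score(precision, recall):.3f}")
```

On the biodiversity book this gives roughly 0.82 for NetiNeti against roughly 0.70 for the dictionary-based approach, a gap driven almost entirely by recall.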

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/14f4/3542245/93413d70cec3/1471-2105-13-211-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/14f4/3542245/562450bb1c42/1471-2105-13-211-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/14f4/3542245/9cd8ac7d45ee/1471-2105-13-211-2.jpg
