MBLWHOI Library, Marine Biological Laboratory, Woods Hole, MA, USA.
BMC Bioinformatics. 2012 Aug 22;13:211. doi: 10.1186/1471-2105-13-211.
A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information.
We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine-learning-based approach for recognizing scientific names in text. It can discover new species names and tolerates misspellings, OCR errors and other name variations. The system generates candidate names using rules for scientific names, then applies probabilistic machine learning methods to classify the candidates based on their structural features and on features derived from their contexts. NetiNeti can also use contextual information to disambiguate scientific names from other names. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). On a 600-page biodiversity book manually marked by an annotator, NetiNeti (precision = 98.9%, recall = 70.5%) outperforms a popular dictionary-based approach (precision = 97.5%, recall = 54.3%). On a small set of PubMed Central full-text articles annotated with scientific names, precision and recall are 98.5% and 96.2%, respectively. Run on the full MEDLINE database, NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages.
We present NetiNeti, a machine-learning-based approach for the identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
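The two-stage idea described above (rule-based candidate generation, then classification from structural and contextual features) can be illustrated with a minimal Python sketch. This is not the NetiNeti implementation: the token patterns, the function names (`tokenize`, `candidates`, `features`) and the particular features are assumptions chosen for illustration, and the classifier itself is left out. The rules deliberately over-generate, since rejecting false candidates is the classifier's job.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical tokenizer: keeps genus abbreviations such as "E." as one token,
# otherwise splits the text into runs of ASCII letters.
TOKEN_RE = re.compile(r"[A-Z]\.|[A-Za-z]+")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def looks_like_genus(tok: str) -> bool:
    # Assumed rule: a capitalized word ("Escherichia") or an abbreviation ("E.").
    return re.fullmatch(r"[A-Z][a-z]+|[A-Z]\.", tok) is not None


def looks_like_epithet(tok: str) -> bool:
    # Assumed rule: an all-lowercase word of at least two letters.
    return re.fullmatch(r"[a-z]{2,}", tok) is not None


def candidates(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans for binomial and trinomial candidates."""
    spans = []
    for i, tok in enumerate(tokens):
        if not looks_like_genus(tok):
            continue
        if i + 1 < len(tokens) and looks_like_epithet(tokens[i + 1]):
            spans.append((i, i + 2))  # binomial candidate
            if i + 2 < len(tokens) and looks_like_epithet(tokens[i + 2]):
                spans.append((i, i + 3))  # trinomial candidate
    return spans


def features(tokens: List[str], span: Tuple[int, int], window: int = 2) -> Dict[str, object]:
    """Structural features of the candidate plus its neighbouring context words."""
    start, end = span
    name = tokens[start:end]
    last = name[-1]
    feats: Dict[str, object] = {
        "n_tokens": len(name),
        "genus_abbreviated": name[0].endswith("."),
        "last_suffix3": last[-3:],  # e.g. Latin-looking endings
        "last_len": len(last),
    }
    for k in range(1, window + 1):
        left, right = start - k, end + k - 1
        feats[f"prev_{k}"] = tokens[left].lower() if left >= 0 else "<s>"
        feats[f"next_{k}"] = tokens[right].lower() if right < len(tokens) else "</s>"
    return feats


if __name__ == "__main__":
    toks = tokenize("We isolated Escherichia coli from the samples.")
    for span in candidates(toks):
        print(" ".join(toks[span[0]:span[1]]), features(toks, span))
```

In this sketch, ordinary phrases such as a capitalized sentence-initial word followed by a lowercase word are also emitted as candidates. A probabilistic classifier trained on labelled examples would take feature dictionaries like these and decide which candidates are scientific names.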