Disambiguating the species of biomedical named entities using natural language parsers.

Affiliation

National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK.

Publication information

Bioinformatics. 2010 Mar 1;26(5):661-7. doi: 10.1093/bioinformatics/btq002. Epub 2010 Jan 6.

DOI:10.1093/bioinformatics/btq002

PMID:20053840

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2828111/

Abstract

MOTIVATION

Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods utilizing natural language parsers.

RESULTS

We build a corpus for organism disambiguation in which every occurrence of a protein/gene entity is manually tagged with a species ID, and evaluate a number of methods on it. Promising results are obtained by training a machine learning model on syntactic parse trees, which is then used to decide whether an entity belongs to the model organism denoted by a neighbouring species-indicating word (e.g. yeast). The parser-based approaches are also compared with a supervised classification method, and the results indicate that the former are a more favorable choice when domain portability is of concern. The best overall performance is obtained by combining the strengths of syntactic features and supervised classification.
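To make the task concrete, the following is a minimal sketch of a simple proximity baseline, not the parser-based method reported here: each entity is assigned the species of the nearest species-indicating word in the sentence. The `SPECIES_WORDS` lexicon, the `nearest_species` function, and the example sentence are all hypothetical and serve illustration only; the numeric values are NCBI Taxonomy IDs.

```python
from typing import Dict, List, Optional

# Hypothetical lexicon: species-indicating words mapped to NCBI Taxonomy IDs.
SPECIES_WORDS: Dict[str, int] = {
    "yeast": 4932,    # Saccharomyces cerevisiae
    "human": 9606,    # Homo sapiens
    "mouse": 10090,   # Mus musculus
}


def nearest_species(tokens: List[str], entity_index: int) -> Optional[int]:
    """Return the taxonomy ID of the species word closest to the entity.

    Distance is measured in tokens; ties go to the earlier species word.
    Returns None when the sentence contains no known species word.
    """
    best_id: Optional[int] = None
    best_dist: Optional[int] = None
    for i, tok in enumerate(tokens):
        tax_id = SPECIES_WORDS.get(tok.lower())
        if tax_id is None:
            continue
        dist = abs(i - entity_index)
        if best_dist is None or dist < best_dist:
            best_id, best_dist = tax_id, dist
    return best_id


# Assumed example sentence; entity positions are given by token index.
sentence = "The yeast protein Cdc28 closely resembles human CDK1".split()
print(nearest_species(sentence, 3))  # entity "Cdc28"
print(nearest_species(sentence, 7))  # entity "CDK1"
```

Token distance ignores sentence structure, so a species word can be close to an entity without modifying it. A parser-based approach would instead rely on the syntactic relation between the entity and the species word.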

AVAILABILITY

The corpus and demo are available at http://www.nactem.ac.uk/deca_details/start.cgi, and the software is freely available as U-Compare components (Kano et al., 2009): NaCTeM Species Word Detector and NaCTeM Species Disambiguator. U-Compare is available at http://u-compare.org/.

Similar articles

Disambiguating the species of biomedical named entities using natural language parsers.

Bioinformatics. 2010 Mar 1;26(5):661-7. doi: 10.1093/bioinformatics/btq002. Epub 2010 Jan 6.

OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.

Bioinformatics. 2011 Oct 1;27(19):2721-9. doi: 10.1093/bioinformatics/btr452. Epub 2011 Aug 9.

SimConcept: a hybrid approach for simplifying composite named entities in biomedical text.

IEEE J Biomed Health Inform. 2015 Jul;19(4):1385-91. doi: 10.1109/JBHI.2015.2422651. Epub 2015 Apr 13.

Distinguishing the species of biomedical named entities for term identification.

BMC Bioinformatics. 2008 Nov 19;9 Suppl 11(Suppl 11):S6. doi: 10.1186/1471-2105-9-S11-S6.

Linking entities through an ontology using word embeddings and syntactic re-ranking.

BMC Bioinformatics. 2019 Mar 27;20(1):156. doi: 10.1186/s12859-019-2678-8.

Recognizing names in biomedical texts: a machine learning approach.

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Assessment of disease named entity recognition on a corpus of annotated sentences.

BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S3. doi: 10.1186/1471-2105-9-S3-S3.

NCBI disease corpus: a resource for disease name recognition and concept normalization.

J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.

Knowledge based word-concept model estimation and refinement for biomedical text mining.

J Biomed Inform. 2015 Feb;53:300-7. doi: 10.1016/j.jbi.2014.11.015. Epub 2014 Dec 12.

Knowledge-based biomedical word sense disambiguation: an evaluation and application to clinical document classification.

J Am Med Inform Assoc. 2013 Sep-Oct;20(5):882-6. doi: 10.1136/amiajnl-2012-001350. Epub 2012 Oct 16.

Cited by

Thalia: semantic search engine for biomedical abstracts.

Bioinformatics. 2019 May 15;35(10):1799-1801. doi: 10.1093/bioinformatics/bty871.

Transfer learning for biomedical named entity recognition with neural networks.

Bioinformatics. 2018 Dec 1;34(23):4087-4094. doi: 10.1093/bioinformatics/bty449.

Deep learning with word embeddings improves biomedical named entity recognition.

Bioinformatics. 2017 Jul 15;33(14):i37-i48. doi: 10.1093/bioinformatics/btx228.

Constructing a biodiversity terminological inventory.

PLoS One. 2017 Apr 17;12(4):e0175277. doi: 10.1371/journal.pone.0175277. eCollection 2017.

NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions.

Database (Oxford). 2016 Jul 27;2016. doi: 10.1093/database/baw111. Print 2016.

Text-mining-assisted biocuration workflows in Argo.

Database (Oxford). 2014 Jul 18;2014. doi: 10.1093/database/bau070. Print 2014.

Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources.

J Biomed Semantics. 2013 Oct 11;4(1):28. doi: 10.1186/2041-1480-4-28.

A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text.

Bioinformatics. 2013 Jul 1;29(13):i44-52. doi: 10.1093/bioinformatics/btt227.

NetiNeti: discovery of scientific names from text using machine learning methods.

BMC Bioinformatics. 2012 Aug 22;13:211. doi: 10.1186/1471-2105-13-211.

SR4GN: a species recognition software tool for gene normalization.

PLoS One. 2012;7(6):e38460. doi: 10.1371/journal.pone.0038460. Epub 2012 Jun 5.

References

U-Compare: share and compare text mining tools with UIMA.

Bioinformatics. 2009 Aug 1;25(15):1997-8. doi: 10.1093/bioinformatics/btp289. Epub 2009 May 4.

Evaluating contributions of natural language parsers to protein-protein interaction extraction.

Bioinformatics. 2009 Feb 1;25(3):394-400. doi: 10.1093/bioinformatics/btn631. Epub 2008 Dec 9.

Distinguishing the species of biomedical named entities for term identification.

BMC Bioinformatics. 2008 Nov 19;9 Suppl 11(Suppl 11):S6. doi: 10.1186/1471-2105-9-S11-S6.

Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge.

Genome Biol. 2008;9 Suppl 2(Suppl 2):S1. doi: 10.1186/gb-2008-9-s2-s1. Epub 2008 Sep 1.

Inter-species normalization of gene mentions with GNAT.

Bioinformatics. 2008 Aug 15;24(16):i126-132. doi: 10.1093/bioinformatics/btn299.

Assisted curation: does text mining really help?

Pac Symp Biocomput. 2008:556-67.

Text mining and its potential applications in systems biology.

Trends Biotechnol. 2006 Dec;24(12):571-9. doi: 10.1016/j.tibtech.2006.10.002. Epub 2006 Oct 12.

Biomedical language processing: what's beyond PubMed?

Mol Cell. 2006 Mar 3;21(5):589-94. doi: 10.1016/j.molcel.2006.02.012.

Data preparation and interannotator agreement: BioCreAtIvE task 1B.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S12. doi: 10.1186/1471-2105-6-S1-S12. Epub 2005 May 24.

Overview of BioCreAtIvE task 1B: normalized gene lists.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-6-S1-S11. Epub 2005 May 24.
