Normalizing biomedical terms by minimizing ambiguity and variability.

Authors

Tsuruoka Yoshimasa, McNaught John, Ananiadou Sophia

Affiliation

School of Computer Science, The University of Manchester, MIB, 131 Princess Street, Manchester, M1 7DN, UK.

Publication information

BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S2. doi: 10.1186/1471-2105-9-S3-S2.

DOI:10.1186/1471-2105-9-S3-S2

PMID:18426547

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2352870/

Abstract

BACKGROUND

One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherently heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and is therefore the bottleneck of the normalization approach.
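
As an illustration of the normalization approach described above, the following minimal Python sketch normalizes every dictionary term once, stores the result in a hash table, and then answers each query with a single normalization plus one hash lookup. The two rules used here (lowercasing, and deleting hyphens and whitespace), the function names, the toy entries and the concept IDs are all assumptions made for this example; they are not the rules or data from the paper.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Apply two hypothetical hand-written heuristics (not the paper's rules)."""
    term = term.lower()                 # ignore case differences
    return re.sub(r"[-\s]+", "", term)  # ignore hyphens and whitespace


def build_index(dictionary):
    """Map each normalized term to the set of concept IDs it can denote."""
    index = defaultdict(set)
    for term, concept_id in dictionary:
        index[normalize(term)].add(concept_id)
    return index


def lookup(index, mention):
    """One normalization plus one hash lookup, independent of dictionary size."""
    return index.get(normalize(mention), set())


# Toy dictionary; the concept IDs are invented for illustration.
toy_dictionary = [
    ("NF-kappa B", "C1"),
    ("IL-2", "C2"),
    ("interleukin 2", "C2"),
]
index = build_index(toy_dictionary)

print(lookup(index, "nf kappa b"))
print(lookup(index, "Interleukin-2"))
```

The cost of a query depends on the length of the mention and the number of rules, not on the number of dictionary entries, which is what makes this approach attractive compared with soft string matching against every entry.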

RESULTS

We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS.
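
The abstract does not spell out the discovery procedure, so the sketch below is only one plausible reading of "rules that minimize ambiguity and variability", not the authors' algorithm. It assumes that rules are plain substring replacements, that variability is the number of distinct (normalized string, concept) pairs, that ambiguity is the number of extra concepts sharing one normalized string, and that rules are chosen greedily from a fixed candidate list. The weight `alpha`, all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def apply_rules(term, rules):
    """Apply (old, new) substring replacements in order."""
    for old, new in rules:
        term = term.replace(old, new)
    return term


def cost(dictionary, rules, alpha=1.0):
    """Assumed objective: alpha * ambiguity + variability."""
    concepts_by_string = defaultdict(set)
    for term, concept_id in dictionary:
        concepts_by_string[apply_rules(term, rules)].add(concept_id)
    # Variability: distinct (normalized string, concept) pairs.
    variability = sum(len(c) for c in concepts_by_string.values())
    # Ambiguity: extra concepts that share a single normalized string.
    ambiguity = sum(len(c) - 1 for c in concepts_by_string.values())
    return alpha * ambiguity + variability


def discover_rules(dictionary, candidates, alpha=1.0):
    """Greedily add the candidate rule that lowers the cost the most."""
    rules, remaining = [], list(candidates)
    best = cost(dictionary, rules, alpha)
    while remaining:
        scored = [(cost(dictionary, rules + [r], alpha), r) for r in remaining]
        new_cost, rule = min(scored, key=lambda pair: pair[0])
        if new_cost >= best:  # stop when no candidate improves the cost
            break
        rules.append(rule)
        remaining.remove(rule)
        best = new_cost
    return rules


# Toy data; terms and concept IDs are invented for illustration.
toy_dictionary = [
    ("IL-2", "C1"), ("IL 2", "C1"), ("IL2", "C1"),
    ("IL-12", "C2"), ("IL12", "C2"),
]
candidates = [("-", ""), (" ", ""), ("1", "")]
rules = discover_rules(toy_dictionary, candidates)
print(rules)
```

On this toy input the hyphen and space deletions are accepted because they merge variants of the same concept, while deleting "1" is rejected because it would collapse the terms of two different concepts into one string and raise the ambiguity term.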

CONCLUSIONS

The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and that the computational overhead of rule application is small enough to allow a very fast implementation. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.
