Learning string similarity measures for gene/protein name dictionary look-up using logistic regression.

Authors

Tsuruoka Yoshimasa, McNaught John, Tsujii Jun'ichi, Ananiadou Sophia

Affiliation

School of Computer Science, The University of Manchester, Manchester, UK.

Publication information

Bioinformatics. 2007 Oct 15;23(20):2768-74. doi: 10.1093/bioinformatics/btm393. Epub 2007 Aug 12.

DOI:10.1093/bioinformatics/btm393

PMID:17698493

Abstract

MOTIVATION

One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed.

RESULTS

We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks.
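The idea of learning a similarity measure from a dictionary can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set (character-bigram Dice overlap, relative length difference, shared first character), the toy dictionary with made-up IDs, and the plain gradient-descent trainer are all assumptions chosen for brevity. Pairs of names that share an ID serve as positive examples and pairs with different IDs as negative examples; the fitted model's probability is then used as the similarity score for look-up.

```python
import math
from itertools import combinations


def bigrams(s):
    """Set of character bigrams of a lowercased string padded with ^ and $."""
    s = "^" + s.lower() + "$"
    return {s[i:i + 2] for i in range(len(s) - 1)}


def features(a, b):
    """Hypothetical feature vector for a name pair (not the paper's features)."""
    ga, gb = bigrams(a), bigrams(b)
    dice = 2 * len(ga & gb) / (len(ga) + len(gb))
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    same_first = 1.0 if a[:1].lower() == b[:1].lower() else 0.0
    return [1.0, dice, len_diff, same_first]  # leading 1.0 is the bias term


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    xs = [features(a, b) for a, b in pairs]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def similarity(w, a, b):
    """Learned similarity: modelled probability that a and b share an ID."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))


def look_up(w, query, dictionary):
    """Return the (name, id) entry whose name is most similar to the query."""
    return max(dictionary, key=lambda entry: similarity(w, query, entry[0]))


# Toy dictionary; the IDs are placeholders, not real accession numbers.
toy_dictionary = [
    ("NF-kappa B", "toy-1"), ("NF kappaB", "toy-1"),
    ("interleukin 2", "toy-2"), ("interleukin-2", "toy-2"),
    ("tumor necrosis factor", "toy-3"), ("tumour necrosis factor", "toy-3"),
]

# Training pairs come from the dictionary itself: same ID -> 1, otherwise 0.
pairs, labels = [], []
for (name_a, id_a), (name_b, id_b) in combinations(toy_dictionary, 2):
    pairs.append((name_a, name_b))
    labels.append(1 if id_a == id_b else 0)

w = train(pairs, labels)
best = look_up(w, "interleukin2", toy_dictionary)
print(best)
```

A realistic system would train on a far larger dictionary and would need an index to avoid scoring every entry for each query; the exhaustive `max` over the dictionary is used here only to keep the sketch short.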

AVAILABILITY

A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.

Similar articles

Learning string similarity measures for gene/protein name dictionary look-up using logistic regression.

Bioinformatics. 2007 Oct 15;23(20):2768-74. doi: 10.1093/bioinformatics/btm393. Epub 2007 Aug 12.

Building a protein name dictionary from full text: a machine learning term extraction approach.

BMC Bioinformatics. 2005 Apr 7;6:88. doi: 10.1186/1471-2105-6-88.

Recognizing names in biomedical texts: a machine learning approach.

Bioinformatics. 2004 May 1;20(7):1178-90. doi: 10.1093/bioinformatics/bth060. Epub 2004 Feb 10.

Improving the performance of dictionary-based approaches in protein name recognition.

J Biomed Inform. 2004 Dec;37(6):461-70. doi: 10.1016/j.jbi.2004.08.003.

Normalizing biomedical terms by minimizing ambiguity and variability.

BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S2. doi: 10.1186/1471-2105-9-S3-S2.

Gene name ambiguity of eukaryotic nomenclatures.

Bioinformatics. 2005 Jan 15;21(2):248-56. doi: 10.1093/bioinformatics/bth496. Epub 2004 Aug 27.

Discriminative application of string similarity methods to chemical and non-chemical names for biomedical abbreviation clustering.

BMC Genomics. 2012 Jun 11;13 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2164-13-S3-S8.

Two learning approaches for protein name extraction.

J Biomed Inform. 2009 Dec;42(6):1046-55. doi: 10.1016/j.jbi.2009.05.004. Epub 2009 May 13.

A probabilistic model for identifying protein names and their name boundaries.

Proc IEEE Comput Soc Bioinform Conf. 2003;2:251-8.

Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.

Bioinformatics. 2003 Jul 1;19(10):1275-83. doi: 10.1093/bioinformatics/btg153.

Cited by

A Multitask Deep Learning Framework for DNER.

Comput Intell Neurosci. 2022 Apr 16;2022:3321296. doi: 10.1155/2022/3321296. eCollection 2022.

C-Norm: a neural approach to few-shot entity normalization.

BMC Bioinformatics. 2020 Dec 29;21(Suppl 23):579. doi: 10.1186/s12859-020-03886-8.

Hybrid Semantic Analysis for Mapping Adverse Drug Reaction Mentions in Tweets to Medical Terminology.

AMIA Annu Symp Proc. 2018 Apr 16;2017:679-688. eCollection 2017.

Sorting Through the Safety Data Haystack: Using Machine Learning to Identify Individual Case Safety Reports in Social-Digital Media.

Drug Saf. 2018 Jun;41(6):579-590. doi: 10.1007/s40264-018-0641-7.

Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text.

Yearb Med Inform. 2017 Aug;26(1):214-227. doi: 10.15265/IY-2017-029. Epub 2017 Sep 11.

Constructing a biodiversity terminological inventory.

PLoS One. 2017 Apr 17;12(4):e0175277. doi: 10.1371/journal.pone.0175277. eCollection 2017.

TaggerOne: joint named entity recognition and normalization with semi-Markov Models.

Bioinformatics. 2016 Sep 15;32(18):2839-46. doi: 10.1093/bioinformatics/btw343. Epub 2016 Jun 9.

Text Mining the History of Medicine.

PLoS One. 2016 Jan 6;11(1):e0144717. doi: 10.1371/journal.pone.0144717. eCollection 2016.

KneeTex: an ontology-driven system for information extraction from MRI reports.

J Biomed Semantics. 2015 Sep 7;6:34. doi: 10.1186/s13326-015-0033-1. eCollection 2015.

PathNER: a tool for systematic identification of biological pathway mentions in the literature.

BMC Syst Biol. 2013 Oct 16;7 Suppl 3(Suppl 3):S2. doi: 10.1186/1752-0509-7-S3-S2.
