AZuRE, a scalable system for automated term disambiguation of gene and protein names.

Authors

Podowski Raf M, Cleary John G, Goncharoff Nicholas T, Amoutzias Gregory, Hayes William S

Affiliations

AstraZeneca R&D Boston and Karolinska Institutet.

Publication information

Proc IEEE Comput Syst Bioinform Conf. 2004:415-24. doi: 10.1109/csb.2004.1332454.

DOI:10.1109/csb.2004.1332454

PMID:16448034

Abstract

Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described that automatically assigns gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good-quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that high-quality gene disambiguation can be achieved using scalable automated techniques.
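The quality cut-off in the abstract can be illustrated with a small sketch. It assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function names and the per-model counts passed to them are hypothetical; only the threshold of 0.7 and the gene totals come from the abstract.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true-positive, false-positive and
    false-negative counts. Assumption: the abstract's F-measure is F1."""
    if tp == 0:
        return 0.0  # no correct assignments, so precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


GOOD_THRESHOLD = 0.7  # cut-off quoted in the abstract


def is_good_model(tp: int, fp: int, fn: int,
                  threshold: float = GOOD_THRESHOLD) -> bool:
    """Hypothetical helper: does a per-LLID model pass the quality cut-off?"""
    return f_measure(tp, fp, fn) > threshold


# Totals reported in the abstract.
total_genes = 20546
good_models = 7344
poor_models = 13202

# The two groups account for every validated gene.
assert good_models + poor_models == total_genes

# Share of human genes with a good-quality model.
good_share = good_models / total_genes
print(f"{good_share:.1%} of genes produced good-quality models")
```

Read this way, roughly a third of the validated genes had enough known document references to yield a model above the cut-off.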

Similar articles

AZuRE, a scalable system for automated term disambiguation of gene and protein names.

Proc IEEE Comput Syst Bioinform Conf. 2004:415-24. doi: 10.1109/csb.2004.1332454.

Suregene, a scalable system for automated term disambiguation of gene and protein names.

J Bioinform Comput Biol. 2005 Jun;3(3):743-70. doi: 10.1142/s0219720005001223.

GPSDB: a new database for synonyms expansion of gene and protein names.

Bioinformatics. 2005 Apr 15;21(8):1743-4. doi: 10.1093/bioinformatics/bti235. Epub 2004 Dec 21.

GeneInfoMiner--a web server for exploring biomedical literature using batch sequence ID.

Bioinformatics. 2005 Aug 15;21(16):3452-3. doi: 10.1093/bioinformatics/bti559. Epub 2005 Jun 30.

Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection.

AMIA Annu Symp Proc. 2003;2003:704-8.

Building a protein name dictionary from full text: a machine learning term extraction approach.

BMC Bioinformatics. 2005 Apr 7;6:88. doi: 10.1186/1471-2105-6-88.

Gene symbol disambiguation using knowledge-based profiles.

Bioinformatics. 2007 Apr 15;23(8):1015-22. doi: 10.1093/bioinformatics/btm056. Epub 2007 Feb 21.

ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text.

Bioinformatics. 2005 Jul 15;21(14):3191-2. doi: 10.1093/bioinformatics/bti475. Epub 2005 Apr 28.

Thesaurus-based disambiguation of gene symbols.

BMC Bioinformatics. 2005 Jun 16;6:149. doi: 10.1186/1471-2105-6-149.

BioIE: extracting informative sentences from the biomedical literature.

Bioinformatics. 2005 May 1;21(9):2138-9. doi: 10.1093/bioinformatics/bti296. Epub 2005 Feb 2.

Cited by

Identifying the status of genetic lesions in cancer clinical trial documents using machine learning.

BMC Genomics. 2012;13 Suppl 8(Suppl 8):S21. doi: 10.1186/1471-2164-13-S8-S21. Epub 2012 Dec 17.

Disclosing ambiguous gene aliases by automatic literature profiling.

BMC Genomics. 2010 Dec 22;11 Suppl 5(Suppl 5):S3. doi: 10.1186/1471-2164-11-S5-S3.

Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model.

BMC Bioinformatics. 2010 May 20;11:272. doi: 10.1186/1471-2105-11-272.

The strength of co-authorship in gene name disambiguation.

BMC Bioinformatics. 2008 Jan 29;9:69. doi: 10.1186/1471-2105-9-69.

Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues.

BMC Bioinformatics. 2006 Jul 5;7:334. doi: 10.1186/1471-2105-7-334.

Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation.

BMC Bioinformatics. 2005 Jun 22;6:157. doi: 10.1186/1471-2105-6-157.

Thesaurus-based disambiguation of gene symbols.

BMC Bioinformatics. 2005 Jun 16;6:149. doi: 10.1186/1471-2105-6-149.
