NLProt: extracting protein names and sequences from papers.

Authors

Mika Sven, Rost Burkhard

Affiliation

CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.

Publication

Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W634-7. doi: 10.1093/nar/gkh427.

Abstract

Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
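
The strict criterion mentioned in the abstract, where a partially tagged name counts as an error, can be illustrated with a minimal sketch. This is not NLProt's evaluation code: the function name `strict_precision_recall`, the character-offset span representation, and the toy data are all assumptions made for illustration.

```python
from typing import Set, Tuple

# A tagged name is represented here as (start, end) character offsets.
# This representation is an assumption, not taken from the paper.
Span = Tuple[int, int]


def strict_precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Score tagged spans, counting only exact boundary matches as correct.

    A prediction that overlaps a gold name without matching both boundaries
    is treated as a false positive, and the gold name it overlaps remains
    a false negative.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four annotated names, four predictions.
gold = {(0, 4), (10, 18), (25, 31), (40, 52)}
predicted = {(0, 4), (10, 15), (25, 31), (60, 66)}  # (10, 15) is only a partial match

precision, recall = strict_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four predictions match exactly, so both precision and recall come out to 0.5. Under a lenient criterion that accepts overlapping spans, the partial match `(10, 15)` would also count, which is why strict scoring yields the more conservative figures.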
