Mika Sven, Rost Burkhard
CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W634-7. doi: 10.1093/nar/gkh427.
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
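The strict scoring criterion mentioned in the abstract, under which a partially tagged name counts as an error, can be illustrated with a minimal sketch. This is not NLProt's evaluation code: the representation of a tagged name as a (start, end) character-offset pair, the function name `strict_precision_recall`, and the toy data are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: a tagged name is a (start, end) character span.
Span = Tuple[int, int]


def strict_precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Score tagged names, counting only exact boundary matches as correct.

    A prediction that overlaps a gold name but has different boundaries
    (a partially tagged name) is treated as an error: it lowers precision
    and the gold name it missed lowers recall.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (invented): four gold names, four predictions.
gold = {(0, 4), (10, 18), (25, 30), (40, 47)}
predicted = {
    (0, 4),    # exact match
    (10, 15),  # partial match: counted as an error
    (25, 30),  # exact match
    (50, 55),  # spurious tag
}

precision, recall = strict_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four predictions match exactly, so both precision and recall are 0.50. A more lenient criterion that accepted overlapping spans would credit the partial match as well and report higher values for the same output.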