Nair Rajesh, Rost Burkhard
CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
Bioinformatics. 2002;18 Suppl 1:S78-86. doi: 10.1093/bioinformatics/18.suppl_1.s78.
The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is available for only a few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.
The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for fewer than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.
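The abstract does not describe how LOCkey maps keywords to localizations. As an illustration of the general idea only, and not the published algorithm, the sketch below lets each keyword vote for the localizations it co-occurred with in annotated proteins, and abstains when the evidence is absent or too mixed. The function names `train` and `predict`, the `min_share` threshold, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict

def train(annotated):
    """Count how often each keyword co-occurs with each localization.

    `annotated` is an iterable of (keywords, localization) pairs.
    """
    counts = defaultdict(Counter)
    for keywords, localization in annotated:
        for kw in set(keywords):  # count each keyword once per protein
            counts[kw][localization] += 1
    return counts

def predict(counts, keywords, min_share=0.5):
    """Return the localization with the most keyword votes, or None.

    Abstains (returns None) when no keyword is known or when the top
    localization holds less than `min_share` of all votes.
    `min_share` is an assumed, illustrative threshold.
    """
    votes = Counter()
    for kw in set(keywords):
        votes.update(counts.get(kw, {}))  # unknown keywords add nothing
    if not votes:
        return None
    label, top = votes.most_common(1)[0]
    return label if top / sum(votes.values()) >= min_share else None

# Toy, invented records for illustration; not taken from SWISS-PROT.
TOY = [
    (["Transcription regulation", "DNA-binding"], "nuclear"),
    (["DNA-binding", "Zinc-finger"], "nuclear"),
    (["Glycolysis"], "cytoplasmic"),
    (["Signal", "Hormone"], "extracellular"),
]

model = train(TOY)
print(predict(model, ["DNA-binding", "Zinc-finger"]))
print(predict(model, ["Kinase"]))
```

Returning `None` for proteins with no informative keywords mirrors the abstract's point that localization could be inferred for fewer than half of all SWISS-PROT proteins: a keyword-based method has nothing to work with when functional annotation is missing.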