Shen Hong-Bin, Chou Kuo-Chen
Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 1954 Hua-Shan Road, Shanghai 200030, China.
Protein Eng Des Sel. 2007 Jan;20(1):39-46. doi: 10.1093/protein/gzl053. Epub 2007 Jan 23.
A statistical analysis indicated that, of the 35,016 Gram-positive bacterial proteins in the recent Swiss-Prot database, approximately 57% have no subcellular location annotation. In the gene ontology database the corresponding figure is even higher: approximately 67% of the proteins have no subcellular component annotation. With the avalanche of gene products generated in the post-genomic era, the number of such location-unknown entries will continue to increase. An automated method for identifying their subcellular localization in a timely and accurate manner is therefore highly desirable, because the information thus obtained is very useful for both basic research and drug discovery. In view of this, an ensemble classifier called 'Gpos-PLoc' was developed for predicting Gram-positive protein subcellular localization. The new predictor fuses many basic classifiers, each of which was engineered according to the optimized evidence-theoretic K-nearest neighbors rule. As a demonstration, tests were performed on Gram-positive proteins among the following five subcellular location sites: (1) cell wall, (2) cytoplasm, (3) extracell, (4) periplasm and (5) plasma membrane. To eliminate redundancy and homology bias, only proteins with less than 25% sequence identity to any other protein in the same subcellular location were included in the benchmark datasets. The overall success rates achieved by Gpos-PLoc were above 80% for both the jackknife cross-validation test and the independent dataset test, implying that Gpos-PLoc might become a very useful vehicle for expediting the analysis of Gram-positive bacterial proteins. Gpos-PLoc is freely accessible to the public as a web-server at http://202.120.37.186/bioinf/Gpos/.
To support the needs of investigators in the relevant areas, a downloadable file is provided at the same website listing the results identified by Gpos-PLoc for 31,898 Gram-positive bacterial protein entries in the Swiss-Prot database that either have no subcellular location annotation or are annotated with uncertain terms such as 'probable', 'potential', 'perhaps' and 'by similarity'. These large-scale results will be updated once a year to include new entries of Gram-positive bacterial proteins and to reflect the continuous development of Gpos-PLoc.
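The two ideas named in the abstract, fusing several basic K-nearest-neighbor classifiers and scoring the result with a jackknife (leave-one-out) test, can be illustrated with a minimal sketch. This is not the Gpos-PLoc implementation: it uses plain majority-vote KNN in place of the optimized evidence-theoretic rule, fuses classifiers that differ only in K, and runs on invented 2-D feature vectors. The function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Plain majority-vote KNN (a stand-in, NOT the evidence-theoretic rule).

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def ensemble_predict(train, query, ks=(1, 3, 5)):
    """Fuse several basic KNN classifiers (one per value of k) by voting."""
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]


def jackknife_accuracy(data, ks=(1, 3, 5)):
    """Leave-one-out test: predict each sample from all remaining samples."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if ensemble_predict(rest, features, ks) == label:
            correct += 1
    return correct / len(data)


# Toy feature vectors, invented for illustration; real predictors would
# derive features from the protein sequences themselves.
toy = [
    ((0.0, 0.0), "cytoplasm"), ((0.1, 0.2), "cytoplasm"),
    ((0.2, 0.1), "cytoplasm"), ((0.2, 0.2), "cytoplasm"),
    ((5.0, 5.0), "cell wall"), ((5.1, 5.2), "cell wall"),
    ((5.2, 5.1), "cell wall"), ((5.2, 5.2), "cell wall"),
    ((0.0, 5.0), "extracell"), ((0.1, 5.2), "extracell"),
    ((0.2, 5.1), "extracell"), ((0.2, 5.2), "extracell"),
]

if __name__ == "__main__":
    print("jackknife success rate:", jackknife_accuracy(toy))
```

In the jackknife test each protein is held out in turn and predicted from all the others, so every sample is tested exactly once on a model that never saw it. On a benchmark already filtered for sequence identity, this gives a less optimistic estimate than testing on the training data.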