Huang Ying, Li Yanda
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Institute of Bioinformatics, Tsinghua University, Beijing 100084, People's Republic of China.
Bioinformatics. 2004 Jan 1;20(1):21-8. doi: 10.1093/bioinformatics/btg366.
Protein localization data are a valuable information resource helpful in elucidating protein functions. It is highly desirable to predict a protein's subcellular locations automatically from its sequence.
In this paper, the fuzzy k-nearest neighbors (k-NN) algorithm is introduced to predict proteins' subcellular locations from their dipeptide composition. Prediction is performed on a new data set derived from version 41.0 of the SWISS-PROT databank, and an overall predictive accuracy of about 80% is achieved in a jackknife test. The result demonstrates the applicability of this relatively simple method and indicates that prediction accuracy for protein subcellular locations can still be improved. We also applied the method to annotate six proteomes: the entirely sequenced proteomes of Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Oryza sativa and Arabidopsis thaliana, and a subset of all human proteins.
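The approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the value of k, the fuzzifier, the distance metric or the way training memberships were assigned. The sketch assumes Euclidean distance, crisp training labels, and the commonly used inverse-distance membership weighting with fuzzifier `m`. All function names (`dipeptide_composition`, `fuzzy_knn`, `jackknife_accuracy`) and default parameters are hypothetical.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(seq):
    """Return the 400-dimensional vector of dipeptide frequencies of seq."""
    counts = [0.0] * len(DIPEPTIDES)
    total = 0
    for a, b in zip(seq, seq[1:]):
        i = INDEX.get(a + b)
        if i is not None:  # skip pairs containing non-standard residues
            counts[i] += 1
            total += 1
    return [c / total for c in counts] if total else counts


def fuzzy_knn(query, train_vecs, train_labels, k=3, m=2.0):
    """Return {label: membership} for query from its k nearest neighbours.

    Assumption: crisp training labels and weights 1 / d^(2 / (m - 1)).
    """
    nearest = sorted(
        (sqrt(sum((q - t) ** 2 for q, t in zip(query, vec))), label)
        for vec, label in zip(train_vecs, train_labels)
    )[:k]
    eps = 1e-12  # avoids division by zero for identical vectors
    weights = [(1.0 / (d + eps) ** (2.0 / (m - 1.0)), label)
               for d, label in nearest]
    norm = sum(w for w, _ in weights)
    memberships = {}
    for w, label in weights:
        memberships[label] = memberships.get(label, 0.0) + w / norm
    return memberships


def jackknife_accuracy(vecs, labels, k=3, m=2.0):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(vecs)):
        rest_vecs = vecs[:i] + vecs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        mem = fuzzy_knn(vecs[i], rest_vecs, rest_labels, k, m)
        if max(mem, key=mem.get) == labels[i]:
            correct += 1
    return correct / len(vecs)
```

In use, each protein sequence would first be converted with `dipeptide_composition`, and the location with the highest membership returned by `fuzzy_knn` would be taken as the prediction. Unlike a plain majority vote, the membership values also give a rough indication of how confident each assignment is.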
Supplementary information and subcellular location annotations for eukaryotes are available at http://166.111.30.65/hying/fuzzy_loc.htm