Park Keun-Joon, Kanehisa Minoru
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan.
Bioinformatics. 2003 Sep 1;19(13):1656-63. doi: 10.1093/bioinformatics/btg222.
The subcellular location of a protein is closely correlated with its function. Computational prediction of subcellular location from the amino acid sequence would therefore help the annotation and functional prediction of protein-coding genes in complete genomes. We have developed a prediction method based on support vector machines (SVMs).
We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein from its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results from 5-fold cross-validation tests showed an improvement in prediction accuracy over an algorithm based on amino acid composition alone. This prediction method is available via the Internet.
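The three composition types and the voting step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the restriction to the 20 standard residues, the gap parameter, and the plain majority vote are all assumptions made for illustration, since the abstract does not specify the gap sizes or the exact voting rule. SVM training is omitted; each feature vector would be passed to a separately trained classifier.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard residues are counted; others are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Fraction of each standard residue in the sequence (20 features)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues) or 1  # avoid division by zero on empty input
    return [counts[a] / total for a in AMINO_ACIDS]


def gapped_pair_composition(seq, gap=0):
    """Fraction of each ordered residue pair with `gap` residues in between
    (400 features). gap=0 gives the ordinary adjacent-pair composition."""
    seq = seq.upper()
    step = gap + 1
    pairs = [
        seq[i] + seq[i + step]
        for i in range(len(seq) - step)
        if seq[i] in AMINO_ACIDS and seq[i + step] in AMINO_ACIDS
    ]
    counts = Counter(pairs)
    total = len(pairs) or 1
    return [counts[p] / total for p in PAIRS]


def majority_vote(predictions):
    """Hypothetical voting rule: return the most frequent predicted label.
    Ties go to the label that appears first in `predictions`."""
    return Counter(predictions).most_common(1)[0][0]
```

In use, one classifier per composition type would each emit a location label for a query protein, and `majority_vote` would combine them, for example `majority_vote(["nucleus", "cytoplasm", "nucleus"])`.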