Matsuda Setsuro, Vert Jean-Philippe, Saigo Hiroto, Ueda Nobuhisa, Toh Hiroyuki, Akutsu Tatsuya
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan.
Protein Sci. 2005 Nov;14(11):2804-13. doi: 10.1110/ps.051597405.
As the number of complete genomes rapidly increases, accurate methods for automatically predicting the subcellular location of proteins are increasingly useful for functional annotation. To improve on the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation combines local compositions of amino acids and twin amino acids with local frequencies of the distance between successive basic, hydrophobic, and other amino acids. To calculate the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to account for ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. In fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. We conclude that considering the respective features of the N-terminal, middle, and C-terminal parts helps predict subcellular location.
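The local amino acid composition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The segment lengths (`n_len=40`, `c_len=20`), the equal-width division of the N-terminal part into four regions, and all function names are assumptions made for illustration; the abstract does not give these details. The twin amino acid compositions, the distance frequencies, and the support vector machine step are omitted.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_sequence(seq, n_len=40, c_len=20):
    """Split a sequence into N-terminal, middle, and C-terminal parts.

    n_len and c_len are illustrative assumptions, not values from the paper.
    """
    n_len = min(n_len, len(seq))
    c_len = min(c_len, len(seq) - n_len)
    n_term = seq[:n_len]
    middle = seq[n_len:len(seq) - c_len]
    c_term = seq[len(seq) - c_len:]
    return n_term, middle, c_term


def composition(segment):
    """Return the relative frequency of each standard amino acid in a segment."""
    counts = Counter(segment)
    total = len(segment)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def local_composition_features(seq, n_len=40, c_len=20, n_regions=4):
    """Concatenate per-segment compositions into one feature vector.

    The N-terminal part is cut into n_regions equal-width regions
    (an assumption; the paper may divide it differently).
    """
    n_term, middle, c_term = split_sequence(seq, n_len, c_len)
    step = max(1, -(-len(n_term) // n_regions))  # ceiling division
    regions = [n_term[i * step:(i + 1) * step] for i in range(n_regions)]
    features = []
    for part in regions + [middle, c_term]:
        features.extend(composition(part))
    return features
```

With these defaults, each sequence yields 6 segments x 20 amino acids = 120 features, which could then be passed to any standard SVM implementation.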