Feng Z P
Department of Physics, Tianjin University, Tianjin 300072, China.
Biopolymers. 2001 Apr 15;58(5):491-9. doi: 10.1002/1097-0282(20010415)58:5<491::AID-BIP1024>3.0.CO;2-I.
A new representation of protein sequence is presented in this paper, in which each protein is represented by a 20-dimensional (20D) vector of unit length. Inspired by the principle of superposition of states in quantum mechanics, the squares of the 20 components of the vector correspond to the amino acid composition. Using this representation of the primary sequence together with the Bayes Discriminant Algorithm, the subcellular location of prokaryotic proteins was predicted. In the jackknife test, the overall predictive accuracy can be 3% higher than that obtained by using amino acid composition directly for the database in which sequence identity is less than 90%, and 5% higher when sequence identity is less than 80%. The higher predictive accuracy indicates that the present way of extracting information from the primary sequence is efficient. Since subcellular location restricts a protein's possible function, the present method should also be useful for the systematic analysis of genome data. The program used in this paper is available on request.
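The representation described above can be illustrated with a minimal sketch. It is not the authors' program (which is available only on request). It assumes that each component is the non-negative square root of the corresponding amino acid frequency, so that the squared components reproduce the composition and sum to 1. The residue ordering in `AMINO_ACIDS`, the function names, and the choice to skip non-standard residues are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes);
# the abstract does not specify an ordering.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20 amino acid frequencies of a protein sequence.

    Characters outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def unit_vector(sequence):
    """Return a 20D vector whose squared components equal the composition.

    Taking the non-negative square root is an assumption; the abstract only
    states that the squares correspond to the amino acid composition.
    """
    return [math.sqrt(f) for f in composition(sequence)]


if __name__ == "__main__":
    vec = unit_vector("MKTAYIAKQRQISFVKSHFSRQ")  # arbitrary example sequence
    norm = math.sqrt(sum(x * x for x in vec))
    print(len(vec), round(norm, 6))
```

Because the frequencies sum to 1, the resulting vector has unit Euclidean length by construction. Such vectors could then be passed to a classifier in place of the raw composition.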