Novic M, Randic M
National Institute of Chemistry, Hajdrihova, Ljubljana, Slovenia.
SAR QSAR Environ Res. 2008 Apr-Jun;19(3-4):317-37. doi: 10.1080/10629360802085066.
A novel representation of proteins is introduced that is independent of arbitrary decisions about which labels to assign to the 20 natural amino acids. The approach assigns each of the 20 natural amino acids one of the 20 unit vectors of a 20-dimensional vector space. A protein is then represented by a walk, that is, a sequence of steps in the 20-dimensional space, analogous to the walk in the (x, y) plane used for binary strings. A straightforward numerical characterization of a protein is obtained from the distance matrix associated with this walk, which combines information on the Euclidean distances between the amino acids in the protein sequence. The Line Distance matrix offers an additional numerical characterization of proteins, while the lengths of the steps of the walk in 20-D space allow construction of a "protein profile," which represents the distribution of the average lengths of the steps and their powers.
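The walk and its distance matrix can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the walk starts at the origin, each residue adds its unit vector to the current position, and the distance matrix holds the Euclidean distances between all visited points. The names `AMINO_ACIDS`, `walk_points`, and `distance_matrix` are hypothetical and are not taken from the paper.

```python
import math

# Assumption: one-letter codes of the 20 natural amino acids.
# The order only decides which axis each amino acid gets.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def walk_points(sequence, alphabet=AMINO_ACIDS):
    """Return the points visited by the walk, starting from the origin."""
    axis = {aa: i for i, aa in enumerate(alphabet)}
    position = [0] * len(alphabet)
    points = [tuple(position)]
    for aa in sequence:
        position[axis[aa]] += 1  # one unit step along that amino acid's axis
        points.append(tuple(position))
    return points


def distance_matrix(points):
    """Euclidean distances between every pair of walk points."""
    return [[math.dist(p, q) for q in points] for p in points]


if __name__ == "__main__":
    matrix = distance_matrix(walk_points("ACA"))
    for row in matrix:
        print(" ".join(f"{value:.3f}" for value in row))
```

Under these assumptions, reordering `alphabet` only permutes the coordinate axes, so the distance matrix is unchanged. This is one way to read the claim that the representation does not depend on how the amino acids are labeled.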