Wang Yi, Tian Kun, Yau Stephen S-T
Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China.
J Comput Biol. 2019 Apr;26(4):315-321. doi: 10.1089/cmb.2018.0216. Epub 2019 Feb 14.
Protein kinase C (PKC) is a superfamily of enzymes that regulate numerous cellular responses. The specific function of the PKC protein family is mainly governed by its individual protein domains. However, existing protein sequence classification methods, whether based on sequence alignment or on sequence analysis models, pay little attention to domain analysis. In this study, we introduce a novel protein kinase classification method that considers both domain sequence similarity and whole-sequence similarity to quantify the evolutionary distance from a specific protein to a protein family. Using the natural vector method, we establish a 60-dimensional space in which each protein is uniquely represented by a vector. We also define, for each protein family, the convex hull of the natural vectors corresponding to all members of that family. The sequence similarity between a protein and a protein family can therefore be quantified as the distance between the protein vector and the protein family convex hull. We applied this method to a PKC sample library, and the results showed higher classification accuracy than other alignment-free methods.
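The two geometric ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the components of the natural vector, so the code assumes the usual protein formulation: for each of the 20 standard amino acids, its count, its mean position, and its normalized second central moment, giving 3 × 20 = 60 dimensions. The 1-based position convention, the ordering of the components, and the function names `natural_vector` and `distance_to_hull` are assumptions made for illustration. The distance to the convex hull is approximated here with a Frank-Wolfe iteration, which is a swapped-in generic technique and not necessarily the solver used by the authors. The combination of domain-level and whole-sequence distances is not shown, because the abstract gives no details on it.

```python
import math

# The 20 standard amino acids, in an arbitrary but fixed order (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(seq):
    """Return a 60-dimensional vector for a protein sequence.

    Layout (assumed): 20 counts, then 20 mean positions, then 20
    normalized second central moments. Positions are 1-based.
    Residues outside AMINO_ACIDS are ignored.
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, c in enumerate(seq) if c == aa]
        n_k = len(positions)
        if n_k == 0:
            # Absent amino acid: all three components are set to zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def distance_to_hull(point, vertices, iterations=2000):
    """Approximate the Euclidean distance from `point` to the convex hull
    of `vertices` using Frank-Wolfe with exact line search."""
    x = list(vertices[0])  # start from an arbitrary hull vertex
    for _ in range(iterations):
        # Gradient of 0.5 * ||x - point||^2 at the current iterate.
        grad = [xi - pi for xi, pi in zip(x, point)]
        # Hull vertex that minimizes the linearized objective.
        s = min(vertices, key=lambda v: sum(g * vi for g, vi in zip(grad, v)))
        d = [si - xi for si, xi in zip(s, x)]
        dd = sum(di * di for di in d)
        if dd == 0.0:
            break  # no vertex improves on x: x is optimal
        gamma = -sum(g * di for g, di in zip(grad, d)) / dd
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break  # no descent direction left inside the hull
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return math.dist(x, point)
```

Under these assumptions, a query protein would be assigned to the family whose hull is nearest, for example by calling `distance_to_hull(natural_vector(query), [natural_vector(s) for s in family])` for each family and taking the minimum. The sequences and families passed in are hypothetical inputs, not data from the study.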