School of Computer and Communication, Hunan University, Changsha, Hunan, China.
Bioinformatics. 2010 Nov 1;26(21):2678-83. doi: 10.1093/bioinformatics/btq521. Epub 2010 Sep 8.
Biological sequences are an important object of study for many biologists because they carry a large amount of biological information that helps scientists study cells, DNA and proteins. Many researchers currently use protein-sequence-based methods, including machine-learning methods, for function classification, sub-cellular localization, and structure and functional-site prediction. The purpose of this article is to find a new approach to sequence analysis that is simpler and more effective.
Based on the properties of the 64 codons of the genetic code, we propose a simple and intuitive 2D graphical representation of protein sequences. Building on this representation, we give a new Euclidean-distance method for computing the distance between sequences in order to analyze their similarity. This approach retains more sequence information. A typical phylogenetic tree constructed with this method supports the effectiveness of our approach. Finally, we use this sequence-similarity analysis to predict protein sub-cellular localization on two commonly used datasets. The results show that the method is reasonable.
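The general idea of a 2D graphical representation combined with a Euclidean distance can be illustrated with a minimal sketch. The abstract does not give the codon-derived coordinates, the descriptor, or the classifier, so everything named here is an assumption made for illustration: the evenly spaced unit vectors in `UNIT_VECTORS`, the four-number `descriptor`, and the nearest-neighbour rule in `predict_location`. It is not the authors' construction.

```python
import math

# Hypothetical mapping: each of the 20 standard amino acids is assigned a unit
# vector at an evenly spaced angle. The paper derives its coordinates from the
# codon table; this placeholder assignment is NOT that mapping.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIT_VECTORS = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}


def curve(seq):
    """Return the 2D curve of a sequence as the running sum of residue vectors."""
    x = y = 0.0
    points = []
    for aa in seq:
        dx, dy = UNIT_VECTORS[aa]  # raises KeyError for non-standard residues
        x += dx
        y += dy
        points.append((x, y))
    return points


def descriptor(seq):
    """Summarise the curve as 4 length-normalised numbers (an assumed choice):
    the curve centroid and the curve end point, each divided by the length."""
    pts = curve(seq)
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    end_x, end_y = pts[-1]
    return (mean_x / n, mean_y / n, end_x / n, end_y / n)


def distance(seq_a, seq_b):
    """Euclidean distance between the descriptors of two sequences."""
    return math.dist(descriptor(seq_a), descriptor(seq_b))


def predict_location(query, labelled):
    """Assumed nearest-neighbour rule: return the label of the closest
    labelled sequence. `labelled` is a list of (sequence, label) pairs."""
    return min(labelled, key=lambda item: distance(query, item[0]))[1]
```

A pairwise matrix of `distance` values could be passed to any standard distance-based tree-building method, and `predict_location` shows one simple way a similarity measure can drive sub-cellular localization prediction. The sequences and labels supplied to it would come from the user's own dataset.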