Sun Zeju, Pei Shaojun, He Rong Lucy, Yau Stephen S-T
Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.
Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA.
Comput Struct Biotechnol J. 2020 Jul 15;18:1904-1913. doi: 10.1016/j.csbj.2020.07.004. eCollection 2020.
Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has since been extended to other biological macromolecules. In the CGR image of DNA, a sequence is converted into a series of points in the unit square. By comparison, existing CGR images of proteins are less elegant geometrically, and the meaning of the point distribution in them is less obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences using the CGR method. We also associate each CGR image with a vector in a high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method for discovering biological relationships between proteins.
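The dodecahedron-based CGR described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state which amino acid sits on which vertex, what contraction ratio is used, or where the iteration starts, so the alphabetical assignment in `AMINO_ACIDS`, the `ratio=0.5` default (borrowed from the classical DNA CGR), and the start at the origin are all assumptions made here for illustration. The function names are hypothetical. The ENV is not computed, because its definition is not given in the abstract.

```python
import itertools
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

# Assumption: alphabetical one-letter codes; the paper's actual vertex
# assignment is not given in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dodecahedron_vertices():
    """Return the 20 vertices of a regular dodecahedron centred at the origin."""
    # Eight cube vertices (+-1, +-1, +-1).
    verts = [tuple(map(float, v)) for v in itertools.product((-1, 1), repeat=3)]
    # Twelve more: cyclic permutations of (0, +-1/PHI, +-PHI).
    for a, b in itertools.product((-1 / PHI, 1 / PHI), (-PHI, PHI)):
        verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    return verts


def cgr_points(sequence, ratio=0.5):
    """Map a protein sequence to a list of 3D chaos-game points.

    Starting from the origin (an assumption), each residue moves the current
    point a fraction `ratio` of the way towards that residue's vertex.
    Characters outside the 20 standard amino acids are skipped.
    """
    vertex_of = dict(zip(AMINO_ACIDS, dodecahedron_vertices()))
    point = (0.0, 0.0, 0.0)
    points = []
    for residue in sequence.upper():
        if residue not in vertex_of:
            continue  # skip ambiguous or non-standard symbols
        target = vertex_of[residue]
        point = tuple(p + ratio * (t - p) for p, t in zip(point, target))
        points.append(point)
    return points
```

Calling `cgr_points("MKTAYIAKQR")` returns one 3D point per residue. The resulting point cloud is the kind of object that a descriptor such as the ENV would then summarize.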