He Lily, Li Yongkun, He Rong Lucy, Yau Stephen S-T
Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.
Department of Biological Sciences, Chicago State University, Chicago, IL, USA.
J Theor Biol. 2017 Aug 1;427:41-52. doi: 10.1016/j.jtbi.2017.06.002. Epub 2017 Jun 3.
Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply over the past decade. Traditionally, protein sequences are compared through multiple sequence alignment. However, alignment-based methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes, and the computation is very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, the polar requirement, and the chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method uses not only the chemical properties of amino acids but also their counts and positions in the sequence. Results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
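The abstract does not spell out how the 24 dimensions are built, so the following is only a minimal sketch of one way an alignment-free vector of this kind could be assembled, not the authors' published construction. It assumes the 20 amino acids are partitioned into eight groups (three by hydropathy, three by polar requirement, two by side-chain composition) and that each group contributes three numbers: its count, its mean position, and a length-normalized positional spread. The group memberships, the names GROUPS, group_stats and feature_vector, and the normalization are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set

# HYPOTHETICAL groupings, for illustration only: the abstract does not list them.
# Eight groups x three statistics per group = 24 dimensions.
GROUPS: Dict[str, str] = {
    # Hydropathy classes (assumed thresholds on the Kyte-Doolittle scale)
    "hydropathy_hydrophobic": "AVLIMFC",
    "hydropathy_neutral": "GTSWYP",
    "hydropathy_hydrophilic": "HEQDNKR",
    # Polar requirement classes (placeholder split)
    "polar_req_low": "CFILMVWY",
    "polar_req_mid": "AGPSTH",
    "polar_req_high": "DEKNQR",
    # Side-chain composition classes (placeholder split)
    "side_chain_class_1": "AGILPVFWYM",
    "side_chain_class_2": "CSTNQDEKRH",
}


def group_stats(seq: str, members: Set[str]) -> List[float]:
    """Return [count, mean position, normalized spread] for one residue group."""
    length = len(seq)
    # 1-based positions of residues belonging to this group
    positions = [i + 1 for i, aa in enumerate(seq) if aa in members]
    n = len(positions)
    if n == 0:
        # Group absent from the sequence: all three statistics are zero
        return [0.0, 0.0, 0.0]
    mean = sum(positions) / n
    # Second central moment of positions, scaled by count and sequence length
    # (an assumed normalization, not taken from the abstract)
    spread = sum((p - mean) ** 2 for p in positions) / (n * length)
    return [float(n), mean, spread]


def feature_vector(seq: str, groups: Dict[str, str] = GROUPS) -> List[float]:
    """Concatenate the per-group statistics into a single vector."""
    seq = seq.upper()
    vec: List[float] = []
    for members in groups.values():
        vec.extend(group_stats(seq, set(members)))
    return vec


if __name__ == "__main__":
    v = feature_vector("MVHLTPEEK")
    print(len(v), v[:3])
```

Because every sequence maps to a vector of the same fixed length, sequences of different lengths can be compared without alignment, which is the property the abstract relies on for speed.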