Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, PR China.
Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA.
Sci Rep. 2017 Sep 22;7(1):12226. doi: 10.1038/s41598-017-12493-2.
With the sharp increase in the number of biological sequences, traditional sequence alignment methods are becoming impractical and, in some cases, infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among them, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. These vectors consist of numerical features chosen for the biological problem at hand, and the features may come from the primary sequences or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector, based only on the primary sequences of organisms, to build their phylogeny. Three chemical and physical properties of primary sequences (purine, pyrimidine and keto) are incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters, so that three sequences are constructed, one per property. For each letter of each sequence, we calculate the number of occurrences of the letter, the average position of the letter, and the variation of the positions at which the letter appears in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate in inferring the phylogeny of organisms.
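The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives neither the two-letter alphabets nor the exact formula for the "variation" of positions, so the sketch assumes the common purine/pyrimidine (R/Y), amino/keto (M/K) and weak/strong (W/S) groupings, and uses the plain population variance of 1-based positions. The names `RECODINGS`, `letter_features` and `feature_vector` are hypothetical. With three recoded sequences, two letters each and three features per letter, the result has 3 × 2 × 3 = 18 components.

```python
# Assumed two-letter recodings; the abstract does not list them explicitly.
RECODINGS = {
    "purine_pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino_keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak_strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def letter_features(seq, letter):
    """Return [count, mean position, variance of positions] for one letter."""
    # Positions are 1-based, as is usual when describing sequences.
    positions = [i for i, ch in enumerate(seq, start=1) if ch == letter]
    n = len(positions)
    if n == 0:
        return [0, 0.0, 0.0]
    mean = sum(positions) / n
    # Assumption: "variation" is taken to be the population variance.
    variance = sum((p - mean) ** 2 for p in positions) / n
    return [n, mean, variance]


def feature_vector(dna):
    """Build the 18-component vector for one nucleotide sequence."""
    dna = dna.upper()
    vector = []
    for mapping in RECODINGS.values():
        # Assumption: characters other than A, C, G, T are skipped,
        # which shifts the positions of the letters that follow them.
        recoded = "".join(mapping[base] for base in dna if base in mapping)
        # Letters are visited in alphabetical order for a stable layout.
        for letter in sorted(set(mapping.values())):
            vector.extend(letter_features(recoded, letter))
    return vector


if __name__ == "__main__":
    print(feature_vector("ACGT"))
```

Once every organism is mapped to a vector of fixed length, sequences of different lengths can be compared with an ordinary vector distance, without any alignment.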