Li Yushuang, Lv Yanfen, Li Xiaonan, Xiao Wenli, Li Chun
School of Science, Yanshan University, Qinhuangdao 066004, PR China.
J Theor Biol. 2017 Apr 7;418:84-93. doi: 10.1016/j.jtbi.2017.01.031. Epub 2017 Jan 27.
Four new inter-nucleotide distance sequences for a DNA sequence are defined. They differ from those presented by Afreixo et al. and overcome the irreversibility defect of the global inter-nucleotide distance sequence proposed by Nair and Mahalakshmi. Five basic statistical quantities are extracted from the (ordered) precise inter-nucleotide distance sequences to construct a 20-dimensional feature vector. This simple mathematical descriptor of a DNA sequence plays a crucial role in sequence comparison and essential gene identification. The Euclidean distance between feature vectors is used to compare similarities among the whole mitochondrial genomes of 18 eutherian mammals and among 23 sequences of 16S ribosomal RNA, respectively. The derived phylogenetic trees agree well with several popular studies. Furthermore, using the feature vector as input, a support vector machine (SVM)-based method is developed to distinguish essential genes from non-essential genes in 5 bacteria. The AUC values (minimum 0.7971, maximum 0.8751, average 0.8174) are higher than some well-known results, which confirms the performance of the method.
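The general idea of turning distance sequences into a 20-dimensional descriptor and comparing descriptors by Euclidean distance can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the definitions of the four new distance sequences or name the five statistical quantities, so the sketch uses the plain gap between consecutive occurrences of the same nucleotide, and an arbitrary choice of five statistics (mean, population standard deviation, minimum, maximum, median). All function names are hypothetical and are not taken from the paper.

```python
import math
import statistics

NUCLEOTIDES = "ACGT"


def inter_nucleotide_distances(seq, base):
    """Gaps between consecutive occurrences of `base` in `seq`.

    Assumption: a simple per-nucleotide distance, not the paper's
    four new definitions, which the abstract does not specify.
    """
    positions = [i for i, ch in enumerate(seq.upper()) if ch == base]
    return [b - a for a, b in zip(positions, positions[1:])]


def summary_stats(values):
    """Five illustrative statistics (assumed, not the paper's choice)."""
    if not values:
        # No repeated occurrence of the base: fall back to zeros.
        return [0.0] * 5
    return [
        statistics.fmean(values),
        statistics.pstdev(values),
        float(min(values)),
        float(max(values)),
        float(statistics.median(values)),
    ]


def feature_vector(seq):
    """Concatenate 5 statistics for each of A, C, G, T: 20 values."""
    vec = []
    for base in NUCLEOTIDES:
        vec.extend(summary_stats(inter_nucleotide_distances(seq, base)))
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.dist(u, v)
```

Under these assumptions, `euclidean(feature_vector(s1), feature_vector(s2))` gives a pairwise dissimilarity that could fill a distance matrix for tree building, and the same 20 values could serve as input features for a classifier such as an SVM.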