School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China, 264209.
Evol Bioinform Online. 2011;7:149-58. doi: 10.4137/EBO.S7364. Epub 2011 Oct 4.
Determining sequence similarity is one of the major steps in computational phylogenetic studies. Over evolutionary history, DNA has undergone not only mutations of individual nucleotides but also subsequent rearrangements. A major task for computational biologists has therefore been to develop mathematical descriptors for similarity analysis that capture information about these different mutation phenomena simultaneously. In this paper, rather than building descriptors on traditional bases (e.g., nucleotide frequency or geometric representations), we construct novel mathematical descriptors based on graph theory. Specifically, for each DNA sequence we set up a weighted directed graph, and the adjacency matrix of that graph is used to derive a representative vector for the sequence. This approach measures similarity using both the ordering and the frequency of nucleotides, so that more information is taken into account. As an application, the method is tested on a set of 0.9-kb mtDNA sequences from twelve primate species. The phylogenetic trees produced under the various distance estimations all have the same topology and are generally consistent with results reported in earlier studies, which supports the effectiveness of the new method. We also test the method on a simulated data set, where it performs better than the traditional global alignment method when subsequent rearrangements occur frequently during evolutionary history.
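The graph construction described above can be illustrated with a minimal Python sketch. The abstract does not state how edges are weighted or which distances are used, so everything specific here is an assumption made for illustration: the four nucleotides are taken as vertices, each ordered pair of positions i < j adds 1/(j - i) to the edge from the nucleotide at i to the nucleotide at j, the 4x4 adjacency matrix is flattened row by row into a 16-dimensional vector, and Euclidean distance compares two vectors. The names `graph_vector`, `euclidean_distance`, and `max_gap` are hypothetical and do not come from the paper.

```python
import math

NUCLEOTIDES = "ACGT"  # vertex set of the assumed directed graph


def graph_vector(seq, max_gap=None):
    """Return the flattened 4x4 adjacency matrix of a weighted directed graph.

    Assumed weighting (not taken from the paper): every ordered pair of
    positions i < j contributes 1 / (j - i) to the edge seq[i] -> seq[j].
    max_gap optionally limits how far apart the two positions may be.
    Characters other than A, C, G, T are skipped.
    """
    index = {base: k for k, base in enumerate(NUCLEOTIDES)}
    matrix = [[0.0] * 4 for _ in range(4)]
    seq = seq.upper()
    n = len(seq)
    for i in range(n):
        if seq[i] not in index:
            continue
        stop = n if max_gap is None else min(n, i + max_gap + 1)
        for j in range(i + 1, stop):
            if seq[j] not in index:
                continue
            matrix[index[seq[i]]][index[seq[j]]] += 1.0 / (j - i)
    # Row-major flattening: entry 4 * row + col holds the weight row -> col.
    return [weight for row in matrix for weight in row]


def euclidean_distance(u, v):
    """Euclidean distance between two representative vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under this weighting, `graph_vector("AC")` and `graph_vector("CA")` differ even though both sequences have the same nucleotide frequencies, which illustrates how a graph-based vector can reflect ordering as well as frequency. The nested loop is quadratic in sequence length, so for sequences near 0.9 kb a finite `max_gap` keeps the computation cheap.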