Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, IL 60607, USA.
Gene. 2018 Oct 5;673:239-250. doi: 10.1016/j.gene.2018.06.042. Epub 2018 Jun 20.
Analyzing phylogenetic relationships with mathematical methods has long been important in bioinformatics, because quantitative methods can interpret raw biological data precisely. Multiple sequence alignment (MSA) is frequently used to analyze biological evolution, but it is very time-consuming: when the dataset is large, alignment methods cannot finish in a reasonable time. We therefore present a new method that uses moments of the cumulative Fourier power spectrum to cluster DNA sequences. Each sequence is mapped to a vector in Euclidean space, and the distances between vectors reflect the relationships between sequences. The mapping between the spectra and the moment vectors is one-to-one, which means that no information in the power spectra is lost during the calculation. We cluster and classify several datasets, including Influenza A, primate, and human rhinovirus (HRV) datasets, to build phylogenetic trees. The results show that the proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and than another alignment-free method, the k-mer method. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs for the cumulative Fourier power spectrum are available on GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum).
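The pipeline described in the abstract (sequence, then power spectrum, then cumulative spectrum, then moment vector, then Euclidean distance) can be illustrated with a minimal Python sketch. This is not the authors' released program. The function names, the use of four binary nucleotide indicator sequences, the exclusion of the zero-frequency term, and the normalization of the moments are all assumptions made for illustration; the published definitions may differ.

```python
import numpy as np


def indicator_sequences(seq):
    """Map a DNA string to four 0/1 indicator sequences (assumed encoding).

    Characters other than A, C, G, T (for example N) are 0 in every sequence.
    """
    codes = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    return {base: (codes == ord(base)).astype(float) for base in "ACGT"}


def cumulative_power_spectrum(u):
    """Cumulative sum of the Fourier power spectrum of one indicator sequence."""
    power = np.abs(np.fft.fft(u)) ** 2
    # Assumption: skip frequency 0, which only encodes the nucleotide count.
    return np.cumsum(power[1:])


def moment_vector(seq, n_moments=3):
    """Fixed-length feature vector: n_moments values per nucleotide.

    Assumption: the j-th moment is the mean of the j-th power of the
    cumulative spectrum scaled to [0, 1]. This makes sequences of different
    lengths comparable, but it is an illustrative normalization only.
    """
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    features = []
    for u in indicator_sequences(seq).values():
        cps = cumulative_power_spectrum(u)
        total = cps[-1]
        # A flat spectrum (absent or constant nucleotide) yields zero moments.
        scaled = cps / total if total > 1e-12 else np.zeros_like(cps)
        for j in range(1, n_moments + 1):
            features.append(float(np.mean(scaled ** j)))
    return np.array(features)


def distance(seq_a, seq_b, n_moments=3):
    """Euclidean distance between the moment vectors of two sequences."""
    return float(np.linalg.norm(
        moment_vector(seq_a, n_moments) - moment_vector(seq_b, n_moments)
    ))
```

Because every sequence maps to a vector of the same dimension (here 4 × `n_moments`), a pairwise distance matrix built with `distance` can be passed to any standard clustering or tree-building routine without an alignment step.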