Pei Shaojun, Dong Rui, He Rong Lucy, Yau Stephen S-T
Department of Mathematical Sciences, Tsinghua University, Beijing, PR China.
Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA.
Comput Struct Biotechnol J. 2019 Jul 11;17:982-994. doi: 10.1016/j.csbj.2019.07.003. eCollection 2019.
Genome comparison is a vital research area in bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods are impractical because of their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. The Euclidean distance between two such vectors measures the dissimilarity between the corresponding DNA sequences. We test the method on datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
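The pipeline described in the abstract (nucleotide-wise Fourier spectra, cumulative power and phase curves, moments and covariances, Euclidean distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`dft`, `feature_vector`, `euclidean`), the binary indicator encoding, the omission of the zero-frequency term, and the layout of the vector are all assumptions made for illustration: here each spectrum contributes a mean and a variance per nucleotide plus six pairwise covariances, which happens to give 28 entries, but the paper's actual moment definitions and normalization may differ.

```python
import cmath
import math
from itertools import combinations

NUCLEOTIDES = "ACGT"


def dft(signal):
    """Naive O(n^2) discrete Fourier transform; adequate for short demo sequences."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def cumulative(values):
    """Running sum of a list of numbers."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance; covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def feature_vector(seq):
    """Map a DNA string (length >= 2) to a 28-entry vector (assumed layout, see lead-in)."""
    seq = seq.upper()
    power, phase = {}, {}
    for nt in NUCLEOTIDES:
        # Assumed encoding: 1 where the nucleotide occurs, 0 elsewhere.
        indicator = [1.0 if c == nt else 0.0 for c in seq]
        # Drop k = 0, which only encodes the nucleotide count.
        spectrum = dft(indicator)[1:]
        power[nt] = cumulative([abs(z) ** 2 for z in spectrum])
        phase[nt] = cumulative([cmath.phase(z) for z in spectrum])

    features = []
    for curves in (power, phase):
        for nt in NUCLEOTIDES:
            features.append(mean(curves[nt]))
            features.append(covariance(curves[nt], curves[nt]))
        for a, b in combinations(NUCLEOTIDES, 2):
            features.append(covariance(curves[a], curves[b]))
    return features


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(u, v)


if __name__ == "__main__":
    v1 = feature_vector("ACGTACGTAC")
    v2 = feature_vector("AAAACCCCGG")
    print(len(v1), euclidean(v1, v2))
```

Because every sequence maps to a vector of the same fixed length regardless of its own length, pairwise distances can be fed directly into a hierarchical clustering routine without any alignment step. For real genomes the naive transform above would be replaced by an FFT.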