Yin Changchuan, Chen Ying, Yau Stephen S-T
College of Information Systems and Technology, University of Phoenix, Chicago, IL 60601, USA.
Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
J Theor Biol. 2014 Oct 21;359:18-28. doi: 10.1016/j.jtbi.2014.05.043. Epub 2014 Jun 6.
Multiple sequence alignment (MSA) is a prominent method for classifying DNA sequences, yet it is hampered by inherent limitations in computational complexity. Alignment-free methods have been developed over the past decade to compare and classify DNA sequences more efficiently than MSA. However, because most alignment-free methods are based on feature extraction, they may lose structural and functional information and therefore may not fully reflect the actual differences among DNA sequences. Alignment-free methods that conserve information are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure for DNA sequences based on the Discrete Fourier Transform (DFT). In this method, we map each DNA sequence into four binary indicator sequences and apply the DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance between the full DFT power spectra of two DNA sequences is used as the distance metric. To compare the DFT power spectra of DNA sequences of different lengths, we propose an even scaling method that extends the shorter DFT power spectra to the length of the longest sequence compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in a DFT frequency space of the same dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.
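The pipeline described in the abstract (indicator sequences, DFT, power spectrum, even scaling, Euclidean distance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the even scaling formula, so linear interpolation via `np.interp` stands in for it here. Summing the four per-base power spectra into one spectrum and keeping the zero-frequency term are also assumptions. The function names `power_spectrum`, `even_scale`, and `dft_distance` are hypothetical.

```python
import numpy as np


def power_spectrum(seq):
    """Sum of the DFT power spectra of the four binary indicator sequences.

    Assumption: the per-base spectra are added into one spectrum and the
    zero-frequency (DC) term is kept; the abstract does not specify either.
    """
    seq = seq.upper()
    total = np.zeros(len(seq))
    for base in "ACGT":
        # Binary indicator sequence: 1 where the base occurs, 0 elsewhere.
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        total += np.abs(np.fft.fft(indicator)) ** 2
    return total


def even_scale(spectrum, target_len):
    """Stretch a power spectrum to target_len points.

    Assumption: linear interpolation is used as a stand-in for the
    even scaling step, whose exact formula is not given in the abstract.
    """
    n = len(spectrum)
    if n == target_len:
        return spectrum.copy()
    old_pos = np.linspace(0.0, 1.0, n)
    new_pos = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_pos, old_pos, spectrum)


def dft_distance(seq_a, seq_b):
    """Euclidean distance between the evenly scaled power spectra."""
    ps_a = power_spectrum(seq_a)
    ps_b = power_spectrum(seq_b)
    m = max(len(ps_a), len(ps_b))  # scale up to the longest sequence
    return float(np.linalg.norm(even_scale(ps_a, m) - even_scale(ps_b, m)))
```

A pairwise matrix built from `dft_distance` could then be passed to any standard hierarchical clustering routine, which is how the abstract says the metric was evaluated.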