A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering.

Author information

Yin Changchuan, Chen Ying, Yau Stephen S-T

Affiliations

College of Information Systems and Technology, University of Phoenix, Chicago, IL 60601, USA.

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.

Publication information

J Theor Biol. 2014 Oct 21;359:18-28. doi: 10.1016/j.jtbi.2014.05.043. Epub 2014 Jun 6.

DOI: 10.1016/j.jtbi.2014.05.043

PMID: 24911780

Abstract

Multiple sequence alignment (MSA) is a prominent method for classification of DNA sequences, yet it is hampered by inherent limitations in computational complexity. Alignment-free methods have been developed over the past decade for more efficient comparison and classification of DNA sequences than MSA. However, most alignment-free methods may lose structural and functional information of DNA sequences because they are based on feature extraction. Therefore, they may not fully reflect the actual differences among DNA sequences. Alignment-free methods with information conservation are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT). In this method, we map DNA sequences into four binary indicator sequences and apply the DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance between the full DFT power spectra of the DNA sequences is used as the similarity distance metric. To compare the DFT power spectra of DNA sequences with different lengths, we propose an even scaling method to extend shorter DFT power spectra to equal the longest length of the sequences compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in the same DFT frequency space dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.
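The pipeline described in the abstract (indicator sequences, DFT power spectrum, scaling to a common length, Euclidean distance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. The four indicator spectra are assumed to be summed into a single spectrum. The abstract does not give the exact "even scaling" rule, so plain linear interpolation stands in for it here. The DFT is computed directly in O(N^2) to keep the sketch dependency-free; an FFT routine such as `numpy.fft.fft` would be the usual choice for real sequences.

```python
import cmath
import math

NUCLEOTIDES = "ACGT"


def indicator_sequences(seq):
    """Map a DNA string to four 0/1 indicator sequences, one per nucleotide."""
    seq = seq.upper()
    return {b: [1.0 if c == b else 0.0 for c in seq] for b in NUCLEOTIDES}


def dft_power(x):
    """Power spectrum |X[k]|^2 of a real sequence, using a direct O(N^2) DFT."""
    n = len(x)
    power = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    return power


def power_spectrum(seq):
    """Sum the power spectra of the four indicator sequences (assumed combination)."""
    spectra = [dft_power(u) for u in indicator_sequences(seq).values()]
    return [sum(values) for values in zip(*spectra)]


def stretch(spectrum, m):
    """Stretch a spectrum to m points by linear interpolation.

    Assumption: this stands in for the paper's even scaling method,
    whose exact formula is not stated in the abstract.
    """
    n = len(spectrum)
    if n == m:
        return list(spectrum)
    if m == 1:
        return [spectrum[0]]
    out = []
    for k in range(m):
        q = k * (n - 1) / (m - 1)      # fractional index into the original spectrum
        r = int(math.floor(q))         # left neighbour
        nxt = min(r + 1, n - 1)        # right neighbour, clamped at the end
        out.append(spectrum[r] + (q - r) * (spectrum[nxt] - spectrum[r]))
    return out


def dft_distance(seq_a, seq_b):
    """Euclidean distance between length-matched power spectra of two sequences."""
    if not seq_a or not seq_b:
        raise ValueError("sequences must be non-empty")
    pa, pb = power_spectrum(seq_a), power_spectrum(seq_b)
    m = max(len(pa), len(pb))          # scale the shorter spectrum up to the longer one
    pa, pb = stretch(pa, m), stretch(pb, m)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

Calling `dft_distance` on every pair of sequences gives a distance matrix that could then be passed to a hierarchical clustering routine, which is the use the abstract evaluates.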

Similar articles

A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering.

J Theor Biol. 2014 Oct 21;359:18-28. doi: 10.1016/j.jtbi.2014.05.043. Epub 2014 Jun 6.

An improved model for whole genome phylogenetic analysis by Fourier transform.

J Theor Biol. 2015 Oct 7;382:99-110. doi: 10.1016/j.jtbi.2015.06.033. Epub 2015 Jul 4.

A novel method for comparative analysis of DNA sequences by Ramanujan-Fourier transform.

J Comput Biol. 2014 Dec;21(12):867-79. doi: 10.1089/cmb.2014.0120.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

A new method to cluster DNA sequences using Fourier power spectrum.

J Theor Biol. 2015 May 7;372:135-45. doi: 10.1016/j.jtbi.2015.02.026. Epub 2015 Mar 5.

Sequence alignment by cross-correlation.

J Biomol Tech. 2005 Dec;16(4):453-8.

A new method to cluster genomes based on cumulative Fourier power spectrum.

Gene. 2018 Oct 5;673:239-250. doi: 10.1016/j.gene.2018.06.042. Epub 2018 Jun 20.

Alignment method for spectrograms of DNA sequences.

IEEE Trans Inf Technol Biomed. 2010 Jan;14(1):3-9. doi: 10.1109/TITB.2009.2033052. Epub 2009 Sep 29.

A new method to analyze protein sequence similarity using Dynamic Time Warping.

Genomics. 2017 Mar;109(2):123-130. doi: 10.1016/j.ygeno.2016.12.002. Epub 2016 Dec 11.

Homology assessment and molecular sequence alignment.

J Biomed Inform. 2006 Feb;39(1):18-33. doi: 10.1016/j.jbi.2005.11.005. Epub 2005 Dec 9.

Cited by

CAKL: Commutative algebra k-mer learning of genomics.

ArXiv. 2025 Aug 13:arXiv:2508.09406v1.

Predicting chromosomal compartments directly from the nucleotide sequence with DNA-DDA.

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad198.

A new gene tree algorithm employing DNA sequences of bovine genome using discrete Fourier transformation.

PLoS One. 2023 Mar 9;18(3):e0277480. doi: 10.1371/journal.pone.0277480. eCollection 2023.

FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis.

BMC Bioinformatics. 2022 Aug 19;23(1):347. doi: 10.1186/s12859-022-04889-3.

An efficient numerical representation of genome sequence: natural vector with covariance component.

PeerJ. 2022 Jun 16;10:e13544. doi: 10.7717/peerj.13544. eCollection 2022.

Identification of HIV Rapid Mutations Using Differences in Nucleotide Distribution over Time.

Genes (Basel). 2022 Jan 19;13(2):170. doi: 10.3390/genes13020170.

Full Chromosomal Relationships Between Populations and the Origin of Humans.

Front Genet. 2022 Feb 2;12:828805. doi: 10.3389/fgene.2021.828805. eCollection 2021.

MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors.

Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab434.

The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers.

Entropy (Basel). 2021 Oct 17;23(10):1357. doi: 10.3390/e23101357.

Advances in the computational analysis of SARS-COV2 genome.

Nonlinear Dyn. 2021;106(2):1525-1555. doi: 10.1007/s11071-021-06836-y. Epub 2021 Aug 27.
