Xie Xiaojing, Guan Jihong, Zhou Shuigeng
BMC Genomics. 2015;16 Suppl 3(Suppl 3):S5. doi: 10.1186/1471-2164-16-S3-S5. Epub 2015 Jan 29.
DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort over the last two decades, and a dozen or so algorithms and tools have been developed. These methods fall into three groups, based on alignment, word frequency, or geometric representation, each with its own advantages and disadvantages.
In this paper, to compute the similarity between DNA sequences effectively, we introduce a novel method that uses frequency patterns and entropy to construct representative vectors of DNA sequences. We evaluate the proposed method experimentally, comparing it with two recently developed alignment-free methods and the BLASTN tool. When tested on the β-globin genes of 11 species, with the results from MEGA as the baseline, our method achieves higher correlation coefficients than both alignment-free methods and BLASTN.
Our method not only captures fine-grained information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms major existing methods and tools.
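As a rough illustration of the general idea of blocking a sequence and summarizing each block by an entropy value, a minimal Python sketch follows. It is not the method of this paper: the abstract does not specify how maximal frequent patterns are mined or how entropy is combined with them, so the sketch substitutes plain fixed-length k-mer counts. The function names `block_entropy_vector` and `euclidean_distance`, and the parameters `n_blocks` and `k`, are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def block_entropy_vector(seq, n_blocks=4, k=2):
    """Return one Shannon entropy value (in bits) per block of `seq`.

    Assumption: k-mer counts stand in for the paper's maximal frequent
    patterns, which the abstract does not define in detail.
    """
    seq = seq.upper()
    size = math.ceil(len(seq) / n_blocks)  # block length, last block may be shorter
    vector = []
    for i in range(n_blocks):
        block = seq[i * size:(i + 1) * size]
        # Count every overlapping k-mer inside this block.
        counts = Counter(block[j:j + k] for j in range(len(block) - k + 1))
        total = sum(counts.values())
        if total == 0:
            vector.append(0.0)  # block too short to contain any k-mer
            continue
        entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        vector.append(entropy)
    return vector


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    v1 = block_entropy_vector("ACGTACGT", n_blocks=2, k=1)
    v2 = block_entropy_vector("AAAAAAAA", n_blocks=2, k=1)
    print(v1, v2, euclidean_distance(v1, v2))
```

Because each block contributes its own vector component, positional information is retained at block granularity, which is the intuition behind the sequence blocking mentioned above. A smaller distance between two vectors indicates more similar block-wise composition under these assumptions.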