Similarity evaluation of DNA sequences based on frequent patterns and entropy.

Authors

Xie Xiaojing, Guan Jihong, Zhou Shuigeng

Publication information

BMC Genomics. 2015;16 Suppl 3(Suppl 3):S5. doi: 10.1186/1471-2164-16-S3-S5. Epub 2015 Jan 29.

DOI: 10.1186/1471-2164-16-S3-S5

PMID: 25707937

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4331808/

Abstract

BACKGROUND

DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort over the last two decades, and about a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency, or geometric representation, and each approach has its own advantages and disadvantages.

RESULTS

To compute the similarity between DNA sequences effectively, we introduce a novel method that uses frequent patterns and entropy to construct representative vectors of DNA sequences. We evaluate the proposed method experimentally and compare it with two recently developed alignment-free methods and the BLASTN tool. When tested on the β-globin genes of 11 species, with the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and BLASTN.
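The evaluation above scores each method by how well its pairwise results correlate with the MEGA baseline. The abstract does not say which correlation coefficient was used, so the sketch below assumes Pearson's coefficient; the function name `pearson` and the idea of passing two flat lists of pairwise distances are illustrative assumptions, not details from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: x holds pairwise distances from the method under test and
    y holds the matching baseline distances, in the same pair order.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)
```

A coefficient close to 1 would mean the method orders and spaces the species pairs much as the baseline does.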

CONCLUSIONS

By dividing sequences into blocks, our method captures fine-grained information (location and ordering) in DNA sequences. Because it considers only the maximal frequent patterns, it is also insensitive to noise and sequence rearrangement. It outperforms major existing methods and tools.
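The general idea of blocking a sequence and summarizing each block with an entropy value can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses fixed-length k-mers that meet a count threshold in place of maximal frequent patterns, and the function names and the parameters `n_blocks`, `k`, and `min_count` are hypothetical.

```python
import math
from collections import Counter


def frequent_kmers(block, k, min_count):
    """Count k-mers in a block and keep those seen at least min_count times.

    Assumption: fixed-length k-mers stand in for the paper's maximal
    frequent patterns, which are not specified in the abstract.
    """
    counts = Counter(block[i:i + k] for i in range(len(block) - k + 1))
    return {kmer: c for kmer, c in counts.items() if c >= min_count}


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by the counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropy_vector(seq, n_blocks=4, k=2, min_count=2):
    """Split seq into n_blocks consecutive blocks; return one entropy per block."""
    size = math.ceil(len(seq) / n_blocks)
    blocks = [seq[i * size:(i + 1) * size] for i in range(n_blocks)]
    return [shannon_entropy(frequent_kmers(b, k, min_count)) for b in blocks]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = block_entropy_vector("ACGTACGTACGTACGT", n_blocks=2)
    b = block_entropy_vector("ACGTACGTTTTTTTTT", n_blocks=2)
    print(a, b, euclidean(a, b))
```

Because each block contributes its own vector component, the position of a pattern along the sequence affects the result, which is the role the abstract attributes to blocking. Dropping patterns below the count threshold is one plausible way to damp the effect of isolated noisy bases.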
