Brief Bioinform. 2014 Nov;15(6):890-905. doi: 10.1093/bib/bbt052. Epub 2013 Jul 31.
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require comparative analysis to support discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used for this purpose, but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are also prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools that stem from alignment-free methods based on the statistical analysis of word frequencies. We provide several clear examples to demonstrate applications and their interpretation across several areas of alignment-free analysis, such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic. Additionally, we provide a detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
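As a rough illustration of the word-frequency idea behind the D2 statistic named in the abstract, the following minimal Python sketch counts overlapping words of length k in two sequences and sums the products of the matching counts. It assumes the common textbook form of D2, the inner product of the two word-count vectors; the function names `word_counts` and `d2_statistic` and the toy sequences are illustrative choices, not taken from the review.

```python
from collections import Counter


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word (k-mer) of length k in seq."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Assumed form of D2: sum over all words of count_a(word) * count_b(word)."""
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b, so only shared words contribute.
    return sum(n * counts_b[word] for word, n in counts_a.items())


# Toy sequences (hypothetical): the shared 2-mers are AC, CG and GT.
print(d2_statistic("ACGTACGT", "ACGTTTTT", 2))  # 6
```

No alignment is computed at any point: the score depends only on which words occur and how often, so rearranging segments of a sequence changes only the words that span the junctions.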