Burkhard Morgenstern
University of Göttingen, Department of Bioinformatics (IMG), Göttingen, Germany.
Methods Mol Biol. 2021;2231:121-134. doi: 10.1007/978-1-0716-1036-7_8.
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes now produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Fast alignment-free approaches to sequence comparison have therefore become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e., inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches can accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
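To make the idea of a spaced-word match concrete, the following is a minimal Python sketch and not the implementation described in the chapter. It assumes a binary pattern string in which '1' marks a match position and '0' marks a don't-care position where mismatches are allowed. The function names, the example pattern, and the use of the Jukes-Cantor correction as the stochastic model are illustrative assumptions; a practical tool would also need to filter out spurious background matches, which this sketch does not attempt.

```python
import math
from collections import defaultdict


def spaced_word_matches(seq_a, seq_b, pattern):
    """Yield (i, j) start positions at which seq_a and seq_b agree at every
    match position ('1') of the pattern. Don't-care positions ('0') may differ."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)

    # Index every window of seq_a by the characters at its match positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in match_pos)
        index[key].append(i)

    # Look up every window of seq_b in that index.
    for j in range(len(seq_b) - span + 1):
        key = "".join(seq_b[j + k] for k in match_pos)
        for i in index.get(key, ()):
            yield i, j


def mismatch_fraction(seq_a, seq_b, pattern):
    """Fraction of mismatches at the don't-care positions, taken over all
    spaced-word matches. Returns None if there is nothing to count."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = 0
    total = 0
    for i, j in spaced_word_matches(seq_a, seq_b, pattern):
        for k in dont_care:
            total += 1
            if seq_a[i + k] != seq_b[j + k]:
                mismatches += 1
    return mismatches / total if total else None


def jukes_cantor_distance(p):
    """Jukes-Cantor estimate of substitutions per site from a DNA mismatch
    fraction p (assumed model for this sketch). Saturates to infinity at p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Hypothetical toy input; real sequences and patterns are much longer.
    p = mismatch_fraction("ACGTACGT", "ACCTACGT", "101")
    print(p, jukes_cantor_distance(p))
```

On sequences this short, most matches are coincidental, so the resulting number is not a meaningful distance; the sketch only shows where the don't-care positions enter the estimate.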