Snir Sagi, Pachter Lior
Department of Evolutionary Biology and the Institute of Evolution, Haifa University, Haifa, Israel.
J Comput Biol. 2011 Aug;18(8):967-86. doi: 10.1089/cmb.2010.0325. Epub 2011 Jul 5.
Sequence alignment (the grouping of homologous bases into one column) is fundamental to almost any task in comparative genomics. In practice, this means placing gaps in the genomic sequences to account for insertion and deletion events (indels). The interrelationship between sequence alignment and phylogenetic reconstruction has drawn substantial attention recently, with works showing the significance of differences in alignments. One plausible approach in this direction is to grade the suitability of a tree to an associated alignment and vice versa. Here we present a combinatorial (as opposed to statistical) approach based on the indel history. We show, both by simulations and by using real biological data from the Encyclopedia of DNA Elements (ENCODE), that this criterion is sound. The novelty of our approach lies in distinguishing between insertions and deletions and in augmenting the analysis with a dimension of "depth," extending it from the sequence space to the phylogenetic space. Using this approach, we perform a comprehensive study of characteristic indel behavior among mammals in both coding and non-coding regions. Our results show significant differences in indel patterns between coding and non-coding regions. We also show other characteristic patterns of indel evolution across the depth of the underlying phylogeny.