Size, frequency, and phylogenetic signal of multiple-residue indels in sequence alignment of introns.

Author information

Pons Joan, Vogler Alfried P

Affiliations

Department of Entomology, The Natural History Museum, London SW7 5BD, UK; Department of Biological Sciences, Imperial College London, Silwood Park Campus, Ascot, Berkshire SL5 7PY, UK.

Publication information

Cladistics. 2006 Apr;22(2):144-156. doi: 10.1111/j.1096-0031.2006.00088.x.

DOI:10.1111/j.1096-0031.2006.00088.x

PMID:34892873

Abstract

Indels in DNA sequences frequently affect more than a single nucleotide, creating problems for alignment, character coding and phylogenetic analysis. However, the size and frequency of multiple-residue indels are not usually tested, and with popular alignment packages their reconstruction is indirectly achieved by reducing the affine (gap extension) cost. We explored the length distribution of indels in intron sequences of the gene Mp20 by modifying the gap opening and gap extension costs. Given a "known" tree for the study group, global homology levels were greatest under low gap cost, with gap extension costs of roughly 0.4-fold the opening cost. Different approaches to gap coding and weighting suggested that taxonomic congruence was correlated with high frequencies of multiple-position indels, with a maximum indel length of 2-5 bp and few indels above 15 bp, but also including a proportion of indels > 100 bp. Only a small minority of indels could be reconstructed as single-position indels. Consequently, tree topologies improved when homologous multinucleotide indels were recoded as binary characters, which are otherwise highly homoplastic and weighted characters in single-position coding. In tree-generating alignment procedures as implemented in POY, where gap penalty determines the character weight during tree search, the problem of assigning inappropriately high weight to multiple-residue indels could partly be overcome by setting the extension costs to about 0.4-fold lower than gap opening costs. We conclude that multiple consecutive gap positions are not independent characters and hence methods for parsimony reconstruction of long indels are required. Finally, we also observed a general lack of correlation between taxonomic and character congruence, demonstrating the difficulties of applying congruence criteria to decide among competing alignments. This highlights the value of recent model-based alignment procedures, which can implement the statistical distributions of indel size classes and do not rely on potentially circular strategies for optimizing overall congruence.
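The two ideas the abstract relies on, affine gap costs and recoding a run of consecutive gap positions as a single binary character, can be illustrated with a minimal Python sketch. It is not the authors' procedure and not POY's implementation. The function names, the toy alignment, and the cost convention (the first gap position pays the opening cost, each further position pays the extension cost) are assumptions made for illustration; only the 0.4 extension-to-opening ratio comes from the abstract. The coding shown is a simplified presence/absence scheme in which two gaps count as the same indel only if they span exactly the same columns.

```python
from typing import Dict, List, Tuple

GAP = "-"


def gap_runs(seq: str) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, of each run of gaps."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, ch in enumerate(seq):
        if ch == GAP:
            if start is None:
                start = i  # a new run of gaps begins here
        elif start is not None:
            runs.append((start, i))  # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, len(seq)))  # run extends to the end of the row
    return runs


def affine_cost(length: int, gap_open: float, ext_ratio: float = 0.4) -> float:
    """Cost of one indel of `length` residues under an assumed convention:
    the first position pays gap_open, each further position pays
    ext_ratio * gap_open (0.4 is the ratio reported in the abstract)."""
    if length < 1:
        raise ValueError("indel length must be at least 1")
    return gap_open + (length - 1) * ext_ratio * gap_open


def code_indels(
    alignment: Dict[str, str],
) -> Tuple[List[Tuple[int, int]], Dict[str, List[int]]]:
    """Recode each distinct gap run as one binary character (1 = present).

    Simplification: gaps are treated as the same indel only when their
    column spans are identical; overlapping gaps are not reconciled.
    """
    if len({len(s) for s in alignment.values()}) > 1:
        raise ValueError("aligned sequences must have equal length")
    runs_by_taxon = {name: set(gap_runs(seq)) for name, seq in alignment.items()}
    all_runs = sorted(set().union(*runs_by_taxon.values()))
    matrix = {
        name: [1 if run in runs else 0 for run in all_runs]
        for name, runs in runs_by_taxon.items()
    }
    return all_runs, matrix


if __name__ == "__main__":
    # Hypothetical toy alignment, not data from the study.
    toy = {
        "taxonA": "ACGTACGTAC",
        "taxonB": "AC----GTAC",
        "taxonC": "AC----GTA-",
    }
    runs, matrix = code_indels(toy)
    print("indel characters (column spans):", runs)
    for name, row in matrix.items():
        print(name, row)
    # One 4-bp event under affine costs versus four independent positions.
    print("affine cost:", affine_cost(4, gap_open=1.0))
    print("single-position cost:", 4 * 1.0)
```

In the toy alignment the 4-bp gap shared by taxonB and taxonC becomes one character rather than four, which is the sense in which consecutive gap positions are treated as non-independent.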

