Dress Andreas W M, Flamm Christoph, Fritzsch Guido, Grünewald Stefan, Kruspe Matthias, Prohaska Sonja J, Stadler Peter F
Department of Combinatorics and Geometry (DCG), MPG/CAS Partner Institute for Computational Biology (PICB), Shanghai Institutes for Biological Sciences (SIBS), Shanghai, PR China.
Algorithms Mol Biol. 2008 Jun 24;3:7. doi: 10.1186/1748-7188-3-7.
Sequence-based methods for phylogenetic reconstruction from (nucleic acid) sequence data are notoriously plagued by two effects: homoplasies and alignment errors. Large evolutionary distances imply a large number of homoplastic sites. Because most protein-coding genes show dramatic variations in substitution rate that are correlated along the sequence, this often leads to a patchwork pattern of (i) phylogenetically informative and (ii) effectively randomized regions. Furthermore, alignment errors accumulate in highly variable regions, sometimes producing misleading signals in phylogenetic reconstruction.
We present here a method that, based on assessing the distribution of character states along a cyclic ordering of the taxa, allows the identification of phylogenetically uninformative homoplastic sites in a multiple sequence alignment. Removal of these sites appears to improve the performance of phylogenetic reconstruction algorithms as measured by various indices of "tree quality". In particular, we obtain more stable trees due to the exclusion of phylogenetically incompatible sites that most likely represent strongly randomized characters.
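The idea of scoring a site by how its character states are distributed along a cyclic ordering of the taxa can be illustrated with a minimal sketch. This is not the scoring scheme of the published program; it is a simplified stand-in that counts state changes between neighbouring taxa on the cycle and compares that count with randomly shuffled orderings. The function names (`cyclic_changes`, `site_score`, `filter_sites`), the `cutoff` value, and the number of shuffles are all hypothetical, and the cyclic ordering is assumed to be supplied by some other step.

```python
import random


def cyclic_changes(column, order):
    """Count state changes between neighbouring taxa along a cyclic ordering."""
    n = len(order)
    return sum(column[order[i]] != column[order[(i + 1) % n]] for i in range(n))


def site_score(column, order, n_shuffles=1000, seed=0):
    """Fraction of random orderings showing strictly more changes than `order`.

    A score near 1 means the states are clustered along the cycle; a score
    near 0 means the site looks no better than a randomized character.
    (Illustrative score only, an assumption of this sketch.)
    """
    rng = random.Random(seed)
    observed = cyclic_changes(column, order)
    shuffled = list(order)
    more_changes = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if cyclic_changes(column, shuffled) > observed:
            more_changes += 1
    return more_changes / n_shuffles


def filter_sites(alignment, order, cutoff=0.8):
    """Keep constant columns and columns whose score reaches `cutoff`.

    `alignment` is a list of equal-length strings, one per taxon, and
    `order` lists the row indices in their (assumed given) cyclic order.
    The cutoff of 0.8 is an arbitrary choice for this sketch.
    """
    columns = list(zip(*alignment))
    keep = [
        j
        for j, col in enumerate(columns)
        if len(set(col)) == 1 or site_score(col, order) >= cutoff
    ]
    return ["".join(row[j] for j in keep) for row in alignment]
```

With eight taxa ordered `0..7`, a column such as `AAAACCCC` changes state only twice around the cycle, whereas `ACACACAC` changes at every step, so `filter_sites` would retain the first and drop the second. Constant columns are kept here because they carry no conflicting signal; whether to keep them is a design choice of this sketch.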
The computer program noisy implements this approach. It can be used to improve phylogenetic reconstruction, with a considerable success rate, whenever (1) the average bootstrap support obtained from the original alignment is low, and (2) the data set contains sufficiently many taxa (at least, say, 12 to 15). The software can be obtained under the GNU General Public License from http://www.bioinf.uni-leipzig.de/Software/noisy/.