Department of Integrative Biology, University of California, Berkeley, 3060 Valley Life Sciences Building #3140, Berkeley, CA 94720-3140, USA.
Syst Biol. 2011 Jan;60(1):60-73. doi: 10.1093/sysbio/syq074. Epub 2010 Nov 15.
Nearly all commonly used methods of phylogenetic inference assume that characters in an alignment evolve independently of one another. This assumption is attractive for simplicity and computational tractability but is not biologically reasonable for RNAs and proteins that have secondary and tertiary structures. Here, we simulate RNA and protein-coding DNA sequence data under a general model of dependence in order to assess the robustness of traditional methods of phylogenetic inference to violation of the assumption of independence among sites. We find that the accuracy of independence-assuming methods is reduced by the dependence among sites; for proteins this reduction is relatively mild, but for RNA this reduction may be substantial. We introduce the concept of effective sequence length and its utility for considering information content in phylogenetics.
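The independence assumption described in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not the simulation model or the definition of effective sequence length used in the paper. It assumes the Jukes-Cantor (JC69) model for a pair of sequences, and the function names and toy sequences are hypothetical. Under independence, the alignment log-likelihood is a sum of per-site terms, so an exact copy of every site (an extreme form of dependence) doubles the apparent evidence without changing the distance estimate.

```python
import math


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Maximum-likelihood JC69 distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Proportion of sites at which the two sequences differ.
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences are saturated; JC69 distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_log_likelihood(seq_a: str, seq_b: str, d: float) -> float:
    """Pairwise log-likelihood at distance d.

    Summing per-site terms is exactly the independence-among-sites assumption.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # joint probability of one identical pair
    p_diff = 0.25 * (0.25 - 0.25 * e)  # joint probability of one differing pair
    return sum(
        math.log(p_same if a == b else p_diff) for a, b in zip(seq_a, seq_b)
    )


# Toy alignment (hypothetical data): 10 sites, 2 differences.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"

# Extreme dependence: every site is followed by an exact copy of itself.
dup_a = "".join(c * 2 for c in seq_a)
dup_b = "".join(c * 2 for c in seq_b)

d_orig = jc69_distance(seq_a, seq_b)
d_dup = jc69_distance(dup_a, dup_b)
ll_orig = jc69_log_likelihood(seq_a, seq_b, d_orig)
ll_dup = jc69_log_likelihood(dup_a, dup_b, d_dup)

print("distance (original, duplicated):", d_orig, d_dup)
print("log-likelihood (original, duplicated):", ll_orig, ll_dup)
```

In this toy case the duplicated alignment has 20 columns but carries the information of only 10, which is the intuition behind treating a set of dependent sites as having a shorter effective length than its nominal length.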