North Carolina Museum of Natural Sciences, 1671 Goldstar Drive, Raleigh, NC 27601, USA.
Department of Ecology and Evolutionary Biology, Yale University, 165 Prospect Street, New Haven, CT 06525, USA.
Syst Biol. 2019 Jan 1;68(1):145-156. doi: 10.1093/sysbio/syy047.
With the rise of genome-scale data sets, there has been a call for increased data scrutiny and careful selection of loci that are appropriate for resolving a given phylogenetic problem. Such loci should maximize phylogenetic information content while minimizing the risk of homoplasy. Theory posits the existence of characters that evolve at an optimum rate, and efforts to determine optimal rates of inference have been a cornerstone of phylogenetic experimental design for over two decades. However, both theoretical and empirical investigations of optimal rates have varied dramatically in their conclusions, ranging from no relationship to a tight relationship between the rate of change and phylogenetic utility. Herein, we synthesize these apparently contradictory views, demonstrating both empirical and theoretical conditions under which each is correct. We find that optimal rates of characters, not genes, are generally robust to most experimental design decisions. Moreover, consideration of site rate heterogeneity within a given locus is critical to accurate predictions of utility. Factors such as taxon sampling or the targeted number of characters providing support for a topology are additionally critical to the predictions of phylogenetic utility based on the rate of character change. Further, optimality of rates and predictions of phylogenetic utility are not equivalent, demonstrating the need for further development of a comprehensive theory of phylogenetic experimental design. [Divergence time; GC bias; homoplasy; incongruence; information content; internode length; optimal rates; phylogenetic informativeness; phylogenetic theory; phylogenetic utility; phylogenomics; signal and noise; subtending branch length; state space; taxon and character sampling.]