Shen Xing-Xing, Salichos Leonidas, Rokas Antonis
Department of Biological Sciences, Vanderbilt University.
Department of Biological Sciences, Vanderbilt University; Department of Molecular Biophysics and Biochemistry, Yale University.
Genome Biol Evol. 2016 Sep 2;8(8):2565-80. doi: 10.1093/gbe/evw179.
Molecular phylogenetic inference is inherently dependent on choices in both methodology and data. Many insightful studies have shown how choices in methodology, such as the model of sequence evolution or the optimality criterion used, can strongly influence inference. In contrast, much less is known about how the properties of the data, typically genes, affect phylogenetic inference. We investigated the relationships of 52 gene properties (24 sequence-based, 19 function-based, and 9 tree-based) with each other and with three measures of phylogenetic signal in two assembled data sets of 2,832 yeast and 2,002 mammalian genes. We found that most gene properties, such as evolutionary rate (measured as the average percent pairwise identity across taxa) and total tree length, were highly correlated with each other. Similarly, several gene properties, such as gene alignment length, guanine-cytosine (GC) content, and the proportion of tree distance on internal branches divided by relative composition variability (treeness/RCV), were strongly correlated with phylogenetic signal. Analysis of partial correlations between gene properties and phylogenetic signal, in which gene evolutionary rate and alignment length were simultaneously controlled, showed similar patterns of correlation, albeit weaker in strength. Examination of the relative importance of each gene property for phylogenetic signal identified gene alignment length, along with the numbers of parsimony-informative sites and variable sites, as the most important predictors. Interestingly, the subsets of gene properties that optimally predicted phylogenetic signal differed considerably across our three phylogenetic measures and two data sets; however, gene alignment length and RCV were consistently included as predictors of all three phylogenetic measures in both yeasts and mammals. These results suggest that a handful of sequence-based gene properties are reliable predictors of phylogenetic signal and could be useful in guiding the choice of phylogenetic markers.
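To illustrate what it means to control for evolutionary rate and alignment length simultaneously, the sketch below computes a partial correlation with the standard recursive formula. It is a minimal illustration, not the authors' pipeline: the use of Pearson correlation, the function names `pearson` and `partial_corr`, and the toy vectors are all assumptions made here for demonstration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y with every variable in `controls` held fixed.

    Uses the recursion
        r_xy.Zw = (r_xy.Z - r_xw.Z * r_yw.Z)
                  / sqrt((1 - r_xw.Z**2) * (1 - r_yw.Z**2)),
    removing one control variable w at a time.
    """
    if not controls:
        return pearson(x, y)
    w, rest = controls[-1], controls[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xw = partial_corr(x, w, rest)
    r_yw = partial_corr(y, w, rest)
    return (r_xy - r_xw * r_yw) / math.sqrt((1 - r_xw ** 2) * (1 - r_yw ** 2))


# Toy data (hypothetical, not from the study): a gene property and a
# signal measure that are related only through two shared covariates.
rate = [-1, -1, 1, 1, -1, -1, 1, 1]        # stand-in for evolutionary rate
length = [1, -1, -1, 1, 1, -1, -1, 1]      # stand-in for alignment length
noise_a = [-1, 1, -1, 1, -1, 1, -1, 1]
noise_b = [-1, -1, -1, -1, 1, 1, 1, 1]
gene_property = [r + l + e for r, l, e in zip(rate, length, noise_a)]
signal = [r + l + e for r, l, e in zip(rate, length, noise_b)]

raw = pearson(gene_property, signal)
controlled = partial_corr(gene_property, signal, [rate, length])
print(f"raw correlation: {raw:.3f}")
print(f"partial correlation (rate and length controlled): {controlled:.3f}")
```

In this toy case the raw correlation is driven entirely by the two shared covariates, so it vanishes once they are controlled. In the abstract the correlations weakened but kept a similar pattern, which indicates that rate and alignment length explain only part of the association.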