Department of Genetics, Evolution and Environment, University College London, London, United Kingdom.
Department of Mathematics, Beijing Jiaotong University, Beijing, P.R. China.
Mol Biol Evol. 2020 Nov 1;37(11):3211-3224. doi: 10.1093/molbev/msaa166.
We use computer simulation to examine the information content of multilocus data sets for inference under the multispecies coalescent model. The inference problems considered include estimation of evolutionary parameters (such as species divergence times, population sizes, and cross-species introgression probabilities), species tree estimation, and species delimitation based on Bayesian comparison of delimitation models. We find that the number of loci is the most influential factor for almost all inference problems examined. Although the number of sequences per species does not appear to be important for species tree estimation, it strongly influences species delimitation. Increasing the number of sites and increasing the per-site mutation rate both raise the mutation rate for the whole locus, and the two have the same effect on parameter estimation; for species tree estimation, however, sequence length has a greater effect than the per-site mutation rate. We discuss how computational cost grows with data size and provide guidelines for subsampling genomic data so that full-likelihood methods of inference can be applied.
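The statement that sequence length and per-site mutation rate both scale the mutation rate of the whole locus can be illustrated with a minimal sketch. It is not the simulation design used in the study. It assumes two sequences sampled from a single population, an infinite-sites mutation model, and the parameterization theta = 4Nμ per site; the function names and the example values (0.01 with 500 sites, 0.005 with 1000 sites) are hypothetical. Under these assumptions the expected number of differences per locus depends only on the product of theta and the number of sites.

```python
import math
import random


def expected_pairwise_differences(theta, n_sites):
    """Expected number of differences between two sequences at one locus.

    Assumes theta = 4*N*mu per site and an infinite-sites mutation model,
    so the expectation is simply theta * n_sites.
    """
    return theta * n_sites


def simulate_pairwise_differences(theta, n_sites, n_loci, rng):
    """Simulate the number of differences at each of n_loci independent loci."""
    counts = []
    for _ in range(n_loci):
        # Coalescent time of two sequences, measured in expected mutations
        # per site: exponential with mean theta / 2.
        t = rng.expovariate(2.0 / theta)
        # Mutations fall on both branches (total length 2t) over n_sites sites.
        rate = 2.0 * t * n_sites
        # Draw a Poisson(rate) count by counting arrivals of a unit-rate
        # Poisson process in the interval [0, rate].
        k, elapsed = 0, rng.expovariate(1.0)
        while elapsed < rate:
            k += 1
            elapsed += rng.expovariate(1.0)
        counts.append(k)
    return counts


if __name__ == "__main__":
    rng = random.Random(1)
    # Halving theta while doubling the sequence length leaves the
    # locus-wide expectation unchanged.
    for theta, n_sites in [(0.01, 500), (0.005, 1000)]:
        counts = simulate_pairwise_differences(theta, n_sites, 2000, rng)
        mean = sum(counts) / len(counts)
        print(theta, n_sites, expected_pairwise_differences(theta, n_sites), round(mean, 2))
```

The sketch covers only the expected mutation count per locus. It does not reproduce the reported difference between sequence length and per-site rate for species tree estimation, which depends on the full multispecies coalescent analysis.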