Department of Ecology and Genetics, University of Oulu, Pentti Kaiteran katu 1, FI-90014, Oulu, Finland.
Department of Zoology, Institute of Ecology and Earth Sciences, University of Tartu, Vanemuise 46, EE-51014 Tartu, Estonia.
Syst Biol. 2018 Nov 1;67(6):925-939. doi: 10.1093/sysbio/syy029.
A rapid shift from traditional Sanger sequencing-based molecular methods to the phylogenomic approach with large numbers of loci is underway. Among phylogenomic methods, restriction site associated DNA (RAD) sequencing approaches have gained much attention because they enable rapid generation of up to thousands of loci randomly scattered across the genome and are suitable for nonmodel species. RAD data sets, however, suffer from large amounts of missing data and from rapid locus dropout as relatedness among taxa decreases. The relationship between locus dropout and the amount of phylogenetic information retained in the data has remained largely uninvestigated. Similarly, phylogenetic hypotheses based on RAD have rarely been compared with phylogenetic hypotheses based on multilocus Sanger sequencing, even less so using exactly the same species and specimens. We compared the Sanger-based phylogenetic hypothesis (8 loci; 6172 bp) of 32 species of the diverse moth genus Eupithecia (Lepidoptera, Geometridae) to that based on double-digest RAD sequencing (3256 loci; 726,658 bp). We observed that topologies were largely congruent, with some notable exceptions that we discuss. The locus dropout effect was strong. We demonstrate that the number of loci is not a precise measure of phylogenetic information, since the number of single-nucleotide polymorphisms (SNPs) may remain low at very shallow phylogenetic levels despite large numbers of loci. As hypothesized, the numbers of SNPs and parsimony-informative SNPs (PIS) are low at shallow phylogenetic levels, peak at intermediate levels, and then decline again at the deepest levels as a result of the decay of available loci. Similarly, we demonstrate with empirical data that locus dropout affects the type of loci retained: loci found in many species tend to show lower interspecific distances than those shared among fewer species.
We also examine the effects of the numbers of loci, SNPs, and PIS on nodal bootstrap support, but our data did not show the positive correlation we expected between them. We conclude that RAD methods provide a powerful tool for phylogenomics at an intermediate phylogenetic level, as indicated by their broad congruence with an eight-gene Sanger data set in a genus of moths. When assessing the quality of the data for phylogenetic inference, the focus should be on the distribution and number of SNPs and PIS rather than on the number of loci.
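The distinction the abstract draws between loci, SNPs, and PIS can be illustrated with a minimal sketch. It assumes the common definition of a parsimony-informative site (at least two character states, each present in at least two taxa) and treats `N`, `-`, and `?` as missing data. The function name `site_counts` and the toy alignment are hypothetical and are not taken from the study or its pipeline.

```python
from collections import Counter

# Characters treated as missing data (an assumption of this sketch).
MISSING = set("N-?")


def site_counts(alignment):
    """Count variable sites (SNPs) and parsimony-informative sites (PIS).

    `alignment` maps a taxon name to an aligned sequence; all sequences
    must have the same length. Returns a (snps, pis) tuple.
    """
    seqs = [seq.upper() for seq in alignment.values()]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all sequences must have the same length")

    snps = 0
    pis = 0
    for column in zip(*seqs):
        # Tally the observed bases at this site, skipping missing data.
        counts = Counter(base for base in column if base not in MISSING)
        if len(counts) >= 2:
            snps += 1
            # Informative: at least two states each seen in two or more taxa.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                pis += 1
    return snps, pis


# Toy data, invented for illustration only.
toy = {
    "sp1": "ACGTAN",
    "sp2": "ACGTTA",
    "sp3": "ATGCTA",
    "sp4": "ATGCTA",
}
print(site_counts(toy))  # (3, 2)
```

In this toy locus, three sites vary but only two are parsimony informative, because the fifth site differs in a single taxon. Summing such counts over the loci retained at a given node is one way to track SNPs and PIS rather than locus counts alone.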