Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK.
School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK.
Mol Ecol Resour. 2023 Jul;23(5):1002-1013. doi: 10.1111/1755-0998.13764. Epub 2023 Feb 15.
Inserts of DNA from extranuclear sources, such as organelles and microbes, are common in eukaryote nuclear genomes. However, sequence similarity between the nuclear and extranuclear DNA, and a history of multiple insertions, make the assembly of these regions challenging. Consequently, the number, sequence and location of these vagrant DNAs cannot be reliably inferred from the genome assemblies of most organisms. We introduce two statistical methods to estimate the abundance of nuclear inserts even in the absence of a nuclear genome assembly. The first (intercept method) only requires low-coverage (<1×) sequencing data, as commonly generated for population studies of organellar and ribosomal DNAs. The second method additionally requires that a subset of the individuals carry extranuclear DNA with diverged genotypes. We validated our intercept method using simulations and by re-estimating the frequency of human NUMTs (nuclear mitochondrial inserts). We then applied it to the grasshopper Podisma pedestris, exceptional for both its large genome size and reports of numerous NUMT inserts, estimating that NUMTs make up 0.056% of the nuclear genome, equivalent to >500 times the mitochondrial genome size. We also re-analysed a museomics data set of the parrot Psephotellus varius, obtaining an estimate of only 0.0043%, in line with reports from other species of bird. Our study demonstrates the utility of low-coverage high-throughput sequencing data for the quantification of nuclear vagrant DNAs. Beyond quantifying organellar inserts, these methods could also be used on endosymbiont-derived sequences. We provide an R implementation of our methods called "vagrantDNA" and code to simulate test data sets.
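The two figures quoted for Podisma pedestris (0.056% of the nuclear genome, and more than 500 mitochondrial genome lengths) are linked by simple arithmetic, sketched below. This is a rough consistency check, not part of the "vagrantDNA" package. The function names are hypothetical, and the mitochondrial genome length (16 kb) and nuclear genome size (15 Gb) are illustrative assumptions that do not appear in the abstract.

```python
def insert_multiples(insert_fraction, nuclear_genome_bp, organelle_genome_bp):
    """Total insert length, expressed in multiples of the organelle genome length."""
    if not 0.0 <= insert_fraction <= 1.0:
        raise ValueError("insert_fraction must be a proportion between 0 and 1")
    if nuclear_genome_bp <= 0 or organelle_genome_bp <= 0:
        raise ValueError("genome sizes must be positive")
    return insert_fraction * nuclear_genome_bp / organelle_genome_bp


def implied_nuclear_genome_bp(insert_fraction, multiples, organelle_genome_bp):
    """Nuclear genome size at which the inserts add up to `multiples` organelle genomes."""
    if insert_fraction <= 0:
        raise ValueError("insert_fraction must be positive")
    return multiples * organelle_genome_bp / insert_fraction


# Value taken from the abstract: NUMTs make up 0.056% of the nuclear genome.
NUMT_FRACTION = 0.056 / 100

# ASSUMPTIONS for illustration only (not stated in the abstract):
ASSUMED_MITO_BP = 16_000              # typical order of magnitude for an animal mitogenome
ASSUMED_NUCLEAR_BP = 15_000_000_000   # a hypothetical "large" genome of 15 Gb

multiples = insert_multiples(NUMT_FRACTION, ASSUMED_NUCLEAR_BP, ASSUMED_MITO_BP)
threshold_bp = implied_nuclear_genome_bp(NUMT_FRACTION, 500, ASSUMED_MITO_BP)

print(f"NUMT content under the assumed sizes: about {multiples:.0f} mitogenome lengths")
print(f"Genome size at which 0.056% equals 500 mitogenomes: about {threshold_bp / 1e9:.1f} Gb")
```

Under these assumed sizes, the ">500 times" figure holds only if the nuclear genome is larger than roughly 14 Gb, which fits the abstract's description of the species as having an exceptionally large genome.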