Hazkani-Covo Einat, Martin William F
Department of Natural and Life Sciences, The Open University of Israel, Ra'anana, Israel.
Institute of Molecular Evolution, Heinrich-Heine University, Düsseldorf, Germany.
Genome Biol Evol. 2017 May 1;9(5):1190-1203. doi: 10.1093/gbe/evx078.
Fragments of organelle genomes are often found as insertions in nuclear DNA. These fragments of mitochondrial DNA (numts) and plastid DNA (nupts) are ubiquitous components of eukaryotic genomes. They are, however, often edited out during the genome assembly process, leading to systematic underestimation of their frequency. Numts and nupts, once inserted, can become further fragmented through subsequent insertion of mobile elements or other recombinational events that disrupt the continuity of the inserted sequence relative to the genuine organelle DNA copy. Because numts and nupts are typically identified through sequence comparison tools such as BLAST, disruption of insertions into smaller fragments can lead to systematic overestimation of numt and nupt frequencies. Accurate identification of numts and nupts is important, however, both for better understanding of their role during evolution, and for monitoring their increasingly evident role in human disease. Human populations are polymorphic for 141 numt loci, five numts are causal to genetic disease, and cancer genomic studies are revealing an abundance of numts associated with tumor progression. Here, we report investigation of salient parameters involved in obtaining accurate estimates of numt and nupt numbers in genome sequence data. Numts and nupts from 44 sequenced eukaryotic genomes reveal lineage-specific differences in the number, relative age and frequency of insertional events as well as lineage-specific dynamics of their postinsertional fragmentation. Our findings outline the main technical parameters influencing accurate identification and frequency estimation of numts in genomic studies pertinent to both evolution and human health.
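The overestimation described in the abstract can be illustrated with a small sketch: when one insertion has been split into several BLAST hits, counting raw hits inflates the number of insertion events, whereas grouping nearby hits first gives a lower count. The code below is only an illustration under stated assumptions and is not the procedure used by the authors. The `Hit` record, the `merge_hits` function, the example coordinates and the `max_gap` value of 2000 bp are all hypothetical, and the grouping uses nuclear coordinates only, ignoring whether the fragments are colinear in the organelle genome.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one organelle-to-nuclear BLAST hit."""
    chrom: str   # nuclear chromosome or scaffold name
    start: int   # nuclear start coordinate (inclusive)
    end: int     # nuclear end coordinate (inclusive)


def merge_hits(hits, max_gap=2000):
    """Group hits on the same chromosome into candidate insertion events.

    A hit joins the current event when its start minus the event's end
    is at most max_gap. The threshold is an illustrative assumption,
    not a value taken from the paper.
    """
    by_chrom = defaultdict(list)
    for hit in hits:
        by_chrom[hit.chrom].append(hit)

    events = []
    for chrom in sorted(by_chrom):
        current = None
        for hit in sorted(by_chrom[chrom], key=lambda h: (h.start, h.end)):
            if current is not None and hit.start - current.end <= max_gap:
                # Extend the current event to cover this fragment.
                current = Hit(chrom, current.start, max(current.end, hit.end))
            else:
                if current is not None:
                    events.append(current)
                current = hit
        if current is not None:
            events.append(current)
    return events


# Made-up example: the first two hits lie 300 bp apart and are
# treated as fragments of one insertion.
hits = [
    Hit("chr1", 1000, 1500),
    Hit("chr1", 1800, 2400),
    Hit("chr1", 50000, 50300),
    Hit("chr2", 700, 900),
]
events = merge_hits(hits, max_gap=2000)
print(len(hits), "raw hits ->", len(events), "candidate insertion events")
```

With these made-up coordinates, four raw hits collapse into three candidate events. The choice of `max_gap` changes the result directly: a very small value reproduces the raw hit count, while a very large value can merge independent insertions, so any real analysis would need to justify the threshold.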