Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.
Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilian University of Munich, Planegg-Martinsried, Germany.
Mol Ecol Resour. 2018 Nov;18(6):1188-1195. doi: 10.1111/1755-0998.12933. Epub 2018 Aug 16.
The genomics revolution has led to the sequencing of a large variety of nonmodel organisms, and the resulting assemblies are often referred to as "whole" or "complete" genomes. But how complete are these, really? Here, we use birds as an example of nonmodel vertebrates and find that, although suitable in principle for genomic studies, the current standard of short-read assemblies misses a significant proportion of the expected genome size (7% to 42%; mean 20 ± 9%). In particular, regions with strongly deviating nucleotide composition (e.g., guanine-cytosine [GC]-rich regions) and regions highly enriched in repetitive DNA (e.g., transposable elements and satellite DNA) are usually underrepresented in assemblies. However, long-read sequencing technologies successfully characterize many of these underrepresented GC-rich or repeat-rich regions in several bird genomes. For instance, only ~2% of the expected total base pairs are missing in the latest chicken reference (galGal5). Even these assemblies still contain thousands of gaps (i.e., fragmented sequences) because some chromosomal structures (e.g., centromeres) likely contain arrays of repetitive DNA that are too long to bridge with currently available technologies. We discuss how to minimize the number of assembly gaps by combining the latest available technologies with complementary strengths. Finally, we emphasize the importance of knowing the location, size and potential content of assembly gaps when making population genetic inferences about adjacent genomic regions.
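The two quantities the abstract relies on, the proportion of the expected genome size missing from an assembly and the location and size of assembly gaps, can be illustrated with a minimal Python sketch. It is not the authors' method. The function names `missing_fraction` and `find_gaps` are hypothetical, the sketch assumes gaps are encoded as runs of `N` (a common convention in FASTA assemblies), and the sizes in the usage note are made-up illustrative values, not data from the paper.

```python
import re


def missing_fraction(expected_bp: int, assembled_bp: int) -> float:
    """Fraction of the expected genome size that is absent from the assembly.

    expected_bp:  expected genome size in base pairs (e.g. from an
                  independent estimate; assumed to be supplied by the user).
    assembled_bp: number of base pairs actually present in the assembly.
    """
    if expected_bp <= 0:
        raise ValueError("expected_bp must be positive")
    return (expected_bp - assembled_bp) / expected_bp


def find_gaps(sequence: str) -> list[tuple[int, int]]:
    """Return (start, length) for every run of N/n in the sequence.

    Assumption: assembly gaps are represented as runs of 'N' characters.
    Start positions are 0-based.
    """
    return [
        (match.start(), match.end() - match.start())
        for match in re.finditer(r"[Nn]+", sequence)
    ]
```

As a usage illustration with hypothetical numbers, `missing_fraction(1_250_000_000, 1_000_000_000)` gives a missing proportion of one fifth, and `find_gaps("ACGTNNNNACGTNNAC")` reports two gaps: one of length 4 starting at position 4 and one of length 2 starting at position 12. Knowing these coordinates makes it possible to flag genomic windows adjacent to gaps before making population genetic inferences about them.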