Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA.
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
Nat Genet. 2019 Jan;51(1):30-35. doi: 10.1038/s41588-018-0273-y. Epub 2018 Nov 19.
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
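As a rough illustration of two steps described above, the following minimal sketch shows how reads that failed to align might be picked out of SAM-format alignment output, and how the reported totals relate to the "~10%" figure. It is not the authors' pipeline. The helper names, the toy SAM records, and the approximate GRCh38 length of 3.1 Gbp are assumptions introduced for illustration; only the contig count and base-pair total come from the abstract. The unmapped-read test relies on bit 0x4 of the SAM FLAG field.

```python
# Figures quoted in the abstract.
NOVEL_BP = 296_485_284
NOVEL_CONTIGS = 125_715

# Assumption: approximate total length of GRCh38 (not stated in the abstract).
REFERENCE_BP_APPROX = 3.1e9

# SAM FLAG bit meaning "segment unmapped".
SAM_FLAG_UNMAPPED = 0x4


def unaligned_read_names(sam_lines):
    """Yield names of reads whose SAM FLAG has the unmapped bit set.

    Hypothetical helper: expects tab-separated SAM text lines and skips
    header lines (starting with '@') and blank lines.
    """
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        if flag & SAM_FLAG_UNMAPPED:
            yield fields[0]  # first SAM column is the read name


def novel_fraction(novel_bp=NOVEL_BP, reference_bp=REFERENCE_BP_APPROX):
    """Novel sequence as a fraction of the (approximate) reference length."""
    return novel_bp / reference_bp


def mean_contig_length(novel_bp=NOVEL_BP, contigs=NOVEL_CONTIGS):
    """Average length in bp of the reported novel contigs."""
    return novel_bp / contigs


if __name__ == "__main__":
    # Toy SAM records, invented for illustration only.
    toy_sam = [
        "@HD\tVN:1.6",
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "read3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "read4\t16\tchr2\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(list(unaligned_read_names(toy_sam)))
    print(f"novel fraction: {novel_fraction():.1%}")
    print(f"mean contig length: {mean_contig_length():.0f} bp")
```

Under the assumed reference length, the novel sequence comes to a little under one tenth of the reference, consistent with the "~10%" stated above, and the average novel contig is a few kilobases long. Assembly of the unaligned reads and the all-against-all contig comparison are not sketched here, since the abstract does not say which tools were used.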