Institut de Biologia Evolutiva, (CSIC-Universitat Pompeu Fabra), PRBB, Doctor Aiguader 88, Barcelona, Catalonia 08003, Spain.
CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028, Barcelona, Spain.
Gigascience. 2017 Nov 1;6(11):1-6. doi: 10.1093/gigascience/gix098.
The chimpanzee is arguably the most important species for the study of human origins. A key resource for these studies is a high-quality reference genome assembly; however, as with most mammalian genomes, the current iteration of the chimpanzee reference genome assembly (Pan_tro_2.1.4) is highly fragmented: the sequence is scattered across more than 183 000 contigs, incorporating more than 159 000 gaps, with a genome-wide contig N50 of 51 Kbp. In this work, we produce an extensive and diverse array of sequencing datasets to rapidly assemble a new chimpanzee reference that surpasses previous iterations in bases represented and organized in large scaffolds. We show substantial improvements over Pan_tro_2.1.4 by several metrics: contiguity increased by >750% for contigs and 300% for scaffolds, and 77% of the gaps in the Pan_tro_2.1.4 assembly were closed, spanning >850 Kbp of novel coding sequence based on RNASeq data. We further report more than 2700 genes that had putatively erroneous frame-shift predictions relative to human in Pan_tro_2.1.4 and show a substantial increase in the annotation of repetitive elements. We apply a simple 3-way hybrid approach to considerably improve the reference genome assembly for the chimpanzee, providing a valuable resource for the study of human origins. Furthermore, all of the sequencing datasets we produce are derived from the same cell line, yielding a broad non-human benchmark dataset.
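The contiguity figures above are stated in terms of N50. As a minimal sketch of the standard definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), the statistic can be computed as follows. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the study's data or pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition: sort lengths from longest to shortest and
    accumulate them; N50 is the length at which the running sum first
    reaches at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (arbitrary length units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

For the toy input the total length is 300, and the two longest contigs (80 and 70) together reach 150, so the N50 is 70. Because N50 depends only on the length distribution, it measures contiguity and says nothing about correctness, which is why the abstract reports it alongside gap closure and annotation metrics.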