BGI-Shenzhen, Build 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China.
BGI Genomics, BGI-Shenzhen, Building NO. 7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen, 518083, China.
Gigascience. 2017 Sep 1;6(9):1-7. doi: 10.1093/gigascience/gix067.
Next-generation sequencing provides high-resolution insight into human genetic information. However, because of the high cost of sequencing, previous studies have focused primarily on low-coverage data. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call accurately from low-coverage data. Deep sequencing offers a solution to this problem. Although whole-exome sequencing is a viable choice for exonic regions, it does not cover noncoding regions, so important causal variants can be missed. For Han Chinese populations, the majority of known variants have been discovered from the low-coverage data of the 1000 Genomes Project. High-coverage whole-genome sequencing data remain limited for any population, and a large number of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at high depth (~80×) of 90 unrelated individuals of Chinese ancestry drawn from the 1000 Genomes Project samples, comprising 45 Northern Han Chinese and 45 Southern Han Chinese. Eighty-three of these 90 individuals have also been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variants in these 90 samples. Compared with the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel low-frequency variants (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome.
Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.
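To make the low-frequency definition concrete, the following minimal sketch shows what a minor allele frequency (MAF) threshold of 5% would mean in a cohort of 90 diploid individuals, and checks that the three novel-variant counts reported above add up to the stated total. It is an illustration only: the abstract does not say in which sample set the frequencies were computed, so estimating MAF from the 180 chromosomes of these 90 samples is an assumption, as is the choice to treat monomorphic sites as not low-frequency. The function names are hypothetical and are not taken from the authors' pipeline.

```python
from fractions import Fraction

N_SAMPLES = 90                  # cohort size stated in the abstract
N_ALLELES = 2 * N_SAMPLES       # assumption: diploid autosomal sites, no missing genotypes
MAF_THRESHOLD = Fraction(5, 100)  # "minor allele frequency < 5%"


def minor_allele_frequency(alt_count: int, total_alleles: int = N_ALLELES) -> Fraction:
    """Return the frequency of the rarer of the two alleles at a biallelic site."""
    if not 0 <= alt_count <= total_alleles:
        raise ValueError("alt_count must lie between 0 and total_alleles")
    alt_freq = Fraction(alt_count, total_alleles)
    return min(alt_freq, 1 - alt_freq)


def is_low_frequency(alt_count: int, total_alleles: int = N_ALLELES) -> bool:
    """True when the site is polymorphic and its MAF is strictly below 5%.

    Excluding monomorphic sites (MAF == 0) is an assumption of this sketch.
    """
    maf = minor_allele_frequency(alt_count, total_alleles)
    return 0 < maf < MAF_THRESHOLD


# Novel low-frequency variant counts as reported in the abstract.
NOVEL_COUNTS = {
    "snp": 5_813_503,
    "indel": 1_169_199,
    "structural_variant": 17_927,
}
NOVEL_TOTAL = sum(NOVEL_COUNTS.values())

if __name__ == "__main__":
    # Largest alternate-allele count that still passes the strict < 5% filter.
    max_count = max(c for c in range(N_ALLELES // 2 + 1) if is_low_frequency(c))
    print("largest qualifying minor allele count:", max_count)
    print("sum of novel variant counts:", NOVEL_TOTAL)
```

Under these assumptions a variant qualifies as low-frequency only if its minor allele is seen on at most 8 of the 180 chromosomes, because 9/180 equals the 5% threshold exactly and the inequality is strict.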