Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, United Kingdom.
Genome Res. 2011 Jun;21(6):952-60. doi: 10.1101/gr.113084.110. Epub 2010 Oct 27.
Reductions in the cost of sequencing have made it possible to use whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage and then combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of each SNP candidate by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine the posterior probability that the SNP is real, as well as the most probable phased genotype call for each individual. We present experiments on both simulated data and real data from the 1000 Genomes Project to demonstrate the applicability of the methods. We also explore the trade-off between sequencing depth and the number of sequenced samples.