Prabhu Snehit, Pe'er Itsik
Department of Computer Science, Columbia University, New York, New York 10025, USA.
Genome Res. 2009 Jul;19(7):1254-61. doi: 10.1101/gr.088559.108. Epub 2009 May 15.
Resequencing genomic DNA from pools of individuals is an effective strategy to detect new variants in targeted regions and compare them between cases and controls. There are numerous ways to assign individuals to the pools on which they are to be sequenced. The naïve, disjoint pooling scheme (many individuals to one pool) in predominant use today offers insight into allele frequencies, but does not reveal the identity of an allele carrier. We present a framework for overlapping pool design, where each individual sample is resequenced in several pools (many individuals to many pools). Upon discovering a variant, the set of pools where this variant is observed reveals the identity of its carrier. We formalize the mathematical framework for such pool designs and list the requirements for such designs. We specifically address three practical concerns for pooled resequencing designs: (1) false-positives due to errors introduced during amplification and sequencing; (2) false-negatives due to undersampling particular alleles aggravated by nonuniform coverage; and consequently, (3) ambiguous identification of individual carriers in the presence of errors. We build on the theory of error-correcting codes to design pools that overcome these pitfalls. We show that for practical parameters of resequencing studies, our designs guarantee a high probability of unambiguous singleton carrier identification while maintaining the features of naïve pools in terms of sensitivity, specificity, and the ability to estimate allele frequencies. We demonstrate the ability of our designs to extract rare variants using short read data from the 1000 Genomes Pilot 3 project.
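The idea that the set of pools showing a variant identifies its carrier, even when some pools are wrong, can be illustrated with a toy sketch. The code below is not the authors' construction: the greedy codeword selection, the nearest-signature decoding rule, and the names `build_design` and `identify_carrier` are all assumptions made for illustration. Each individual is assigned a "signature" (the set of pools containing that individual's sample), signatures are kept pairwise far apart in Hamming distance, and a singleton carrier is decoded as the individual whose signature is closest to the observed pool set.

```python
from itertools import combinations


def build_design(n_individuals, n_pools, weight, min_distance):
    """Greedily choose one pool signature per individual (hypothetical helper).

    A signature is the set of pool indices in which an individual is sequenced.
    Every signature has `weight` pools, and any two signatures differ in at
    least `min_distance` pools (Hamming distance of their indicator vectors).
    """
    signatures = []
    for candidate in combinations(range(n_pools), weight):
        cand = frozenset(candidate)
        # The symmetric difference counts the pools where two signatures disagree.
        if all(len(cand ^ s) >= min_distance for s in signatures):
            signatures.append(cand)
            if len(signatures) == n_individuals:
                return signatures
    raise ValueError("not enough pools for the requested design")


def identify_carrier(signatures, observed_pools):
    """Return the index of the unique closest signature, or None if tied.

    `observed_pools` is the set of pools in which the variant was called. It
    may contain false positives (extra pools) or miss pools (false negatives).
    """
    observed = frozenset(observed_pools)
    distances = [len(observed ^ s) for s in signatures]
    best = min(distances)
    winners = [i for i, d in enumerate(distances) if d == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    # Assumed toy parameters: 4 individuals, 8 pools, 4 pools per individual.
    design = build_design(n_individuals=4, n_pools=8, weight=4, min_distance=4)
    true_carrier = 2
    # Simulate one false negative: the variant is missed in one of the carrier's pools.
    observed = set(design[true_carrier])
    observed.remove(max(observed))
    print(identify_carrier(design, observed) == true_carrier)
```

With a minimum distance of 4 between signatures, any single erroneous pool (one false positive or one false negative) leaves the true carrier's signature at distance 1 and every other signature at distance 3 or more, so the decoding stays unambiguous. Two or more errors can produce ties, which this sketch reports as `None`.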