Electrical Engineering Department, Columbia University, New York NY 10027, USA.
BMC Bioinformatics. 2013 Sep 8;14:270. doi: 10.1186/1471-2105-14-270.
DNA pooling is a cost-effective alternative to individual genotyping in genome-wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample, and the frequency of each allele at each position is observed in a single genotyping experiment. Beyond single-locus analysis, estimating haplotype frequencies from pooled data is of separate interest in these studies, because haplotypes can increase statistical power and provide additional insight.
We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data, based on the sparse representation of the DNA pools in a dictionary of haplotypes. We also present extensions to scenarios where the data are noisy or missing. The method is first applied to simulated data based on the haplotypes of the AGT gene and their associated frequencies. We further evaluate it on datasets consisting of SNPs from the first 7 Mb of the HapMap CEU population. Noise and missing data were then introduced into these datasets to test the extensions of the proposed method. HIPPO and HAPLOPOOL were applied to the same datasets to compare performance.
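The idea of representing pools sparsely in a haplotype dictionary can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in exhaustive search for whatever optimization the paper uses, and it assumes noise-free pools given as integer allele counts (observed frequency times the number of chromosomes in the pool). The function names `explain_pool` and `sparsest_haplotype_set` are hypothetical, and the approach is only feasible for a handful of SNPs.

```python
from itertools import combinations, combinations_with_replacement, product


def explain_pool(counts, haplotypes, n_chrom):
    """Return indices of n_chrom haplotypes (with repetition) whose
    per-SNP allele sums equal `counts`, or None if no such choice exists."""
    for combo in combinations_with_replacement(range(len(haplotypes)), n_chrom):
        sums = [sum(haplotypes[i][j] for i in combo) for j in range(len(counts))]
        if sums == list(counts):
            return combo
    return None


def sparsest_haplotype_set(pools, n_chrom):
    """Brute-force maximum parsimony (illustrative assumption, not the
    paper's solver): find the smallest set of haplotypes that explains
    every pool exactly, and return their overall frequencies."""
    n_snps = len(pools[0])
    # Dictionary of all 2^n_snps candidate haplotypes (0/1 alleles per SNP).
    dictionary = list(product((0, 1), repeat=n_snps))
    for k in range(1, len(dictionary) + 1):  # try the smallest sets first
        for subset in combinations(dictionary, k):
            assignments = [explain_pool(p, subset, n_chrom) for p in pools]
            if all(a is not None for a in assignments):
                tally = {h: 0 for h in subset}
                for a in assignments:
                    for i in a:
                        tally[subset[i]] += 1
                total = n_chrom * len(pools)
                return {h: c / total for h, c in tally.items()}
    return None


# Two pools of 2 individuals each (4 chromosomes per pool), 3 SNPs.
# Counts are the number of copies of allele "1" at each SNP.
pools = [(1, 1, 3), (2, 2, 2)]
result = sparsest_haplotype_set(pools, n_chrom=4)
print(result)
```

In this toy input no single haplotype can explain both pools, so the search settles on two haplotypes. Exhaustive enumeration grows exponentially with the number of SNPs and pool size, which is why a practical method needs a dedicated sparse-recovery formulation and explicit handling of noisy or missing measurements.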
We evaluate our methodology in scenarios where pooling is more efficient than individual genotyping, that is, in datasets whose pools contain a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.