G3 (Bethesda). 2011 Jun;1(1):35-42. doi: 10.1534/g3.111.000174. Epub 2011 Jun 1.
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals' diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1-2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12-18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
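To illustrate what "utilizing the probability that a signal intensity may be derived from different underlying copy numbers" can look like in practice, here is a minimal single-locus sketch. It is not the algorithm reported in the paper, which handles multiple loci and an error model fitted to real microarray data. It is a standard expectation-maximization (EM) estimate of haplotype copy-number frequencies, assuming random mating (Hardy-Weinberg proportions) and assuming that per-individual likelihoods of each total copy number are already available. The function name `em_cnv_haplotype_freqs` and its parameters are hypothetical, chosen only for this example.

```python
from itertools import product


def em_cnv_haplotype_freqs(likelihoods, max_hap_copies=2, n_iter=200, tol=1e-10):
    """Estimate haplotype copy-number frequencies at one CNV locus by EM.

    likelihoods[i][t] is P(signal of individual i | total copy number t),
    for t = 0 .. 2 * max_hap_copies. Using these soft values instead of a
    single hard copy-number call is the "probabilistic" part of the sketch.
    Assumes Hardy-Weinberg proportions (an assumption of this example).
    """
    k = max_hap_copies + 1            # haplotypes carry 0 .. max_hap_copies copies
    freqs = [1.0 / k] * k             # start from uniform frequencies
    n = len(likelihoods)

    for _ in range(n_iter):
        counts = [0.0] * k
        for lik in likelihoods:
            # E-step: posterior weight of every ordered diplotype (a, b)
            weights = {
                (a, b): freqs[a] * freqs[b] * lik[a + b]
                for a, b in product(range(k), repeat=2)
            }
            total = sum(weights.values())
            if total == 0.0:
                raise ValueError("likelihoods incompatible with current frequencies")
            for (a, b), w in weights.items():
                counts[a] += w / total
                counts[b] += w / total

        # M-step: expected haplotype counts divided by 2N chromosomes
        new_freqs = [c / (2 * n) for c in counts]
        converged = max(abs(x - y) for x, y in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


if __name__ == "__main__":
    # Three hypothetical individuals; columns are total copy numbers 0..4.
    example = [
        [0.05, 0.15, 0.70, 0.09, 0.01],
        [0.01, 0.10, 0.60, 0.25, 0.04],
        [0.00, 0.02, 0.20, 0.70, 0.08],
    ]
    print(em_cnv_haplotype_freqs(example))
```

A deterministic approach would first replace each row of `likelihoods` with a one-hot call at its most likely copy number. The sketch instead keeps the full row, so an ambiguous signal contributes partially to several diplotypes rather than being forced into a possibly wrong one.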