Zaitlen Noah A, Kang Hyun Min, Feolo Michael L, Sherry Stephen T, Halperin Eran, Eskin Eleazar
Bioinformatics Program, University of California, San Diego, La Jolla, California 92093, USA.
Genome Res. 2005 Nov;15(11):1594-600. doi: 10.1101/gr.4297805.
In an effort to understand human variation and the genetic basis of complex disease, a tremendous number of single nucleotide polymorphisms (SNPs) have been discovered and deposited into NCBI's public dbSNP database. More than 2.7 million SNPs in the database have genotype information. These data provide an invaluable resource for understanding the structure of human variation and for designing genetic association studies. The genotypes deposited in dbSNP are unphased, so the haplotype information is unknown. We applied the phasing method HAP to obtain the haplotype information, block partitions, and tag SNPs for all publicly available genotype data and deposited this information into dbSNP. We also deposited the orthologous chimpanzee reference sequence for each predicted haplotype block, computed using the UCSC BLASTZ alignments of human and chimpanzee. With dbSNP, researchers can now easily analyze multiple genotype data sets from the same genomic regions. We combined dense and sparse genotype data sets from the same region and found that the number of common haplotypes is significantly underestimated in whole-genome data sets, while the predicted haplotypes over the common SNPs are consistent between studies. To validate the accuracy of the predictions, we benchmarked HAP's running time and phasing accuracy against PHASE. Although HAP is slightly less accurate than PHASE, it is more than 1000 times faster, making it suitable for application to the entire set of genotypes in dbSNP.
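To illustrate why unphased genotypes leave the haplotype information unknown, the following minimal Python sketch enumerates every haplotype pair consistent with a single unphased genotype. It is not the HAP or PHASE algorithm. The function name `compatible_haplotype_pairs` and the 0/1/2 genotype encoding (number of copies of the alternate allele at each SNP) are assumptions made for illustration only.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1, or 2, the count of the alternate allele at
    each SNP (hypothetical encoding chosen for this sketch).
    Returns a list of (haplotype_1, haplotype_2) tuples of 0/1 alleles.
    """
    # Only heterozygous sites (count == 1) are ambiguous.
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
    # Heterozygous sites get a placeholder 0 that is overwritten below.
    base = [g // 2 for g in genotype]

    pairs = []
    # Fix the first heterozygous site to allele 0 on haplotype 1 so that
    # each unordered pair is produced exactly once.
    free_sites = max(len(het_sites) - 1, 0)
    for bits in product((0, 1), repeat=free_sites):
        assignment = (0,) + bits if het_sites else ()
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, assignment):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    # Three heterozygous sites give 2**(3 - 1) = 4 candidate pairs.
    for h1, h2 in compatible_haplotype_pairs((1, 2, 1, 0, 1)):
        print(h1, h2)
```

A genotype with k heterozygous sites admits 2^(k-1) candidate pairs, so the ambiguity grows exponentially with k. A phasing method has to choose among these candidates, typically by using information shared across many individuals.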