Osabe Dai, Tanahashi Toshihito, Nomura Kyoko, Shinohara Shuichi, Nakamura Naoto, Yoshikawa Toshikazu, Shiota Hiroshi, Keshavarz Parvaneh, Yamaguchi Yuka, Kunika Kiyoshi, Moritani Maki, Inoue Hiroshi, Itakura Mitsuo
Department of Bioinformatics, Division of Life Science Systems, Fujitsu Limited, Higashishinbashi, Minato-ku, Tokyo, Japan.
BMC Bioinformatics. 2007 Jun 14;8:200. doi: 10.1186/1471-2105-8-200.
Genome-wide maps of linkage disequilibrium (LD) and haplotypes have been created for different populations. Substantial sharing of the boundaries and haplotypes among populations was observed, but haplotype variations have also been reported across populations. Conflicting observations on the extent and distribution of haplotypes require careful examination. The mechanisms that shape haplotypes have not been fully explored, although the effect of sample size has been implicated. We present a close examination of the effect of sample size on haplotype blocks using an original computational simulation.
A region spanning 19.31 Mb on chromosome 20q was genotyped for 1,147 SNPs in 725 Japanese subjects. One 445-kb region exhibiting a single strong LD value (average |D'| = 0.94) was selected to analyze the effect of sample size on haplotype structure. Three block definitions (recombination-based, LD-based, and diversity-based) were used to build simulations for block identification, with the theta value taken from the real genotyping data. Haplotype blocks proved difficult to estimate from data with fewer than 200 samples. With 50 samples, a reliable haplotype structure could not be obtained even though the simulation was repeated 10,000 times.
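To illustrate why small samples give unstable LD estimates, the following is a minimal Python sketch, not the simulation used in this study. It computes |D'| for a single pair of biallelic SNPs and re-estimates it on random subsamples of different sizes. The toy haplotype counts, the function names `d_prime` and `subsample_d_prime`, and the choice to resample chromosomes independently (rather than subjects, and without any block definition or theta-based simulation) are all assumptions made for illustration.

```python
import random

def d_prime(haplotypes):
    """Return |D'| for two biallelic loci.

    `haplotypes` is a list of (a, b) tuples, where a and b are 0/1 alleles
    at locus A and locus B on the same phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # freq. of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n          # freq. of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # raw disequilibrium D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # a locus is monomorphic; D' is undefined, report 0 here
    return abs(d) / d_max

def subsample_d_prime(haplotypes, n_subjects, n_repeats, seed=0):
    """Re-estimate |D'| on random subsamples of 2 * n_subjects chromosomes.

    Simplification (assumption): chromosomes are drawn independently,
    without replacement, instead of drawing whole subjects.
    """
    rng = random.Random(seed)
    return [d_prime(rng.sample(haplotypes, 2 * n_subjects))
            for _ in range(n_repeats)]

# Hypothetical toy population: 725 subjects = 1,450 chromosomes in strong LD.
population = ([(1, 1)] * 700 + [(0, 0)] * 700 +
              [(1, 0)] * 25 + [(0, 1)] * 25)

if __name__ == "__main__":
    print("full sample |D'|:", round(d_prime(population), 3))
    for n_subjects in (50, 200, 725):
        values = subsample_d_prime(population, n_subjects, n_repeats=1000)
        print(n_subjects, "subjects: min", round(min(values), 3),
              "max", round(max(values), 3))
```

Comparing the minimum and maximum across repeats for each subsample size gives a rough picture of how much a single pairwise estimate can move. A full block analysis would add one of the block definitions on top of many such pairwise values.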
These analyses underscore the difficulty of estimating haplotype blocks. Obtaining a reliable result would require increasing the sample size beyond 725 and repeating the simulation 3,000 times. Even in a genomic region showing a high LD value, the haplotype block may be fragile. We emphasize the importance of applying careful confidence measures when using estimated haplotype structure in biomedical research.