Brzustowicz L M, Mérette C, Xie X, Townsend L, Gilliam T C, Ott J
Department of Psychiatry, Columbia University, College of Physicians and Surgeons, NY 10032.
Am J Hum Genet. 1993 Nov;53(5):1137-45.
Errors in genotyping data have been shown to have a significant effect on the estimation of recombination fractions in high-resolution genetic maps. Previous estimates of errors in existing databases have been limited to the analysis of relatively few markers and have suggested rates in the range 0.5%-1.5%. The present study capitalizes on the fact that within the Centre d'Etude du Polymorphisme Humain (CEPH) collection of reference families, 21 individuals are members of more than one family, with separate DNA samples provided by CEPH for each appearance of these individuals. By comparing the genotypes of these individuals in each of the families in which they occur, an estimated error rate of 1.4% was calculated for all loci in the version 4.0 CEPH database. Removing those individuals who were clearly identified by CEPH as appearing in more than one family resulted in a 3.0% error rate for the remaining samples, suggesting that some error checking of the identified repeated individuals may occur prior to data submission. An error rate of 3.0% for version 4.0 data was also obtained for four chromosome 5 markers that were retyped through the entire CEPH collection. The effects of these errors on a multipoint map were significant, with a total sex-averaged length of 36.09 cM with the errors, and 19.47 cM with the errors corrected. Several statistical approaches to detect and allow for errors during linkage analysis are presented. One method, which identified families containing possible errors on the basis of the impact on the maximum lod score, showed particular promise, especially when combined with the limited retyping of the identified families. The impact of the demonstrated error rate in an established genotype database on high-resolution mapping is significant, raising the question of the overall value of incorporating such existing data into new genetic maps.
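The duplicate-sample comparison described above can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes that the two typings of one individual err independently at the same rate e, and that two erroneous typings never agree by chance, so the discordance rate d satisfies d = 1 - (1 - e)^2. The locus names (`M1` to `M4`), the genotypes, and the function names are hypothetical and chosen only for illustration.

```python
from math import sqrt


def normalize(genotype):
    """Treat a genotype as an unordered pair of alleles."""
    return tuple(sorted(genotype))


def count_discordances(sample_a, sample_b):
    """Return (discordant, compared) over loci typed in both samples.

    Each sample maps a locus name to a pair of alleles, or to None
    when the locus was not typed.
    """
    discordant = 0
    compared = 0
    for locus, geno_a in sample_a.items():
        geno_b = sample_b.get(locus)
        if geno_a is None or geno_b is None:
            continue  # untyped in at least one sample, so not comparable
        compared += 1
        if normalize(geno_a) != normalize(geno_b):
            discordant += 1
    return discordant, compared


def per_genotype_error(discordance):
    """Solve d = 1 - (1 - e)**2 for e (assumed model, not from the paper)."""
    return 1.0 - sqrt(1.0 - discordance)


# Hypothetical typings of one individual who appears in two families.
first = {"M1": (1, 2), "M2": (3, 3), "M3": (2, 4), "M4": None}
second = {"M1": (2, 1), "M2": (3, 4), "M3": (2, 4), "M4": (1, 1)}

discordant, compared = count_discordances(first, second)
rate = discordant / compared
print(discordant, compared)
print(round(per_genotype_error(rate), 4))
```

For small rates the assumed model gives e ≈ d/2, since a single discordant pair implies that at least one of the two typings is wrong. The abstract does not say how the reported 1.4% and 3.0% figures were derived from the comparisons, so this conversion should be read only as one plausible convention.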