Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA.
Am J Hum Genet. 2013 Nov 7;93(5):840-51. doi: 10.1016/j.ajhg.2013.09.014. Epub 2013 Oct 24.
Existing methods for identity by descent (IBD) segment detection were designed for SNP array data, not sequence data. Sequence data have a much higher density of genetic variants and a different allele frequency distribution, and can have higher genotype error rates. Consequently, best practices for IBD detection in SNP array data do not necessarily carry over to sequence data. We present a method, IBDseq, for detecting IBD segments in sequence data and a method, SEQERR, for estimating genotype error rates at low-frequency variants by using detected IBD. The IBDseq method estimates probabilities of genotypes observed with error for each pair of individuals under IBD and non-IBD models. The ratio of estimated probabilities under the two models gives a LOD score for IBD. We evaluate several IBD detection methods that are fast enough for application to sequence data (IBDseq, Beagle Refined IBD, PLINK, and GERMLINE) under multiple parameter settings, and we show that IBDseq achieves high power and accuracy for IBD detection in sequence data. The SEQERR method estimates genotype error rates by comparing observed and expected rates of pairs of homozygote and heterozygote genotypes at low-frequency variants in IBD segments. We demonstrate the accuracy of SEQERR in simulated data, and we apply the method to estimate genotype error rates in sequence data from the UK10K and 1000 Genomes projects.
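The LOD-score idea described above can be illustrated with a minimal sketch. This is not the IBDseq implementation; it is a toy model under stated assumptions: biallelic markers with a known minor allele frequency `p`, Hardy-Weinberg proportions, exactly one haplotype shared under the IBD model, a symmetric error model in which a genotype is miscalled with probability `eps` (split evenly between the two other genotypes), and independence between markers. All function names (`hwe`, `joint_true`, `emit`, `observed_prob`, `marker_lod`, `segment_lod`) are hypothetical.

```python
import math
from itertools import product


def hwe(p):
    """Hardy-Weinberg genotype frequencies for minor-allele counts 0, 1, 2."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def joint_true(p, ibd):
    """Joint distribution of two true genotypes at one marker.

    ibd=True  : the pair shares exactly one haplotype (assumed IBD model).
    ibd=False : the two genotypes are independent draws (non-IBD model).
    """
    allele = [1.0 - p, p]
    joint = [[0.0] * 3 for _ in range(3)]
    if ibd:
        # a = allele on the shared haplotype; b, c = the non-shared alleles.
        for a, b, c in product((0, 1), repeat=3):
            joint[a + b][a + c] += allele[a] * allele[b] * allele[c]
    else:
        freq = hwe(p)
        for g1, g2 in product(range(3), repeat=2):
            joint[g1][g2] = freq[g1] * freq[g2]
    return joint


def emit(obs, true, eps):
    """Assumed error model: correct call with prob 1 - eps, else either other genotype."""
    return 1.0 - eps if obs == true else eps / 2.0


def observed_prob(g1, g2, p, eps, ibd):
    """Probability of the observed genotype pair, summing over true genotypes."""
    joint = joint_true(p, ibd)
    return sum(
        joint[t1][t2] * emit(g1, t1, eps) * emit(g2, t2, eps)
        for t1, t2 in product(range(3), repeat=2)
    )


def marker_lod(g1, g2, p, eps):
    """Base-10 log of the IBD to non-IBD probability ratio at one marker (needs eps > 0)."""
    return math.log10(
        observed_prob(g1, g2, p, eps, True) / observed_prob(g1, g2, p, eps, False)
    )


def segment_lod(genos1, genos2, freqs, eps):
    """Sum of per-marker LOD scores, treating markers as independent (an assumption)."""
    return sum(marker_lod(a, b, p, eps) for a, b, p in zip(genos1, genos2, freqs))
```

In this toy model, two individuals who are both heterozygous at a rare variant (for example `marker_lod(1, 1, 0.01, 0.001)`) get a positive score, while opposite homozygotes at a common variant (for example `marker_lod(0, 2, 0.2, 0.001)`) get a negative one. A nonzero `eps` keeps opposite homozygotes from producing a zero probability under the IBD model.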