Department of Integrative Biology, University of California, Berkeley, Berkeley, California 94720, USA.
Genome Res. 2013 Nov;23(11):1852-61. doi: 10.1101/gr.157388.113. Epub 2013 Aug 15.
Most methods for next-generation sequencing (NGS) data analysis incorporate allele frequency information by using the assumption of Hardy-Weinberg equilibrium (HWE) as a prior. However, many organisms, including those that are domesticated, partially selfing, or have asexual life cycles, show strong deviations from HWE. For such species, and especially for low-coverage data, it is necessary to estimate the inbreeding coefficient (F) of each individual before calling genotypes. Here, we present two methods for estimating inbreeding coefficients from NGS data, both based on an expectation-maximization (EM) algorithm. We assess the impact of taking inbreeding into account when calling genotypes or estimating the site frequency spectrum (SFS), and find a marked increase in accuracy for low-coverage, highly inbred samples. We demonstrate the applicability and efficacy of these methods on both simulated and real data sets.
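To make the idea concrete, the following is a minimal sketch of how an inbreeding-aware genotype prior and an EM estimate of a single individual's F might look. It is not the implementation described in the paper. It assumes biallelic sites, known allele frequencies, and per-site genotype likelihoods already computed from the reads, and it uses the standard model in which an individual's two alleles are identical by descent with probability F and are otherwise drawn under HWE. The function names (genotype_priors, estimate_F, call_genotype) are hypothetical.

```python
def genotype_priors(p, F):
    """Prior probabilities of carrying 0, 1, or 2 copies of an allele
    with frequency p, for an individual with inbreeding coefficient F."""
    q = 1.0 - p
    return (
        (1.0 - F) * q * q + F * q,  # 0 copies (homozygous, other allele)
        (1.0 - F) * 2.0 * p * q,    # 1 copy (heterozygous)
        (1.0 - F) * p * p + F * p,  # 2 copies (homozygous)
    )


def estimate_F(likelihoods, freqs, F_init=0.1, max_iter=200, tol=1e-8):
    """EM estimate of one individual's inbreeding coefficient.

    likelihoods: one (L0, L1, L2) tuple per site, the likelihood of the
                 reads given 0, 1, or 2 copies of the allele.
    freqs:       allele frequency per site (assumed known here).
    """
    F = F_init
    for _ in range(max_iter):
        total, used = 0.0, 0
        for (l0, l1, l2), p in zip(likelihoods, freqs):
            q = 1.0 - p
            # E-step: weight of the "identical by descent" component
            # versus the HWE component at this site.
            ibd = F * (q * l0 + p * l2)
            hwe = (1.0 - F) * (q * q * l0 + 2.0 * p * q * l1 + p * p * l2)
            if ibd + hwe > 0.0:
                total += ibd / (ibd + hwe)
                used += 1
        if used == 0:
            return F
        # M-step: F is the mean posterior probability of identity by descent.
        F_new = total / used
        if abs(F_new - F) < tol:
            return F_new
        F = F_new
    return F


def call_genotype(site_likelihoods, p, F):
    """Maximum a posteriori genotype (0, 1, or 2 copies) at one site."""
    priors = genotype_priors(p, F)
    posterior = [pr * lk for pr, lk in zip(priors, site_likelihoods)]
    return max(range(3), key=lambda g: posterior[g])
```

With F = 0 the prior reduces to ordinary HWE proportions. As F approaches 1, the heterozygote prior shrinks toward zero, so ambiguous low-coverage sites are called as homozygous more often, which is the effect the abstract describes for highly inbred samples.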