Dou Jinzhuang, Sun Baoluo, Sim Xueling, Hughes Jason D, Reilly Dermot F, Tai E Shyong, Liu Jianjun, Wang Chaolong
Computational and Systems Biology, Genome Institute of Singapore, Singapore, Singapore.
Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Singapore.
PLoS Genet. 2017 Sep 29;13(9):e1007021. doi: 10.1371/journal.pgen.1007021. eCollection 2017 Sep.
Knowledge of biological relatedness between samples is important for many genetic studies. In large-scale human genetic association studies, estimated kinship is used to remove cryptic relatedness, control for family structure, and estimate trait heritability. However, estimating kinship is challenging for sparse sequencing data, such as data from off-target regions in target sequencing studies, where genotypes are largely uncertain or missing. Existing methods often assume accurate genotypes at a large number of markers across the genome. We show that these methods, which do not account for the genotype uncertainty in sparse sequencing data, can yield a strong downward bias in kinship estimates. We develop a computationally efficient method called SEEKIN to estimate kinship for both homogeneous samples and heterogeneous samples with population structure and admixture. Our method models genotype uncertainty and leverages linkage disequilibrium through imputation. We test SEEKIN on a whole exome sequencing (WES) dataset of Singapore Chinese and Malays, which involves substantial population structure and admixture. We show that SEEKIN can accurately estimate kinship coefficients and classify genetic relatedness using off-target sequencing data downsampled to approximately 0.15X depth. When applied to the full WES dataset without downsampling, SEEKIN also outperforms existing methods by properly analyzing the shallow off-target data (approximately 0.75X). Using both simulated and real phenotypes, we further illustrate how our method improves estimation of trait heritability for WES studies.
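As a rough illustration of the downward bias described in the abstract, the following Python sketch applies a generic moment-based kinship estimator to a simulated parent-child pair, first with exact genotypes and then with genotypes shrunk toward their expected value. This is not the SEEKIN estimator, whose details the abstract does not give. The shrinkage factor `r`, the function names, the allele-frequency range, and the simulation design are all assumptions made for illustration, with the shrinkage serving as a crude stand-in for uncertain genotype dosages.

```python
import random


def simulate_parent_child(freqs, rng):
    """Simulate 0/1/2 alternate-allele counts for one parent-child pair."""
    parent, child = [], []
    for p in freqs:
        a1 = int(rng.random() < p)          # parent's first allele
        a2 = int(rng.random() < p)          # parent's second allele
        inherited = a1 if rng.random() < 0.5 else a2
        other = int(rng.random() < p)       # allele from the unrelated parent
        parent.append(a1 + a2)
        child.append(inherited + other)
    return parent, child


def kinship(g_i, g_j, freqs):
    """Generic moment estimator: sum (g_i - 2p)(g_j - 2p) / sum 4p(1 - p)."""
    num = sum((a - 2 * p) * (b - 2 * p) for a, b, p in zip(g_i, g_j, freqs))
    den = sum(4 * p * (1 - p) for p in freqs)
    return num / den


def shrink(genotypes, freqs, r):
    """Toy uncertainty model (assumption): pull each genotype toward 2p by factor r."""
    return [2 * p + r * (g - 2 * p) for g, p in zip(genotypes, freqs)]


rng = random.Random(0)
freqs = [rng.uniform(0.1, 0.9) for _ in range(20000)]   # hypothetical allele frequencies
parent, child = simulate_parent_child(freqs, rng)

# Exact genotypes: the estimate should be near the theoretical value 0.25.
phi_full = kinship(parent, child, freqs)

# Shrunken "dosages" with r = 0.5: the estimate is scaled by r**2.
r = 0.5
phi_shrunk = kinship(shrink(parent, freqs, r), shrink(child, freqs, r), freqs)

print(f"exact genotypes:  {phi_full:.3f}")
print(f"shrunken dosages: {phi_shrunk:.3f}")
```

Under this toy model both samples' deviations from 2p are scaled by `r`, so the estimate is scaled by `r**2`. This shows why treating uncertain genotypes as if they were exact can make related pairs look less related than they are.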