Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA 02115, USA.
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA.
Bioinformatics. 2017 Jul 1;33(13):1972-1979. doi: 10.1093/bioinformatics/btx109.
To minimize the effects of genetic confounding in high-throughput genetic association studies, such as whole-genome sequencing (WGS) studies and genome-wide association studies (GWAS), we propose a general framework to assess and formally test for genetic heterogeneity among study subjects. Because the approach fully utilizes the information on recent ancestry captured by rare variants, it is especially powerful in WGS studies. Even for moderate sample sizes, the proposed testing framework can identify study subjects that are genetically too similar (e.g. cryptic relatedness) or genetically too different (e.g. population substructure). The approach is computationally fast enough to be applied to whole-genome sequencing data and is straightforward to implement.
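To illustrate why rare variants are informative about recent shared ancestry, the following is a minimal sketch of a rare-allele-weighted pairwise sharing score. It is an assumption made for illustration, not the statistic or the API of the stego package: the function name `pairwise_sharing_scores`, the weighting scheme and the toy data are all hypothetical. Each variant is weighted by the inverse of the fraction of haplotype pairs that share its minor allele, so the scores average exactly 1 across all pairs and unusually large values point to pairs that share more rare alleles than expected.

```python
from itertools import combinations


def pairwise_sharing_scores(haplotypes):
    """Return {(i, j): score} for every pair of haplotypes.

    haplotypes: list of equal-length lists of 0/1, where 1 means the
    haplotype carries the minor allele at that variant.
    Hypothetical illustration only; not the stego package's statistic.
    """
    n = len(haplotypes)
    n_pairs = n * (n - 1) / 2
    n_variants = len(haplotypes[0])

    # Weight each variant by the inverse of the fraction of pairs sharing it,
    # so that a shared rare allele counts more than a shared common one.
    # Variants carried by fewer than two haplotypes cannot be shared: skip them.
    weights = []
    for k in range(n_variants):
        count = sum(h[k] for h in haplotypes)
        if count >= 2:
            sharing_pairs = count * (count - 1) / 2
            weights.append((k, n_pairs / sharing_pairs))

    if not weights:
        raise ValueError("no variant is carried by at least two haplotypes")

    scores = {}
    for i, j in combinations(range(n), 2):
        total = sum(w for k, w in weights if haplotypes[i][k] and haplotypes[j][k])
        scores[(i, j)] = total / len(weights)
    return scores


# Toy data (made up): haplotypes 0 and 1 share two doubleton alleles.
haplotypes = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
scores = pairwise_sharing_scores(haplotypes)
most_similar_pair = max(scores, key=scores.get)
print(most_similar_pair, scores[most_similar_pair])
```

In this toy example the pair that shares two doubletons stands out against the average score of 1. A formal test, as proposed in the paper, would additionally require the distribution of such a score under the null hypothesis of a homogeneous sample, which this sketch does not attempt to derive.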
Simulation studies illustrate the overall performance of our approach. In an application to the 1000 Genomes Project, we outline an analysis and cleaning pipeline that uses our approach to formally assess whether study subjects are related and whether population substructure is present. In this analysis, our approach revealed subjects who are most likely related but had previously passed standard QC filters.
An implementation of our method, Similarity Test for Estimating Genetic Outliers (STEGO), is available on GitHub at https://github.com/dschlauch/stego as the R package stego.
Supplementary data are available at Bioinformatics online.