Ecological Genetics Research Unit, Research Programme in Organismal and Evolutionary Biology, Faculty of Biological and Environmental Sciences, Department of Biosciences, University of Helsinki, Helsinki, Finland.
Mol Ecol Resour. 2018 Jul;18(4):809-824. doi: 10.1111/1755-0998.12893. Epub 2018 May 7.
Genomewide association studies (GWAS) aim to identify genetic markers strongly associated with quantitative traits by utilizing linkage disequilibrium (LD) between candidate genes and markers. However, because nearby genetic markers are also in LD with one another, standard GWAS approaches typically detect many correlated SNPs spanning long genomic regions, which makes corrections for multiple testing overly conservative. Additionally, the high dimensionality of modern GWAS data poses considerable challenges for computationally intensive GWAS procedures such as permutation tests. We propose a cluster-based GWAS approach that first divides the genome into many large nonoverlapping windows and then uses linkage disequilibrium network analysis in combination with principal component (PC) analysis as dimensionality-reduction tools to summarize the SNP data as independent PCs within clusters of loci connected by high LD. We then introduce single- and multilocus models that can efficiently conduct the association tests on such high-dimensional data. The methods can be adapted to different model structures and used to analyse samples collected from the wild or from biparental F2 populations, which are commonly used in ecological genetics mapping studies. We demonstrate the performance of our approaches with two publicly available data sets from a plant (Arabidopsis thaliana) and a fish (Pungitius pungitius), as well as with simulated data.
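The pipeline described in the abstract (windows, LD clusters, PCs within clusters, association tests on the PCs) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: windows are defined by a fixed SNP count, LD clusters are approximated as connected components of the graph whose edges join SNP pairs with r^2 at or above a fixed threshold (a simplification of LD network analysis), all SNPs are polymorphic, and the association test is a plain per-PC correlation with a permutation-based threshold rather than the single- and multilocus models of the paper. All function names, parameters and the simulated data are hypothetical.

```python
import numpy as np

def ld_clusters(window, r2_threshold=0.5):
    """Connected components of the graph linking SNPs with r^2 >= threshold."""
    m = window.shape[1]
    r2 = np.atleast_2d(np.corrcoef(window, rowvar=False)) ** 2
    adjacent = r2 >= r2_threshold
    labels = -np.ones(m, dtype=int)
    n_clusters = 0
    for start in range(m):
        if labels[start] >= 0:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:  # depth-first search over high-LD neighbours
            node = stack.pop()
            for nb in np.flatnonzero(adjacent[node]):
                if labels[nb] < 0:
                    labels[nb] = n_clusters
                    stack.append(nb)
        n_clusters += 1
    return [np.flatnonzero(labels == k) for k in range(n_clusters)]

def cluster_pcs(cluster_genotypes, var_explained=0.9):
    """PC scores of one cluster, keeping enough PCs to reach var_explained."""
    centered = cluster_genotypes - cluster_genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, var_explained)) + 1, len(s))
    return u[:, :k] * s[:k]

def summarize_genome(genotypes, window_size=50, r2_threshold=0.5, var_explained=0.9):
    """Return the PC matrix and, for each PC, the SNP indices of its cluster."""
    pcs, owners = [], []
    for start in range(0, genotypes.shape[1], window_size):
        window = genotypes[:, start:start + window_size]
        for snps in ld_clusters(window, r2_threshold):
            scores = cluster_pcs(window[:, snps], var_explained)
            pcs.append(scores)
            owners.extend([start + snps] * scores.shape[1])
    return np.hstack(pcs), owners

def pc_correlations(pcs, phenotype):
    """Pearson correlation between each PC and the phenotype."""
    x = pcs - pcs.mean(axis=0)
    y = phenotype - phenotype.mean()
    return (x.T @ y) / (np.linalg.norm(x, axis=0) * np.linalg.norm(y))

def permutation_threshold(pcs, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide |r| threshold from the max statistic under permutation."""
    rng = np.random.default_rng(seed)
    max_abs = [np.abs(pc_correlations(pcs, rng.permutation(phenotype))).max()
               for _ in range(n_perm)]
    return float(np.quantile(max_abs, 1 - alpha))

def simulate_blocks(n=200, n_blocks=4, block_size=10, causal_block=1,
                    effect=1.5, seed=0):
    """Toy data: each block holds noisy copies of one latent genotype."""
    rng = np.random.default_rng(seed)
    latent = rng.binomial(2, 0.5, size=(n, n_blocks))
    blocks = []
    for b in range(n_blocks):
        noise = rng.binomial(2, 0.5, size=(n, block_size))
        flip = rng.random((n, block_size)) < 0.05  # 5% of calls decoupled
        blocks.append(np.where(flip, noise, latent[:, [b]]))
    genotypes = np.hstack(blocks).astype(float)
    phenotype = effect * latent[:, causal_block] + rng.normal(size=n)
    return genotypes, phenotype

# Toy run: 40 SNPs in two windows of 20, causal signal in SNPs 10-19.
genotypes, phenotype = simulate_blocks()
pcs, owners = summarize_genome(genotypes, window_size=20)
r = pc_correlations(pcs, phenotype)
threshold = permutation_threshold(pcs, phenotype)
for j in np.flatnonzero(np.abs(r) > threshold):
    print(f"PC {j}: |r| = {abs(r[j]):.2f}, "
          f"SNPs {owners[j].min()}-{owners[j].max()}")
```

Because the permutation step operates on a handful of PCs rather than on every SNP, each permutation is cheap, which is the computational benefit the abstract points to. In this sketch the number of tests drops from 40 SNPs to roughly one or two PCs per LD cluster.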