Department of Epidemiology and Biostatistics, Imperial College London, London, United Kingdom.
PLoS One. 2012;7(5):e34861. doi: 10.1371/journal.pone.0034861. Epub 2012 May 2.
The genome-wide association study (GWAS) approach has discovered hundreds of genetic variants associated with diseases and quantitative traits. However, despite clinical overlap and statistical correlation between many phenotypes, GWAS are generally performed one phenotype at a time. Here we compare the performance of modelling multiple phenotypes jointly with that of the standard univariate approach. We introduce a new method and software, MultiPhen, that models multiple phenotypes simultaneously in a fast and interpretable way. By performing ordinal regression, MultiPhen tests the linear combination of phenotypes most associated with the genotypes at each SNP, and thus potentially captures effects hidden to single-phenotype GWAS. We demonstrate via simulation that this approach provides a dramatic increase in power in many scenarios. Power increases both for variants that affect multiple phenotypes and for those that affect only one phenotype. While other multivariate methods have similar power gains, we describe several benefits of MultiPhen over these. In particular, we demonstrate that other multivariate methods that assume the genotypes are normally distributed, such as canonical correlation analysis (CCA) and MANOVA, can have highly inflated type-1 error rates when testing case-control or non-normal continuous phenotypes, while MultiPhen produces no such inflation. To test the performance of MultiPhen on real data, we applied it to lipid traits in the Northern Finland Birth Cohort 1966 (NFBC1966). In these data MultiPhen discovers 21% more independent SNPs with known associations than the standard univariate GWAS approach, while applying MultiPhen in addition to the standard approach provides a 37% increase in discovery.
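The joint test described above can be illustrated with a minimal sketch. It is not the MultiPhen software: it assumes a proportional-odds (ordinal logistic) model with the genotype (0, 1, or 2 minor alleles) as the outcome and the phenotypes as predictors, and it assumes a likelihood-ratio test of all phenotype coefficients being zero. The function names (`neg_loglik`, `joint_test`), the simulated data, and all simulation settings are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import chi2


def neg_loglik(params, g, X):
    """Negative log-likelihood of a proportional-odds model for g in {0, 1, 2}."""
    # Two ordered cut-points: the second is the first plus a positive gap.
    c1 = params[0]
    c2 = c1 + np.exp(params[1])
    # Linear combination of phenotypes (all zeros when X has no columns).
    eta = X @ params[2:]
    # Cumulative probabilities P(G <= 0) and P(G <= 1).
    p_le0 = expit(c1 - eta)
    p_le1 = expit(c2 - eta)
    probs = np.column_stack([p_le0, p_le1 - p_le0, 1.0 - p_le1])
    chosen = probs[np.arange(len(g)), g]
    return -np.sum(np.log(np.clip(chosen, 1e-12, None)))


def joint_test(g, X):
    """Likelihood-ratio test that no phenotype in X is associated with g."""
    n, k = X.shape
    # Null model: cut-points only, no phenotype columns.
    null = minimize(neg_loglik, np.zeros(2), args=(g, np.empty((n, 0))), method="BFGS")
    # Full model: cut-points plus one coefficient per phenotype.
    full = minimize(neg_loglik, np.zeros(2 + k), args=(g, X), method="BFGS")
    stat = max(2.0 * (null.fun - full.fun), 0.0)
    return stat, chi2.sf(stat, df=k), full.x[2:]


# Simulated example (assumed settings): one SNP, two correlated phenotypes,
# only the first of which is affected by the genotype.
rng = np.random.default_rng(0)
n = 2000
genotypes = rng.binomial(2, 0.3, size=n)
phenotypes = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
phenotypes[:, 0] += 0.3 * genotypes

stat, p_value, beta = joint_test(genotypes, phenotypes)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.2e}, coefficients = {beta}")
```

The fitted coefficients in `beta` define the linear combination of phenotypes most associated with the genotype under this model. Treating the genotype as the outcome avoids assuming that the phenotypes, or the genotype, are normally distributed.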
The most associated linear combinations of the lipids estimated by MultiPhen at the leading SNPs accurately reflect the Friedewald formula, suggesting that MultiPhen could be used to refine the definition of existing phenotypes or uncover novel heritable phenotypes.
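For context, the Friedewald formula estimates LDL cholesterol as a fixed linear combination of total cholesterol, HDL cholesterol, and triglycerides (in mg/dL): LDL = TC - HDL - TG/5. The short sketch below writes it both as a function and as a weight vector, the form in which it can be compared with an estimated linear combination of lipids. The names `friedewald_ldl`, `FRIEDEWALD_WEIGHTS`, and `linear_combination`, and the example values, are illustrative assumptions and do not come from the study.

```python
def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL cholesterol (mg/dL) with the Friedewald formula."""
    return total_chol - hdl - triglycerides / 5.0


# The same formula expressed as weights on (TC, HDL, TG), all in mg/dL.
FRIEDEWALD_WEIGHTS = {"TC": 1.0, "HDL": -1.0, "TG": -0.2}


def linear_combination(weights, values):
    """Weighted sum of lipid measurements keyed by the same names."""
    return sum(weights[name] * values[name] for name in weights)


# Hypothetical lipid panel, used only to show that both forms agree.
panel = {"TC": 200.0, "HDL": 50.0, "TG": 150.0}
ldl_direct = friedewald_ldl(panel["TC"], panel["HDL"], panel["TG"])
ldl_weighted = linear_combination(FRIEDEWALD_WEIGHTS, panel)
print(ldl_direct, ldl_weighted)
```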