Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada.
PLoS One. 2012;7(8):e41694. doi: 10.1371/journal.pone.0041694. Epub 2012 Aug 8.
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals: first, to identify which genes or genomic regions are associated, and second, to discriminate associated variants from background noise within each region. Over the last few years, many new methods have been developed that associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data, namely ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO, are able to detect associations with rare genetic variants. These approaches have been used extensively in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1,998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants contribute most to the model fit, so both goals of rare variant analysis can be achieved simultaneously with regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.
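To make the idea of regularized regression on rare variants concrete, the following is a minimal sketch of ridge regression, one of the methods named in the abstract, applied to simulated data. It is not the authors' code or simulation design. The minor allele frequencies, effect sizes, penalty value, and all function names are assumptions chosen for illustration, and the genotypes are randomly generated rather than taken from the sequenced genes. Only the sample size of 1,998 echoes the abstract.

```python
import random


def simulate_genotypes(n_individuals, mafs, rng):
    """Draw minor-allele counts (0, 1 or 2) for each individual and variant."""
    return [[(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
            for _ in range(n_individuals)]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ridge_fit(x_rows, y, penalty):
    """Ridge coefficients on centred data: (X'X + penalty * I)^-1 X'y."""
    n, p = len(x_rows), len(x_rows[0])
    col_means = [sum(row[j] for row in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in x_rows]
    yc = [v - y_mean for v in y]
    xtx = [[sum(r[j] * r[k] for r in xc) + (penalty if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yc[i] for i, r in enumerate(xc)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical setup: 8 rare variants (MAF 1%), of which the first two are causal.
rng = random.Random(0)
mafs = [0.01] * 8
true_effects = [2.0, 2.0] + [0.0] * 6
genotypes = simulate_genotypes(1998, mafs, rng)
phenotype = [sum(b * g for b, g in zip(true_effects, row)) + rng.gauss(0.0, 1.0)
             for row in genotypes]

# One joint fit yields a region-level model and a per-variant ranking.
coefficients = ridge_fit(genotypes, phenotype, penalty=1.0)
ranked = sorted(range(len(coefficients)),
                key=lambda j: abs(coefficients[j]), reverse=True)
print("Variants ranked by |coefficient|:", ranked)
```

Because every variant gets its own coefficient in a single joint model, the same fit can support a region-level test and a ranking of individual variants, which is the sense in which the abstract says both goals can be addressed at once. In practice the penalty would be tuned, for example by cross-validation, rather than fixed at an arbitrary value as it is here.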