University of Queensland Diamantina Institute, The Translational Research Institute, Brisbane, Queensland 4102, Australia.
Genetics. 2013 Nov;195(3):1117-28. doi: 10.1534/genetics.113.153221. Epub 2013 Sep 11.
Principal components analysis has been employed in gene expression studies to correct for population substructure and batch and environmental effects. This method typically involves the removal of variation contained in as many as 50 principal components (PCs), which can constitute a large proportion of total variation present in the data. Each PC, however, can detect many sources of variation, including gene expression networks and genetic variation influencing transcript levels. We demonstrate that PCs generated from gene expression data can simultaneously contain both genetic and nongenetic factors. From heritability estimates we show that all PCs contain a considerable portion of genetic variation while nongenetic artifacts such as batch effects were associated to varying degrees with the first 60 PCs. These PCs demonstrate an enrichment of biological pathways, including core immune function and metabolic pathways. The use of PC correction in two independent data sets resulted in a reduction in the number of cis- and trans-expression QTL detected. Comparisons of PC and linear model correction revealed that PC correction was not as efficient at removing known batch effects and had a higher penalty on genetic variation. Therefore, this study highlights the danger of eliminating biologically relevant data when employing PC correction in gene expression data.
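The PC correction discussed above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' pipeline: it assumes a samples-by-genes expression matrix, simulates one genetic factor and one batch factor with hypothetical effect sizes, and removes the top k PCs by projecting them out with an SVD. All function names (`remove_top_pcs`, `variance_in_top_pcs`, `mean_abs_corr`, `simulate`) and parameter values are illustrative assumptions.

```python
import numpy as np


def remove_top_pcs(expr, k):
    """Return the column-centered matrix with its first k PCs projected out.

    expr is assumed to be samples x genes.
    """
    centered = expr - expr.mean(axis=0)
    # Thin SVD: columns of u (scaled by s) are PC scores, rows of vt are loadings.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    top = (u[:, :k] * s[:k]) @ vt[:k]
    return centered - top


def variance_in_top_pcs(expr, k):
    """Fraction of total variance carried by the first k PCs."""
    centered = expr - expr.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())


def mean_abs_corr(x, mat):
    """Mean absolute Pearson correlation between vector x and each column of mat."""
    xc = x - x.mean()
    mc = mat - mat.mean(axis=0)
    r = (xc @ mc) / (np.linalg.norm(xc) * np.linalg.norm(mc, axis=0))
    return float(np.abs(r).mean())


def simulate(n=200, p=100, seed=0):
    """Simulated expression with one genetic and one batch factor (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    genotype = rng.binomial(2, 0.3, size=n).astype(float)  # allele dosage 0/1/2
    batch = rng.integers(0, 2, size=n).astype(float)       # two processing batches
    g_load = rng.normal(size=p)                            # per-gene genetic effect
    b_load = rng.normal(size=p)                            # per-gene batch effect
    noise = rng.normal(size=(n, p))
    expr = noise + np.outer(genotype, g_load) + np.outer(batch, b_load)
    return expr, genotype, batch


if __name__ == "__main__":
    expr, genotype, batch = simulate()
    resid = remove_top_pcs(expr, k=2)
    print("variance in top 2 PCs:", variance_in_top_pcs(expr, 2))
    print("genotype signal before:", mean_abs_corr(genotype, expr))
    print("genotype signal after: ", mean_abs_corr(genotype, resid))
    print("batch signal before:   ", mean_abs_corr(batch, expr))
    print("batch signal after:    ", mean_abs_corr(batch, resid))
```

In this toy setting the genetic factor affects many transcripts at once, so it loads on the leading PCs together with the batch factor, and removing those PCs shrinks both signals. A linear model that regresses out only the known batch covariate would, by construction, leave the genotype association in place, which is the contrast the abstract draws.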