Gene set enrichment analysis for non-monotone association and multiple experimental categories.
Author information
Lin Rongheng, Dai Shuangshuang, Irwin Richard D, Heinloth Alexandra N, Boorman Gary A, Li Leping
Affiliation
Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27713, USA.
Publication information
BMC Bioinformatics. 2008 Nov 14;9:481. doi: 10.1186/1471-2105-9-481.
BACKGROUND
Recently, microarray data analyses using functional pathway information, e.g., gene set enrichment analysis (GSEA) and significance analysis of function and expression (SAFE), have gained recognition as a way to identify biological pathways/processes associated with a phenotypic endpoint. In these analyses, a local statistic is used to assess the association between the expression level of a gene and the value of a phenotypic endpoint. Then these gene-specific local statistics are combined to evaluate association for pre-selected sets of genes. Commonly used local statistics include t-statistics for binary phenotypes and correlation coefficients that assume a linear or monotone relationship between a continuous phenotype and gene expression level. Methods applicable to continuous non-monotone relationships are needed. Furthermore, for multiple experimental categories, methods that combine multiple GSEA/SAFE analyses are needed.
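The two-stage structure described above (a per-gene local statistic, then a combined statistic for a gene set) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GSEA or SAFE implementation: the local statistic is the absolute Pearson correlation, the global statistic is a simple difference in mean local statistic between genes inside and outside the set, and significance comes from permuting the phenotype. The function names `local_stats` and `gene_set_pvalue` are hypothetical.

```python
import numpy as np

def local_stats(expr, phenotype):
    """Absolute Pearson correlation of each gene (row of expr) with the phenotype."""
    e = expr - expr.mean(axis=1, keepdims=True)
    p = phenotype - phenotype.mean()
    r = (e @ p) / (np.linalg.norm(e, axis=1) * np.linalg.norm(p))
    return np.abs(r)

def gene_set_pvalue(expr, phenotype, in_set, n_perm=500, seed=0):
    """One-sided permutation p-value for a gene set.

    Assumption: the global statistic is the mean local statistic inside the
    set minus the mean outside it. GSEA and SAFE use other global statistics.
    """
    rng = np.random.default_rng(seed)

    def global_stat(ph):
        s = local_stats(expr, ph)
        return s[in_set].mean() - s[~in_set].mean()

    observed = global_stat(phenotype)
    # Permuting the phenotype breaks the gene-phenotype link while keeping
    # the gene-gene correlation structure intact.
    null = np.array([global_stat(rng.permutation(phenotype))
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

Here `expr` is a genes-by-samples array and `in_set` is a boolean mask over genes. Because the local statistic in this sketch is a correlation, it would miss the non-monotone relationships that the abstract goes on to address.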
RESULTS
For a continuous or ordinal phenotypic outcome, we propose to use as the local statistic the coefficient of multiple determination (i.e., the square of the multiple correlation coefficient) R² from fitting natural cubic spline models to the phenotype-expression relationship. Next, we incorporate this association measure into the GSEA/SAFE framework to identify significant gene sets. Unsigned local statistics, signed global statistics and one-sided p-values are used to reflect our inferential interest. Furthermore, we describe a procedure for inference across multiple GSEA/SAFE analyses. We illustrate our approach using gene expression and liver injury data from liver and blood samples from rats treated with eight hepatotoxicants under multiple time and dose combinations. We set out to identify biological pathways/processes associated with liver injury, as manifested by increased blood levels of alanine transaminase, that are common to most of the eight compounds. Potential statistical dependency resulting from the experimental design is addressed in permutation-based hypothesis testing.
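To make the spline-based local statistic concrete, the sketch below computes R² from a least-squares fit of a natural cubic spline. Several details are assumptions rather than the authors' stated choices: expression is treated as the response and the phenotype as the predictor, knots are placed at equally spaced quantiles of the phenotype, and the basis uses the standard truncated-power construction. The names `natural_cubic_basis` and `spline_r2` are hypothetical.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis (truncated-power form), linear beyond the end knots."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    K = len(knots)

    def d(k):
        # Scaled difference of truncated cubics relative to the last knot.
        return (np.maximum(x - knots[k], 0.0) ** 3
                - np.maximum(x - knots[K - 1], 0.0) ** 3) / (knots[K - 1] - knots[k])

    cols = [np.ones_like(x), x]
    for k in range(K - 2):
        cols.append(d(k) - d(K - 2))
    return np.column_stack(cols)

def spline_r2(expression, phenotype, n_knots=4):
    """R^2 of a natural cubic spline regression of expression on phenotype.

    Assumption: knots sit at equally spaced quantiles of the phenotype,
    with the boundary knots at its minimum and maximum.
    """
    expression = np.asarray(expression, dtype=float)
    knots = np.quantile(phenotype, np.linspace(0.0, 1.0, n_knots))
    X = natural_cubic_basis(phenotype, knots)
    coef, *_ = np.linalg.lstsq(X, expression, rcond=None)
    resid = expression - X @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((expression - expression.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A U-shaped relationship: the Pearson correlation is essentially zero,
# while the spline R^2 picks up the association.
phenotype = np.linspace(-1.0, 1.0, 41)
expression = phenotype ** 2
print("Pearson r:", np.corrcoef(phenotype, expression)[0, 1])
print("spline R^2:", spline_r2(expression, phenotype))
```

Because R² is unsigned, it measures the strength of an association but not its direction, which is consistent with the use of unsigned local statistics described above.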
CONCLUSION
The proposed framework captures both linear and non-linear association between gene expression level and a phenotypic endpoint and thus can be viewed as extending the current GSEA/SAFE methodology. The framework for combining results from multiple GSEA/SAFE analyses is flexible enough to address practical inferential interests. Our methods can be applied to microarray data with continuous phenotypes and multi-level designs, or to the meta-analysis of multiple microarray data sets.