Oron Assaf P, Jiang Zhen, Gentleman Robert
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109-1024, USA.
Bioinformatics. 2008 Nov 15;24(22):2586-91. doi: 10.1093/bioinformatics/btn465. Epub 2008 Sep 11.
Gene-set enrichment analysis (GSEA) can be greatly enhanced by linear model (regression) diagnostic techniques. Diagnostics can be used to identify outlying or influential samples, and also to evaluate model fit and explore model expansion.
We demonstrate this methodology on an adult acute lymphoblastic leukemia (ALL) dataset, using GSEA based on chromosome-band mapping of genes. Individual residuals, grouped or aggregated by chromosomal loci, indicate problematic samples and potential data-entry errors, and help identify hyperdiploidy as a factor playing a key role in expression for this dataset. Subsequent analysis pinpoints suspected DNA copy number abnormalities of specific samples and chromosomes (most prevalent are chromosomes X, 21 and 14), and also reveals significant expression differences between the hyperdiploid and diploid groups on other chromosomes (most prominently 19, 22, 3 and 13); these differences are apparently not associated with copy number.
Software for the statistical tools demonstrated in this article is available as Bioconductor package GSEAlm.
Supplementary data are available at Bioinformatics online.
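To make the diagnostic idea concrete, the following is a minimal sketch in Python with NumPy of how per-gene regression residuals might be aggregated by gene set (for example, by chromosome band) to flag unusual samples. It is not the GSEAlm API, and it is not necessarily the exact statistic used in the article. The function name `gene_set_residuals`, the aggregation rule (sum of standardized residuals divided by the square root of the set size), and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def gene_set_residuals(expr, design, membership):
    """Aggregate per-gene linear-model residuals by gene set (illustrative).

    expr       : genes x samples expression matrix
    design     : samples x p design matrix (should include an intercept)
    membership : sets x genes 0/1 incidence matrix (e.g. chromosome bands)
    Returns a sets x samples matrix of aggregated standardized residuals.
    """
    # Fit one ordinary least-squares model per gene in a single call.
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    resid = expr.T - design @ coef                    # samples x genes

    # Standardize by each gene's residual standard error.
    dof = design.shape[0] - np.linalg.matrix_rank(design)
    sigma = np.sqrt((resid ** 2).sum(axis=0) / dof)
    std_resid = resid / sigma                         # samples x genes

    # Assumed aggregation rule: sum over genes in the set / sqrt(set size).
    sizes = membership.sum(axis=1)
    return (membership @ std_resid.T) / np.sqrt(sizes)[:, None]


# Simulated data (hypothetical): 40 genes, 12 samples, two "bands".
rng = np.random.default_rng(0)
n_genes, n_samples = 40, 12
phenotype = np.repeat([0.0, 1.0], n_samples // 2)     # two-group covariate
design = np.column_stack([np.ones(n_samples), phenotype])

expr = rng.normal(scale=0.5, size=(n_genes, n_samples))
expr[:20, 3] += 3.0   # mimic a copy-number gain: band 0 genes, sample 3

membership = np.zeros((2, n_genes))
membership[0, :20] = 1    # band 0 holds genes 0..19
membership[1, 20:] = 1    # band 1 holds genes 20..39

agg = gene_set_residuals(expr, design, membership)
band, sample = np.unravel_index(np.abs(agg).argmax(), agg.shape)
print(f"largest aggregated residual: band {band}, sample {sample}")
```

In this simulation the shifted sample stands out in the affected band because the phenotype term cannot absorb a shift confined to one sample. A consistently large residual across a whole chromosomal region for a single sample is the kind of pattern the abstract associates with suspected copy number abnormalities or data-entry errors.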