Manoli Theodora, Gretz Norbert, Gröne Hermann-Josef, Kenzelmann Marc, Eils Roland, Brors Benedikt
Theoretical Bioinformatics, German Cancer Research Center, 69120 Heidelberg, Germany.
Bioinformatics. 2006 Oct 15;22(20):2500-6. doi: 10.1093/bioinformatics/btl424. Epub 2006 Aug 7.
The wide use of DNA microarrays for investigating the cell transcriptome has triggered the invention of numerous methods for processing microarray data and has led to a growing number of microarray studies that examine the same biological conditions. However, gene lists obtained by different statistical methods or from different datasets rarely agree when compared directly. We aimed to examine such discrepancies at the level of apparently affected groups of biologically related genes, e.g. metabolic or signalling pathways. This can be achieved by group testing procedures, e.g. over-representation analysis, functional class scoring (FCS), or global tests.
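As an illustration of over-representation analysis, the sketch below computes a one-sided Fisher's exact test p-value for a single pathway as the upper tail of the hypergeometric distribution. It is a minimal sketch and not the authors' implementation: the function name, its arguments, and the example counts are hypothetical, and the gene universe is assumed to be all genes measured on the array.

```python
from math import comb


def overrepresentation_p(n_universe, n_pathway, n_list, n_overlap):
    """One-sided Fisher's exact test for pathway over-representation.

    Hypothetical helper, for illustration only.
    n_universe: number of genes measured on the array (assumed background)
    n_pathway:  number of those genes annotated to the pathway
    n_list:     number of genes in the differentially expressed list
    n_overlap:  number of list genes that belong to the pathway
    Returns P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_pathway, n_list).
    """
    total = comb(n_universe, n_list)
    upper = min(n_pathway, n_list)
    # Sum the probability of every outcome at least as extreme as the observed one.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 12,000 genes on the array, 80 in the pathway,
# 400 genes in the list, 9 of them in the pathway.
p_value = overrepresentation_p(12000, 80, 400, 9)
```

In a real analysis this test is repeated for every pathway, so the resulting p-values need their own multiple testing adjustment.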
Three public prostate cancer datasets obtained with the same microarray platform (HGU95A/HGU95Av2) were analyzed. Each dataset was normalized by either variance stabilizing normalization (vsn) or mixed model normalization (MMN). Statistical analysis of microarrays was then applied to the vsn-normalized data, and mixed model analysis to the data normalized by MMN. For multiple testing adjustment, the false discovery rate was calculated and the threshold was set to 0.05. Gene lists from the same method applied to different datasets showed overlaps between 42 and 52%, while lists from different methods applied to the same dataset had between 63 and 85% of genes in common. The six gene lists obtained by applying the two statistical methods to the three datasets were then subjected to group testing by Fisher's exact test. Group testing by GSEA and the global test was also applied to the three datasets. Fisher's exact test, followed by the global test, gave more consistent results than GSEA with respect to concordance between analyses of gene lists obtained by different methods and from different datasets. However, all group testing methods identified pathways that had already been described as involved in the pathogenesis of prostate cancer. Moreover, pathways recurrently identified in these analyses are more likely to be reliable than those from a single analysis of a single dataset.
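The two quantities reported above, a false discovery rate threshold of 0.05 and the percentage overlap between gene lists, can be sketched as follows. Both parts rest on assumptions that the abstract does not confirm: the Benjamini-Hochberg step-up adjustment is used here as a common stand-in for whichever FDR estimate each method produced, and the overlap is expressed relative to the smaller of the two lists. All function names and example values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure, see lead-in)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def overlap_percent(list_a, list_b):
    """Shared genes as a percentage of the smaller list (assumed definition)."""
    a, b = set(list_a), set(list_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


# Made-up per-gene p-values from one dataset.
genes = ["G1", "G2", "G3", "G4"]
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
selected = [g for g, q in zip(genes, adjusted) if q <= 0.05]

# Made-up gene lists from two datasets analyzed with the same method.
shared = overlap_percent(["A", "B", "C", "D"], ["C", "D", "E"])
```

Other overlap definitions, such as the size of the intersection divided by the size of the union, would give different percentages for the same lists, so the definition should be stated whenever such numbers are compared.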