College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China.
BMC Bioinformatics. 2013 Apr 29;14:143. doi: 10.1186/1471-2105-14-143.
Reliability and reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The MicroArray Quality Control (MAQC) project, launched by the US Food and Drug Administration (FDA), showed that lists of DEGs generated in intra- and inter-platform comparisons can reach a high level of concordance, and that this concordance depends mainly on the statistical criteria used to rank and select DEGs. In general, combining fold change ranking with a non-stringent p-value cutoff produces reproducible DEG lists. For further interpretation of gene expression data, statistical methods of gene enrichment analysis provide powerful tools for associating DEGs with prior biological knowledge, e.g. Gene Ontology (GO) terms and pathways, and are widely used in genome-wide research. Although DEG lists generated from the same pair of compared conditions have proved reliable, reproducible enrichment results remain crucial to discovering the molecular mechanisms that differentiate the two conditions. It is therefore important to know whether enrichment results are still reproducible when the DEG lists are generated by different statistical criteria in inter-laboratory and cross-platform comparisons. In our study, we used the MAQC data sets to systematically assess the intra- and inter-platform concordance of GO terms enriched by Gene Set Enrichment Analysis (GSEA) and LRpath.
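The selection rule described above (fold change ranking combined with a non-stringent p-value cutoff) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `GeneStat` record, the `select_degs` helper, the cutoff of 0.05, the list size, and the toy values are all assumptions made for the example.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    """Per-gene summary statistics (hypothetical record for illustration)."""
    gene: str
    log2_fc: float   # log2 fold change between the two conditions
    p_value: float   # p-value from any per-gene test


def select_degs(stats, p_cutoff=0.05, top_n=100):
    """Keep genes passing a non-stringent p-value cutoff, then rank the
    survivors by absolute log2 fold change and return the top_n gene names.
    The default cutoff and list size are assumed values."""
    passing = [s for s in stats if s.p_value < p_cutoff]
    ranked = sorted(passing, key=lambda s: abs(s.log2_fc), reverse=True)
    return [s.gene for s in ranked[:top_n]]


# Toy data, invented for the example.
stats = [
    GeneStat("geneA", 2.5, 0.001),
    GeneStat("geneB", -3.1, 0.04),
    GeneStat("geneC", 4.0, 0.20),    # large fold change but fails the p-value filter
    GeneStat("geneD", 0.3, 0.0001),  # very significant but small fold change
]
top = select_degs(stats, p_cutoff=0.05, top_n=2)
print(top)
```

In this sketch the p-value acts only as a loose filter, while the ordering of the final list is determined entirely by the magnitude of the fold change.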
In intra-platform comparisons, the percentage of overlapping enriched GO terms was as high as ~80% when the input DEG lists were generated by fold change ranking and by Significance Analysis of Microarrays (SAM), whereas it decreased by about 20% when the DEG lists were generated by fold change ranking and t-test, or by SAM and t-test. Similar results were found in inter-platform comparisons.
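As a rough illustration of how an overlap percentage between two sets of enriched GO terms might be computed, consider the sketch below. The abstract does not state the exact formula, so the denominator used here (the size of the smaller set) is an assumption, as are the `percent_overlap` helper and the GO identifiers, which are arbitrary placeholders.

```python
def percent_overlap(terms_a, terms_b):
    """Percentage of shared terms relative to the smaller of the two sets.
    The choice of denominator is an assumption for this sketch; other
    definitions (e.g. relative to the union) would give different values."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


# Placeholder GO identifiers, chosen only to illustrate the calculation.
enriched_method_1 = ["GO:0006955", "GO:0008152", "GO:0006412", "GO:0007165", "GO:0006915"]
enriched_method_2 = ["GO:0006955", "GO:0008152", "GO:0006412", "GO:0007165", "GO:0016049"]

overlap = percent_overlap(enriched_method_1, enriched_method_2)
print(f"{overlap:.0f}% of enriched GO terms overlap")
```

With four of five terms shared, this toy comparison yields an overlap of 80%.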
Our results demonstrate that highly concordant DEG lists ensure highly concordant enrichment results. Importantly, when the DEG lists are generated by the straightforward method of combining fold change ranking with a non-stringent p-value cutoff, enrichment analysis produces reproducible enriched GO terms for biological interpretation.