Verhaak Roel G W, Staal Frank J T, Valk Peter J M, Lowenberg Bob, Reinders Marcel J T, de Ridder Dick
Department of Hematology, Erasmus Medical Center, Rotterdam, The Netherlands.
BMC Bioinformatics. 2006 Mar 2;7:105. doi: 10.1186/1471-2105-7-105.
Intensity values measured by Affymetrix microarrays must be normalized, which removes non-biological variation so that different microarrays can be compared, and summarized, which generates the final probe set expression values. Various pre-processing techniques, such as dChip, GCRMA, RMA and MAS, have been developed for this purpose. This study assesses the effect of applying different pre-processing methods on the results of analyses of large Affymetrix datasets. By focusing on practical applications of microarray-based research, this study provides insight into the relevance of pre-processing procedures to biology-oriented researchers.
Using two publicly available datasets, i.e., gene-expression data of 285 patients with Acute Myeloid Leukemia (AML, Affymetrix HG-U133A GeneChip) and 42 samples of tumor tissue of the embryonal central nervous system (CNS, Affymetrix HuGeneFL GeneChip), we tested the effect of the four pre-processing strategies mentioned above on (1) expression level measurements, (2) detection of differential expression, (3) cluster analysis and (4) classification of samples. For the AML dataset, the effect of pre-processing is in most cases relatively small compared to other choices made in an analysis, whereas it has a more profound effect on the outcome for the CNS dataset. Analyses on individual probe sets, such as testing for differential expression, are affected most; supervised, multivariate analyses such as classification are far less sensitive to pre-processing.
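As an illustration of how the sensitivity of differential-expression detection to pre-processing could be quantified, the following minimal sketch ranks probe sets by a Welch t-statistic under two expression matrices and measures the overlap of the top-ranked lists. It is not the procedure used in the study: the function names (`welch_t`, `top_k`, `jaccard`), the Jaccard overlap measure, and the synthetic data standing in for two pre-processing outputs are all assumptions made for this example.

```python
import random
from statistics import mean, variance


def welch_t(values, in_group):
    """Welch t-statistic for one probe set; in_group flags the samples of class A."""
    a = [v for v, g in zip(values, in_group) if g]
    b = [v for v, g in zip(values, in_group) if not g]
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def top_k(expr, in_group, k):
    """Indices of the k probe sets with the largest absolute t-statistic."""
    scores = [abs(welch_t(row, in_group)) for row in expr]
    order = sorted(range(len(expr)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(s1, s2):
    """Overlap of two non-empty index sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)


# Synthetic stand-in data (hypothetical): 200 probe sets x 20 samples on a log scale.
rng = random.Random(0)
n_probes, n_samples = 200, 20
in_group = [i < 10 for i in range(n_samples)]

method_a = [[rng.gauss(8.0, 1.0) for _ in range(n_samples)] for _ in range(n_probes)]
for row in method_a[:10]:  # make the first ten probe sets differential
    for j in range(n_samples):
        if in_group[j]:
            row[j] += 3.0

# A second "pre-processing output": the same signal plus method-specific noise.
method_b = [[v + rng.gauss(0.0, 0.3) for v in row] for row in method_a]

overlap = jaccard(top_k(method_a, in_group, 10), top_k(method_b, in_group, 10))
print(f"Top-10 overlap between the two outputs: {overlap:.2f}")
```

With real data, `method_a` and `method_b` would be replaced by the probe-set expression matrices produced by two pre-processing methods for the same samples; an overlap close to 1 would indicate that the list of top-ranked probe sets is largely insensitive to the choice of method.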
Using two experimental datasets, we show that the choice of pre-processing method has relatively little influence on the final analysis outcome of large microarray studies, whereas it can have important effects on the results of a smaller study. The data source (platform, tissue homogeneity, RNA quality) is potentially more important than the choice of pre-processing method.