Kim Ki-Yeol, Ki Dong Hyuk, Jeong Ha Jin, Jeung Hei-Cheul, Chung Hyun Cheol, Rha Sun Young
Oral Cancer Research Institute, Yonsei University College of Dentistry, Seoul, Korea.
BMC Bioinformatics. 2007 Jun 25;8:218. doi: 10.1186/1471-2105-8-218.
In microarray experiments, variability in experimental conditions, such as RNA source, microarray production, or the use of different platforms, can introduce bias. Such systematic differences are a substantial obstacle to the analysis of microarray data and lead to inconsistent and unreliable results. One of the most pressing challenges in the field is therefore how to integrate results from different microarray experiments, or how to combine data sets before the specific analysis.
Two microarray data sets based on a 17k cDNA microarray system were used, consisting of 82 normal colon mucosa and 72 colorectal cancer tissues. Each data set was prepared from either total RNA or amplified mRNA, and the effect of RNA source between the two data sets was detected with an ANOVA (analysis of variance) model. A simple integration method was introduced, based on the distributions of gene expression ratios in the different microarray data sets. The method transformed the gene expression ratios of one data set into the form of a reference data set on a gene-by-gene basis. Hierarchical clustering analysis, density and box plots, and mixture scores with correlation coefficients showed that the two data sets were well intermingled after transformation, indicating that the proposed method minimized the experimental bias. In addition, no RNA source effect was detected after the proposed transformation was applied. In the mixed data set, the two previously identified subgroups, normal and tumor, were well separated, and the efficiency of integration was more prominent in the tumor groups than in the normal groups. The transformation was slightly more effective when a data set with strong homogeneity within the same experimental group was used as the reference data set.
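The abstract does not give the exact form of the gene-by-gene transformation, so the following is only a minimal sketch under a stated assumption: each gene's expression ratios in the target data set are shifted and rescaled to match that gene's mean and standard deviation in the reference data set. The function name `transform_to_reference` and the simulated matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def transform_to_reference(target, reference):
    """Rescale each gene (row) of `target` to the reference mean and SD.

    ASSUMPTION: a per-gene location-scale match. The paper's actual
    transformation may differ; this only illustrates the general idea.
    Both inputs are genes x samples matrices of log expression ratios.
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)

    t_mean = target.mean(axis=1, keepdims=True)
    t_sd = target.std(axis=1, ddof=1, keepdims=True)
    r_mean = reference.mean(axis=1, keepdims=True)
    r_sd = reference.std(axis=1, ddof=1, keepdims=True)

    # Avoid division by zero for genes that are constant in the target set;
    # such genes end up at the reference mean.
    safe_sd = np.where(t_sd > 0, t_sd, 1.0)
    return (target - t_mean) / safe_sd * r_sd + r_mean

# Simulated data (hypothetical): 5 genes, two batches with a systematic shift.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(5, 40))  # stands in for the reference set
target = rng.normal(0.8, 1.5, size=(5, 30))     # stands in for the other RNA source
adjusted = transform_to_reference(target, reference)
```

In practice such a transformation would probably be applied within each experimental group (normal or tumor) so that the biological difference between groups is not removed along with the batch effect; that choice is also an assumption here.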
The proposed method is simple but useful for combining several data sets obtained under different experimental conditions. With this method, biologically useful information can be detected by applying various analytic methods to the combined data set, which has an increased sample size.