Kim Ki-Yeol, Ki Dong Hyuk, Jeung Hei-Cheul, Chung Hyun Cheol, Rha Sun Young
Oral Cancer Research Institute, Yonsei University College of Dentistry, Seoul, 120-752, South Korea.
BMC Bioinformatics. 2008 Jun 16;9:283. doi: 10.1186/1471-2105-9-283.
Data sets generated under different experimental conditions may yield inconsistent information even when the experiments share the same research objective. Moreover, even when data sets are generated on the same platform, their agreement may be affected by technical variation among laboratories. In such cases, the differences between the data sets should be adjusted and the combined data set used, so that more reliable information can be detected.
The proposed method first discretizes each data set based on the ranks of the gene expression ratios, then combines the discretized data sets, and finally applies a statistical method to the combined data set to select predictive genes. The efficiency of the proposed method was evaluated using five colon cancer related data sets, which were generated using cDNA microarrays with different RNA sources; one experiment used oligonucleotide arrays. NCI-60 cell line data sets, generated on two different platforms (cDNA microarrays and Affymetrix HU6800 oligonucleotide arrays), were also used. The combined data set produced by the proposed method predicted the test data sets more accurately than the separate data sets did. Biologically significant genes that were missed in the separate data sets were detected from the combined data set.
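The rank-based combination described above can be illustrated with a minimal sketch. The abstract does not specify the discretization scheme or the statistical test, so the equal-width rank bins (`n_bins`), the function names, and the mean-difference score below are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def rank_discretize(expr, n_bins=3):
    """Replace each sample's expression ratios by rank-based bin labels.

    expr: array of shape (n_genes, n_samples).
    Returns integers in 0..n_bins-1; 0 marks the lowest-ranked genes
    within a sample. The binning rule is an assumption.
    """
    expr = np.asarray(expr, dtype=float)
    n_genes = expr.shape[0]
    # Rank genes within each sample (column); ties are broken arbitrarily.
    ranks = expr.argsort(axis=0).argsort(axis=0)
    # Map ranks 0..n_genes-1 onto n_bins equal-width rank intervals.
    return (ranks * n_bins) // n_genes

def combine(datasets, n_bins=3):
    """Discretize each data set separately, then join them sample-wise."""
    return np.hstack([rank_discretize(d, n_bins) for d in datasets])

def score_genes(combined, labels):
    """Hypothetical score: absolute difference of class-wise mean bin labels."""
    labels = np.asarray(labels)
    mean_pos = combined[:, labels == 1].mean(axis=1)
    mean_neg = combined[:, labels == 0].mean(axis=1)
    return np.abs(mean_pos - mean_neg)

# Two toy data sets measuring the same three genes on very different scales.
set_a = np.array([[0.1, 0.2, 2.0, 2.5],
                  [1.0, 1.1, 0.9, 1.0],
                  [0.5, 0.4, 0.5, 0.6]])
set_b = np.array([[10.0, 300.0],
                  [100.0, 90.0],
                  [50.0, 40.0]])
labels = [0, 0, 1, 1, 0, 1]  # class labels for the samples of set_a, then set_b

combined = combine([set_a, set_b])
scores = score_genes(combined, labels)
top_gene = int(np.argmax(scores))
```

Because each data set is discretized on its own ranks before joining, the hundredfold scale difference between `set_a` and `set_b` does not affect the combined matrix. In practice the score would be replaced by whichever statistical test is appropriate for the study.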
Because gene expression values are transformed into ranks, the proposed method is not influenced by systematic bias among chips or by the choice of normalization method. The method may be especially useful for finding predictive genes from data sets whose gene expression values are on different scales.
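The robustness claim can be checked on synthetic data. The sketch below assumes that the chip-level bias acts as a strictly increasing transformation within each sample (here a hypothetical per-sample scale factor followed by a log transform); under that assumption the within-sample ranks are unchanged. Biases that reorder genes within a sample would not be removed this way.

```python
import numpy as np

def within_sample_ranks(expr):
    """Rank genes within each sample (column) of a genes-by-samples array."""
    return np.asarray(expr, dtype=float).argsort(axis=0).argsort(axis=0)

rng = np.random.default_rng(0)
ratios = rng.lognormal(size=(50, 4))  # synthetic positive expression ratios

# Hypothetical distortions: a different scale factor per chip, then log2.
scales = np.array([0.5, 1.0, 3.0, 10.0])
distorted = np.log2(ratios * scales)

# The raw values differ, but the rank representation should not.
values_differ = not np.allclose(ratios, distorted)
ranks_match = np.array_equal(within_sample_ranks(ratios),
                             within_sample_ranks(distorted))
```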