Hwang Kyu-Baek, Kong Sek Won, Greenberg Steve A, Park Peter J
School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea.
BMC Bioinformatics. 2004 Oct 25;5:159. doi: 10.1186/1471-2105-5-159.
One of the important challenges in microarray analysis is to take full advantage of previously accumulated data, both from one's own laboratory and from public repositories. Through a comparative analysis on a variety of datasets, a more comprehensive view of the underlying mechanism or structure can be obtained. However, as we discover in this work, continual changes in genomic sequence annotations and probe design criteria make it difficult to compare gene expression data even from different generations of the same microarray platform.
We first describe the extent of discordance between the results derived from two generations of Affymetrix oligonucleotide arrays, as revealed in cluster analysis and in identification of differentially expressed genes. We then propose a method for increasing comparability. The dataset we use consists of 14 human muscle biopsy samples from patients with inflammatory myopathies that were hybridized on both HG-U95Av2 and HG-U133A human arrays. We find that using the probe set matching table provided by Affymetrix for comparative analysis produces better results than matching by UniGene or LocusLink identifiers, but remains inadequate. Rescaling of expression values for each gene across samples and data filtering by expression values enhance comparability, but only for a few specific analyses. As a generic method for improving comparability, we select a subset of probes with overlapping sequence segments in the two array types and recalculate expression values based only on the selected probes. We show that this filtering of probes significantly improves comparability while retaining a sufficient number of probe sets for further analysis.
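As an illustration of the per-gene rescaling mentioned above, the following is a minimal sketch rather than the authors' implementation. It assumes that "rescaling" means standardizing each gene to zero mean and unit standard deviation across samples, which is one common choice; the abstract does not state the exact transformation. The function name `rescale_per_gene` and the genes-by-samples matrix layout are hypothetical.

```python
import numpy as np


def rescale_per_gene(expr):
    """Standardize each gene (row) across samples (columns).

    Assumption: rescaling is a per-gene z-score; the abstract does not
    specify the transformation actually used.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    # Genes with no variation would divide by zero; leave them at 0
    # after centering instead.
    sd[sd == 0] = 1.0
    return (expr - mean) / sd
```

Applied separately to the data from each array generation, this puts matched genes on a common scale before they are compared across platforms.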
Compatibility between high-density oligonucleotide arrays is significantly affected by probe-level sequence information. With a careful filtering of the probes based on their sequence overlaps, data from different generations of microarrays can be combined more effectively.
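The probe filtering step can be sketched as follows. This is a hedged illustration, not the published procedure: it assumes that two probes "overlap" when they share a contiguous stretch of at least `min_overlap` bases, and the default of 20 is an arbitrary placeholder (Affymetrix probes are 25-mers). The function names and the threshold are hypothetical; the abstract does not give the actual criterion.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous stretch shared by strings a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def select_overlapping_probes(probes_old, probes_new, min_overlap=20):
    """Return indices of probes that overlap a probe on the other array.

    probes_old / probes_new: probe sequences for one matched probe set
    on the older and newer array. min_overlap is an assumed threshold.
    """
    kept_old, kept_new = set(), set()
    for i, p in enumerate(probes_old):
        for j, q in enumerate(probes_new):
            if longest_common_substring(p, q) >= min_overlap:
                kept_old.add(i)
                kept_new.add(j)
    return sorted(kept_old), sorted(kept_new)
```

Expression values for each probe set would then be recalculated from the retained probes only, and probe sets left with too few probes would be dropped.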