Sanchez-Graillet Olivia, Rowsell Joanna, Langdon William B, Stalteri Maria, Arteaga-Salas Jose M, Upton Graham J G, Harrison Andrew P
Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, Essex, CO4 3SQ, UK.
J Integr Bioinform. 2008 Aug 25;5(2):98. doi: 10.2390/biecoll-jib-2008-98.
We have developed a computational pipeline to analyse large surveys of Affymetrix GeneChips, for example NCBI's Gene Expression Omnibus (GEO). GEO samples data for many organisms, tissues and phenotypes. Because of this experimental diversity, any observed correlation between probe intensities can be attributed either to robust biology, such as common co-expression, or to systematic biases in the GeneChip technology. Our bioinformatics pipeline integrates three steps: the mapping of probes to exons, quality control checks on each GeneChip that identify flaws in hybridization quality, and the mining of correlations in intensities between groups of probes. The output from our pipeline has enabled us to identify systematic biases in GeneChip data. We are also able to use the pipeline as a discovery tool for biology. We have discovered that, in the majority of cases, Affymetrix probesets on Human GeneChips do not measure one unique block of transcription. Instead, we see numerous examples of outlier probes. Our study has also found that in a number of probesets the mismatch probes are an informative diagnostic of expression, rather than a measure of background contamination. We report evidence for systematic biases in GeneChip technology associated with probe-probe interactions. We also see signatures associated with post-transcriptional processing of RNA, such as alternative polyadenylation.
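The correlation-mining step and the notion of an outlier probe can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the median-correlation rule, the 0.5 threshold and the intensity values are all assumptions chosen for illustration. Each probe's log intensities across several GeneChips are correlated with those of the other probes in the same probeset, and a probe is flagged when its median correlation with the rest is low.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists of intensities."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant probe carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def outlier_probes(intensities, threshold=0.5):
    """Return probes whose median correlation with the other probes in
    the probeset falls below `threshold` (an illustrative cut-off)."""
    flagged = []
    for probe, values in intensities.items():
        correlations = [
            pearson(values, other_values)
            for other, other_values in intensities.items()
            if other != probe
        ]
        if statistics.median(correlations) < threshold:
            flagged.append(probe)
    return flagged


# Hypothetical log2 intensities for one probeset across six GeneChips.
probeset = {
    "p1": [7.0, 8.1, 9.0, 7.5, 8.8, 9.4],
    "p2": [6.8, 8.0, 9.2, 7.4, 8.9, 9.5],
    "p3": [7.1, 8.3, 8.9, 7.6, 8.7, 9.3],
    "p4": [9.0, 7.2, 8.5, 9.1, 7.0, 8.0],  # does not track the others
}

print(outlier_probes(probeset))
```

In this toy probeset, p1 to p3 rise and fall together across the chips while p4 does not, so only p4 is flagged. A real analysis would first exclude chips that fail quality control, and would need to consider whether a low correlation reflects biology (for example a probe mapping to a different exon) or a technical bias.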