Langdon William B, Upton Graham J G, Harrison Andrew P
Department of Mathematical Sciences and Department of Biological Sciences, University of Essex, Wivenhoe Park, Colchester, Essex CO4 3SQ, UK.
Brief Bioinform. 2009 May;10(3):259-77. doi: 10.1093/bib/bbp018. Epub 2009 Apr 8.
The reliable interpretation of Affymetrix GeneChip data is a multi-faceted problem. The interplay between biophysics, bioinformatics and mining of GeneChip surveys is leading to new insights into how best to analyse the data. Many of the molecular processes occurring on the surfaces of GeneChips result from the high surface density of probes. Interactions between neighbouring probes affect their rate and strength of hybridization to targets. Competing targets may hybridize to the same probe, and targets may partially bind to more than one probe. The formation of these partial hybrids results in a number of probes not reaching thermodynamic equilibrium during hybridization. Moreover, some targets fold up, or cross-hybridize to other targets. Furthermore, probes may fold and can undergo chemical saturation. There are also sequence-dependent differences in the rates of target desorption during the washing stage. Improvements in the mappings between probe sequence and biological databases are leading to more accurate gene expression profiles. Moreover, algorithms that combine the intensities of multiple probes into single measures of expression are increasingly dependent upon models of the hybridization processes occurring on GeneChips. The large repositories of GeneChip data can be searched for systematic effects across many experiments. This data mining has led to the discovery of a family of thousands of probes that show correlated expression across thousands of GeneChip experiments. These probes contain runs of guanines, suggesting that G-quadruplexes are able to form on GeneChips. We discuss the impact of these structures on the interpretation of data from GeneChip experiments.
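As an illustration of the kind of sequence screen that the guanine-run observation suggests, the following minimal sketch flags probes whose sequence contains a long run of consecutive guanines. It is not the authors' method: the function names, the example probe identifiers and sequences, and the threshold of four consecutive guanines are all assumptions made for this example, since the abstract does not state a run length.

```python
import re
from typing import Dict, List


def longest_g_run(sequence: str) -> int:
    """Return the length of the longest run of consecutive G bases."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def flag_g_run_probes(probes: Dict[str, str], min_run: int = 4) -> List[str]:
    """Return IDs of probes whose longest G run is at least min_run.

    min_run=4 is an assumed threshold, not a value taken from the paper.
    """
    return [pid for pid, seq in probes.items() if longest_g_run(seq) >= min_run]


# Hypothetical 25-mer probe sequences, invented for illustration only.
example_probes = {
    "probe_a": "ATCGGGGTACCTAGCATCGATCGAT",
    "probe_b": "ATCGATCGATCGATCGATCGATCGA",
    "probe_c": "GGGGGATCATCATCATCATCATCAT",
}

flagged = flag_g_run_probes(example_probes)
print(flagged)
```

In practice the probe sequences would come from the array's probe annotation files, and any flagged probes could then be examined separately or excluded when summarising expression.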