Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, 825 NE 13th Street, Oklahoma City, Oklahoma 73104-5005, USA.
BMC Bioinformatics. 2011 Oct 18;12 Suppl 10(Suppl 10):S2. doi: 10.1186/1471-2105-12-S10-S2.
Microarray experiments are becoming increasingly common in biomedical research, as is their deposition in publicly accessible repositories such as the Gene Expression Omnibus (GEO). As a result, interest has surged in using these microarray data for meta-analyses, whether to increase sample size for a more powerful analysis of a specific disease (e.g. lung cancer) or to re-examine experiments for purposes different from those of the original study that generated them. For the average biomedical researcher, there are a number of practical barriers to conducting such meta-analyses, such as manually aggregating, filtering and formatting the data. Methods that automatically process large repositories of microarray data into a standardized, directly comparable format will enable easier and more reliable access to these data for meta-analysis.
We present a simple method, robust to potential outliers, for automatic quality control and pre-processing of tens of thousands of single-channel microarray data files. GEO GDS files are quality checked by comparing parametric distributions, then quantile normalized to enable direct comparison of expression levels in subsequent meta-analyses.
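The quantile normalization step mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `quantile_normalize`, the toy `samples` data, and the tie-handling choice (ties broken by position) are assumptions made for illustration only. Each sample's values are replaced by the mean of the values that share the same rank across all samples, so every sample ends up with an identical distribution.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length samples (one list of values per array).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    n_genes = len(columns[0])
    if any(len(col) != n_genes for col in columns):
        raise ValueError("all samples must have the same number of genes")

    # Sort each sample independently, then average across samples rank by rank
    # to obtain the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n_genes)
    ]

    normalized = []
    for col in columns:
        # Gene indices ordered by expression value; ties are broken by
        # position, which is a simplification (assumption).
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three samples (arrays), four genes each.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After normalization, sorting any sample yields the same list of values, which is what makes expression levels directly comparable across arrays; the within-sample ranking of genes is preserved.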
We processed 13,000 human 1-color experiments to create a single gene expression matrix from which subsets can be extracted to conduct meta-analyses. Interestingly, in a global meta-analysis of gene-gene co-expression patterns across all 13,000 experiments to predict gene function, normalization yielded minimal improvement over using the raw data.
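As a rough illustration of what a gene-gene co-expression analysis over such a matrix might look like, the sketch below computes pairwise Pearson correlations between gene expression profiles. The abstract does not state which co-expression measure was used, so Pearson correlation is an assumption here, as are the gene labels, the toy `expression` matrix, and the helper names `pearson` and `coexpression_pairs`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def coexpression_pairs(matrix):
    """Return {(gene1, gene2): r} for every unordered pair of genes."""
    genes = sorted(matrix)
    pairs = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            pairs[(g1, g2)] = pearson(matrix[g1], matrix[g2])
    return pairs


# Toy matrix: rows are genes (hypothetical labels), columns are experiments.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpression_pairs(expression)
```

In this toy example GENE_A and GENE_B rise together across experiments while GENE_C moves in the opposite direction. At the scale described in the abstract (about 20,813 genes across 13,000 experiments), an all-pairs loop like this would be far too slow, and a vectorized correlation over the full matrix would be the more practical choice.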
Normalization of microarray data appears to be of minimal importance for analyses based on co-expression patterns when the sample size is on the order of thousands of microarray datasets. Smaller subsets, however, are more prone to aberrations and artefacts, and effective means of automating normalization procedures not only empower meta-analytic approaches but also aid reproducibility by providing a standard way of approaching the problem.
Data availability: a matrix containing normalized expression of 20,813 genes across 13,000 experiments is available for download at . Source code for GDS file pre-processing is available from the authors upon request.