Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, 825 NE 13th Street, Oklahoma City, Oklahoma 73104-5005, USA.
BMC Bioinformatics. 2011 Oct 18;12 Suppl 10(Suppl 10):S14. doi: 10.1186/1471-2105-12-S10-S14.
Global meta-analysis (GMA) of microarray data to identify genes with highly similar co-expression profiles is emerging as an accurate method to predict gene function and phenotype, even in the absence of published data on the gene(s) being analyzed. With a third of human genes still uncharacterized, this approach is a promising way to direct experiments and rapidly understand the biological roles of genes. To predict function for a gene of interest, GMA relies on a guilt-by-association approach: it identifies sets of genes with known functions that are consistently co-expressed with that gene across different experimental conditions, which suggests coordinated regulation for a specific biological purpose. Our goal here is to define how sample and dataset size, and ranking parameters, affect prediction performance.
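The guilt-by-association step can be illustrated with a minimal sketch. It assumes Pearson correlation as the co-expression measure, which the abstract does not specify, and it uses hypothetical gene names and toy expression values. The function names `pearson` and `top_coexpressed` are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)

def top_coexpressed(query, expression, k):
    """Rank all other genes by correlation with `query`; return the top k."""
    scores = [
        (pearson(expression[query], profile), gene)
        for gene, profile in expression.items()
        if gene != query
    ]
    scores.sort(reverse=True)  # highest correlation first
    return [gene for _, gene in scores[:k]]

# Toy data (hypothetical): one value per gene for each of five experiments.
expression = {
    "GENE_X": [1.0, 2.0, 3.0, 4.0, 5.0],  # uncharacterized gene of interest
    "GENE_A": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with GENE_X
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated with GENE_X
    "GENE_C": [1.0, 3.0, 2.0, 5.0, 4.0],  # loosely follows GENE_X
}
neighbors = top_coexpressed("GENE_X", expression, k=2)
print(neighbors)
```

The known functions of the returned neighbors would then serve as candidate functions for the query gene.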
13,000 human 1-color microarrays were downloaded from GEO for GMA. Prediction performance was benchmarked by calculating the distance within the Gene Ontology (GO) tree between predicted function and annotated function for sets of 100 randomly selected genes. We find that the number of new predicted functions rises as more datasets are added, but begins to saturate at a sample size of approximately 2,000 experiments. For the gene set used to predict function, precision is higher with smaller set sizes but recall is correspondingly poor; as set size increases, recall and F-measure also tend to increase, but at the cost of precision.
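The precision/recall trade-off described above can be made concrete with a short sketch. It is a simplification: it scores a prediction as correct only on an exact match with an annotated term, whereas the study benchmarks by distance within the GO tree. The term labels are hypothetical placeholders, not real GO identifiers.

```python
def precision_recall_f(predicted, annotated):
    """Return (precision, recall, F-measure) for two sets of GO terms."""
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)  # predictions that match annotations
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical annotated functions for one benchmark gene.
annotated = {"GO:A", "GO:B", "GO:C", "GO:D"}

# A short prediction list: every prediction is right, most annotations missed.
small = precision_recall_f({"GO:A"}, annotated)

# A longer prediction list: more annotations recovered, more false positives.
large = precision_recall_f(
    {"GO:A", "GO:B", "GO:C", "GO:X", "GO:Y", "GO:Z"}, annotated
)

print(small)
print(large)
```

In this toy case the short list has perfect precision and low recall, while the longer list has lower precision but higher recall and F-measure, which mirrors the direction of the trend reported for increasing gene set size.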
Of the 20,813 genes expressed in 50 or more experiments, at least one predicted GO category was found for 72.5% of them. Of the 5,720 genes without GO annotation, 4,189 had at least one predicted ontology when the top 40 co-expressed genes were used for prediction analysis. Of the remaining 1,531 genes without GO predictions or annotations, ~17% (257 genes) had sufficient co-expression data yet no statistically significantly overrepresented ontologies, suggesting their regulation may be more complex.
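One common way to test whether a GO category is overrepresented among a gene's top co-expressed neighbors is a one-sided hypergeometric test. The abstract does not state which test was used, so the sketch below is an assumption, not the authors' method. The population size (20,813) and neighbor count (40) come from the text above; the category counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, annotated_in_population,
                     sample, annotated_in_sample):
    """P(X >= annotated_in_sample) when `sample` genes are drawn without
    replacement from `population` genes, of which
    `annotated_in_population` carry the GO category."""
    total = comb(population, sample)
    upper = min(sample, annotated_in_population)
    tail = sum(
        comb(annotated_in_population, i)
        * comb(population - annotated_in_population, sample - i)
        for i in range(annotated_in_sample, upper + 1)
    )
    return tail / total

# From the text: 20,813 expressed genes, top 40 co-expressed genes.
# Hypothetical: 200 genes carry the category, 5 of them are among the 40.
p_value = hypergeom_pvalue(20813, 200, 40, 5)
print(p_value)
```

With about 0.4 category members expected among 40 neighbors by chance, observing 5 gives a small p-value. A gene like the 257 noted above would be one for which no category reaches significance under whichever test and threshold are applied.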