Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA.
Bioinformatics. 2014 Jun 15;30(12):1698-706. doi: 10.1093/bioinformatics/btu110. Epub 2014 Feb 25.
Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets.
We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters, computed for a range of cluster sizes over multiple bootstrap-resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database (MSigDB) collections with microarray gene expression data, the EMVC algorithm accurately filters out annotations unrelated to the experimental outcome, resulting in increased gene set enrichment power and better replication of enrichment results.
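The filtering idea can be illustrated with a minimal sketch. This is not the published EMVC implementation: the abstract does not state the exact entropy measure, clustering method, or filtering rule, so everything below is an assumption made for illustration. The sketch uses Shannon entropy of a gene set's cluster-membership distribution, Ward hierarchical clustering from SciPy, a greedy rule that drops the set's smallest clusters while a minimum fraction of annotations remains, and a simple vote across bootstrap resamples and cluster counts. The names `shannon_entropy`, `filter_by_clusters`, `emvc_sketch`, `min_keep`, and `keep_freq` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a vector of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def filter_by_clusters(set_genes, labels, min_keep=0.5):
    """Assumed rule: greedily drop the gene set's smallest clusters while
    entropy falls and at least `min_keep` of the set is retained."""
    set_genes = np.asarray(set_genes)
    set_labels = labels[set_genes]
    ids, counts = np.unique(set_labels, return_counts=True)
    active = np.ones(len(ids), dtype=bool)
    for i in np.argsort(counts)[:-1]:  # smallest first, never the largest
        trial = active.copy()
        trial[i] = False
        too_small = counts[trial].sum() < min_keep * len(set_genes)
        no_gain = shannon_entropy(counts[trial]) >= shannon_entropy(counts[active])
        if too_small or no_gain:
            break
        active = trial
    return set_genes[np.isin(set_labels, ids[active])]


def emvc_sketch(expr, set_genes, cluster_counts=(2, 3, 4), n_boot=20,
                min_keep=0.5, keep_freq=0.5, seed=0):
    """Keep genes retained in at least `keep_freq` of all
    (bootstrap resample, cluster count) runs. `expr` is genes x samples."""
    rng = np.random.default_rng(seed)
    set_genes = np.asarray(set_genes)
    votes = np.zeros(expr.shape[0])
    runs = 0
    for _ in range(n_boot):
        # Bootstrap: resample the samples (columns) with replacement.
        cols = rng.integers(0, expr.shape[1], size=expr.shape[1])
        tree = linkage(expr[:, cols], method="ward")
        for k in cluster_counts:
            # Cut the tree into at most k disjoint gene clusters.
            labels = fcluster(tree, t=k, criterion="maxclust")
            votes[filter_by_clusters(set_genes, labels, min_keep)] += 1
            runs += 1
    return set_genes[votes[set_genes] / runs >= keep_freq]


# Simulated data: genes 0-19 share one expression pattern, genes 20-59 are noise.
rng = np.random.default_rng(1)
pattern = np.tile([3.0, -3.0], 10)  # 20 samples
signal = pattern + 0.1 * rng.standard_normal((20, 20))
noise = rng.standard_normal((40, 20))
expr = np.vstack([signal, noise])

# A gene set with 10 pattern-related genes and 5 unrelated annotations.
gene_set = list(range(10)) + list(range(40, 45))
filtered = emvc_sketch(expr, gene_set)
print(filtered)
```

In this toy setting the five noise genes fall into small clusters separate from the pattern-related genes, so the greedy rule removes them in every run. The `min_keep` floor is what stops the filter from collapsing every set to a single cluster, since dropping a set's smallest cluster always lowers this entropy.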