Fogel Paul, Young S Stanley, Hawkins Douglas M, Ledirac Nathalie
Consultant, 4 rue Le Goff, F-75005, Paris, France.
Bioinformatics. 2007 Jan 1;23(1):44-9. doi: 10.1093/bioinformatics/btl550. Epub 2006 Nov 8.
Modern methods such as microarrays, proteomics and metabolomics often produce datasets with many more predictor variables than observations. Research in these areas is often exploratory; even so, there is interest in statistical methods that accurately point to effects that are likely to replicate. Correlations among predictors are used to improve the statistical analysis. We exploit two ideas: non-negative matrix factorization, which creates ordered sets of predictors; and sequential statistical testing within each ordered set, which removes the need for multiple-testing correction within the set.
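The two ideas can be illustrated with a minimal sketch. It is not the authors' SAS JMP script and makes several assumptions: the factorization uses the standard Lee-Seung multiplicative updates, each predictor is assigned to the factor on which it loads most heavily, and the sequential test is a plain fixed-sequence procedure that stops at the first non-significant predictor. The function names `nmf`, `ordered_sets` and `sequential_test`, and the per-predictor p-values passed in, are hypothetical.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (observations x predictors) as W @ H.

    Assumption: Lee-Seung multiplicative updates for the Frobenius norm,
    used here only as a generic stand-in for the factorization step.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, p)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def ordered_sets(H):
    """Turn factor loadings H (factors x predictors) into ordered sets.

    Assumption: a predictor belongs to the factor with its largest loading,
    and within a set predictors are ordered by decreasing loading.
    """
    owner = H.argmax(axis=0)
    sets = []
    for f in range(H.shape[0]):
        members = np.flatnonzero(owner == f)
        sets.append(members[np.argsort(-H[f, members])])
    return sets


def sequential_test(p_values, order, alpha=0.05):
    """Fixed-sequence testing within one ordered set.

    Each predictor is tested at the full level alpha, in the given order,
    and testing stops at the first p-value above alpha. No adjustment for
    multiplicity is applied within the set.
    """
    rejected = []
    for j in order:
        if p_values[j] > alpha:
            break
        rejected.append(int(j))
    return rejected
```

With p-values `[0.01, 0.2, 0.03, 0.001]` and the order `[3, 0, 1, 2]`, `sequential_test` rejects predictors 3 and 0 and then stops at predictor 1; predictor 2 is never tested even though its p-value is below 0.05. This illustrates why the quality of the ordering matters for power.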
Simulations and theory point to increased statistical power. Computational algorithms are described in detail. The analysis and biological interpretation of a real dataset are given. In addition to the increased power, the benefit of our method is that the organized gene lists are likely to lead to a better understanding of the biology.
A SAS JMP executable script is available from http://www.niss.org/irMF