Klebanov Lev, Yakovlev Andrei
Stat Appl Genet Mol Biol. 2006;5:Article9. doi: 10.2202/1544-6115.1185. Epub 2006 Mar 24.
One of the prevailing ideas in the literature on microarray data analysis is to pool the expression measures across genes and treat them as a sample drawn from some distribution. Several universal laws were proposed to analytically describe this distribution. This idea raises a number of concerns. The expression levels of genes are not identically distributed random variables so that treating them as a sample amounts to sampling from a mixture of equally weighted distributions, each being associated with a different gene. The expression levels of different genes are heavily dependent random variables so that the law of large numbers and statistical goodness-of-fit tests are normally inapplicable to this kind of data. This dependence represents a very serious pitfall in microarray data analysis.
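The point about dependence and the law of large numbers can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the abstract: a toy equicorrelation model in which every gene on an array shares one latent factor, with a hypothetical function name `pooled_mean_variance` and parameter `rho` for the pairwise correlation. Under this model the variance of the mean pooled over m genes is rho + (1 - rho)/m, so it shrinks toward zero only when rho = 0.

```python
import numpy as np

def pooled_mean_variance(n_genes, rho, n_arrays=2000, seed=0):
    """Variance, across simulated arrays, of the mean expression pooled over genes.

    Toy model (an assumption for illustration, not the authors' method):
    each array holds n_genes standardized expression values that share one
    latent factor, so every pair of genes has correlation rho.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_arrays, 1))       # one factor per array
    noise = rng.standard_normal((n_arrays, n_genes))  # gene-specific noise
    expr = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    # Pool across genes within each array, then measure the spread across arrays.
    return expr.mean(axis=1).var()

if __name__ == "__main__":
    for m in (10, 100, 1000):
        independent = pooled_mean_variance(m, rho=0.0)
        dependent = pooled_mean_variance(m, rho=0.5)
        print(f"genes={m:5d}  var(rho=0)={independent:.4f}  var(rho=0.5)={dependent:.4f}")
```

With independent genes the pooled mean concentrates as more genes are added; with correlated genes its variance levels off near rho, which is why adding genes does not make the pooled "sample" behave like a large independent sample.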