Goeman Jelle J, Bühlmann Peter
Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands.
Bioinformatics. 2007 Apr 15;23(8):980-7. doi: 10.1093/bioinformatics/btm051. Epub 2007 Feb 15.
Many statistical tests have been proposed in recent years for analyzing gene expression data in terms of gene sets, usually from Gene Ontology. These methods are based on widely different methodological assumptions. Some approaches test differential expression of each gene set against differential expression of the rest of the genes, whereas others test each gene set on its own. Also, some methods are based on a model in which the genes are the sampling units, whereas others treat the subjects as the sampling units. This article aims to clarify the assumptions behind different approaches and to indicate a preferential methodology of gene set testing.
We identify some crucial assumptions which are needed by the majority of methods. P-values derived from methods that use a model which takes the genes as the sampling unit are easily misinterpreted, as they are based on a statistical model that does not resemble the biological experiment actually performed. Furthermore, because these models are based on a crucial and unrealistic independence assumption between genes, the P-values derived from such methods can be wildly anti-conservative, as a simulation experiment shows. We also argue that methods that competitively test each gene set against the rest of the genes create an unnecessary rift between single gene testing and gene set testing.
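The anti-conservatism described above can be illustrated with a small null simulation. The sketch below is not the authors' simulation experiment. It is a minimal illustration under assumed settings: 20 subjects in two equal groups, 200 genes, a gene set of 20 genes that share a latent factor (pairwise correlation rho = 0.5), and no differential expression anywhere. The gene-sampling test is represented here by a two-sample z-test on per-gene mean differences (set versus rest, genes treated as independent units), and the subject-sampling test by a permutation of the group labels. All function names and parameter values are hypothetical choices made for this example.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_null_dataset(rng, n_subjects, n_genes, set_size, rho):
    """Subjects x genes matrix with no group effect.

    The first `set_size` genes form the gene set and share a latent
    factor, giving them pairwise correlation `rho` (assumed value).
    """
    data = []
    for _ in range(n_subjects):
        shared = rng.gauss(0.0, 1.0)
        row = []
        for g in range(n_genes):
            noise = rng.gauss(0.0, 1.0)
            if g < set_size:
                row.append(math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
            else:
                row.append(noise)
        data.append(row)
    return data


def gene_sampling_p(data, labels, set_size):
    """Genes as sampling units: z-test of per-gene mean differences,
    gene set versus the rest, assuming genes are independent."""
    group1 = [row for row, lab in zip(data, labels) if lab == 1]
    group0 = [row for row, lab in zip(data, labels) if lab == 0]
    diffs = [
        mean([r[g] for r in group1]) - mean([r[g] for r in group0])
        for g in range(len(data[0]))
    ]
    in_set, rest = diffs[:set_size], diffs[set_size:]
    pooled = (
        (len(in_set) - 1) * sample_variance(in_set)
        + (len(rest) - 1) * sample_variance(rest)
    ) / (len(diffs) - 2)
    z = (mean(in_set) - mean(rest)) / math.sqrt(
        pooled * (1.0 / len(in_set) + 1.0 / len(rest))
    )
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value


def subject_sampling_p(data, labels, set_size, rng, n_perm):
    """Subjects as sampling units: permute group labels and recompute
    the absolute group difference of the per-subject gene set mean."""
    scores = [sum(row[:set_size]) / set_size for row in data]

    def statistic(labs):
        g1 = [s for s, lab in zip(scores, labs) if lab == 1]
        g0 = [s for s, lab in zip(scores, labs) if lab == 0]
        return abs(mean(g1) - mean(g0))

    observed = statistic(labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if statistic(permuted) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


def false_positive_rates(n_reps=200, n_subjects=20, n_genes=200, set_size=20,
                         rho=0.5, n_perm=199, alpha=0.05, seed=1):
    """Fraction of null datasets rejected at level alpha by each test."""
    rng = random.Random(seed)
    labels = [0] * (n_subjects // 2) + [1] * (n_subjects - n_subjects // 2)
    gene_hits = 0
    subject_hits = 0
    for _ in range(n_reps):
        data = simulate_null_dataset(rng, n_subjects, n_genes, set_size, rho)
        gene_hits += gene_sampling_p(data, labels, set_size) <= alpha
        subject_hits += subject_sampling_p(data, labels, set_size, rng, n_perm) <= alpha
    return gene_hits / n_reps, subject_hits / n_reps


if __name__ == "__main__":
    gene_rate, subject_rate = false_positive_rates()
    print(f"gene-sampling false positive rate:    {gene_rate:.3f}")
    print(f"subject-sampling false positive rate: {subject_rate:.3f}")
```

Because no gene is differentially expressed, both rejection rates should stay near the nominal 0.05 if the tests are valid. Under these assumed settings the gene-sampling rate is expected to lie well above that level, since the shared factor makes the set average far more variable than the independence assumption allows, while the permutation test keeps subjects exchangeable and should remain close to nominal. Setting rho to 0 removes the correlation and should bring the two rates together.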