Comparison of statistical data models for identifying differentially expressed genes using a generalized likelihood ratio test.

Authors

Seng Kok-Yong, Glenny Robb W, Madtes David K, Spilker Mary E, Vicini Paolo, Gharib Sina A

Affiliation

Department of Bioengineering, University of Washington, Seattle, Washington, USA.

Publication information

Gene Regul Syst Bio. 2008;2:125-139. doi: 10.4137/grsb.s381.

DOI:10.4137/grsb.s381

PMID:19119428

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2613008/

Abstract

Currently, statistical techniques for analysis of microarray-generated data sets have deficiencies due to limited understanding of errors inherent in the data. A generalized likelihood ratio (GLR) test based on an error model has been recently proposed to identify differentially expressed genes from microarray experiments. However, the use of different error structures under the GLR test has not been evaluated, nor has this method been compared to commonly used statistical tests such as the parametric t-test. The concomitant effects of varying data signal-to-noise ratio and replication number on the performance of statistical tests also remain largely unexplored. In this study, we compared the effects of different underlying statistical error structures on the GLR test's power in identifying differentially expressed genes in microarray data. We evaluated such variants of the GLR test as well as the one sample t-test based on simulated data by means of receiver operating characteristic (ROC) curves. Further, we used bootstrapping of ROC curves to assess statistical significance of differences between the areas under the curves. Our results showed that i) the GLR tests outperformed the t-test for detecting differential gene expression, ii) the identity of the underlying error structure was important in determining the GLR tests' performance, and iii) signal-to-noise ratio was a more important contributor than sample replication in identifying statistically significant differential gene expression.
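The evaluation strategy described in the abstract (scoring simulated genes with competing tests, comparing ROC curves, and bootstrapping the difference in areas under the curves) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Gaussian simulation, the parameter values, and all function names are hypothetical, and the known-variance z-score is only a simple stand-in for a test that exploits an error model. It is not the GLR test evaluated in the paper.

```python
import math
import random
import statistics


def simulate(n_genes=200, frac_de=0.2, n_rep=3, delta=1.0, sigma=1.0, seed=0):
    """Simulate log-ratios: DE genes have mean delta, the rest mean 0 (assumed model)."""
    rng = random.Random(seed)
    n_de = int(frac_de * n_genes)
    labels = [1 if g < n_de else 0 for g in range(n_genes)]
    data = [[rng.gauss(delta if lab else 0.0, sigma) for _ in range(n_rep)]
            for lab in labels]
    return labels, data


def t_score(x):
    """Absolute one-sample t statistic against a mean of zero."""
    return abs(statistics.mean(x)) / (statistics.stdev(x) / math.sqrt(len(x)))


def z_score(x, sigma):
    """Absolute z statistic with known sigma (stand-in for an error-model test)."""
    return abs(statistics.mean(x)) * math.sqrt(len(x)) / sigma


def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pairwise formula."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=200, seed=1):
    """Percentile 95% interval for AUC(a) - AUC(b), resampling genes within class."""
    rng = random.Random(seed)
    pos_idx = [i for i, lab in enumerate(labels) if lab == 1]
    neg_idx = [i for i, lab in enumerate(labels) if lab == 0]
    diffs = []
    for _ in range(n_boot):
        # The same resampled genes are used for both scores (paired comparison).
        idx = ([rng.choice(pos_idx) for _ in pos_idx] +
               [rng.choice(neg_idx) for _ in neg_idx])
        lab = [labels[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(lab, a) - auc(lab, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    labels, data = simulate()
    t_scores = [t_score(x) for x in data]
    z_scores = [z_score(x, sigma=1.0) for x in data]
    print("AUC (t-test):", auc(labels, t_scores))
    print("AUC (z stand-in):", auc(labels, z_scores))
    print("95% interval for AUC difference (z - t):",
          bootstrap_auc_diff(labels, z_scores, t_scores))
```

If the bootstrap interval for the AUC difference excludes zero, the two scoring methods differ in discriminating power on that simulated data set. Varying `delta / sigma` and `n_rep` in this sketch loosely mirrors the signal-to-noise and replication comparison the abstract describes.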
