Balancing Type One and Two Errors in Multiple Testing for Differential Expression of Genes

Authors

Gordon Alexander, Chen Linlin, Glazko Galina, Yakovlev Andrei

Affiliation

Department of Mathematics and Statistics, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, North Carolina, U.S.A.

Publication

Comput Stat Data Anal. 2009 Mar 15;53(5):1622-1629. doi: 10.1016/j.csda.2008.04.010.

DOI:10.1016/j.csda.2008.04.010

PMID:20161303

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2699298/

Abstract

A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, F(k), of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S(*), is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure "closeness" of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S(*) is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection F(k). The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression.
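
The frequency criterion described in the abstract can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the asymmetric distance from a candidate set S to a reference list T is a weighted count, cost_fp * |S \ T| + cost_fn * |T \ S|. The abstract does not state the exact form used in the paper, and the function names and cost parameters below are hypothetical. Under that assumption the average distance separates gene by gene, so a gene that appears in n of the k lists lowers the average when included exactly if cost_fn * n > cost_fp * (k - n), that is, if its selection frequency n/k exceeds cost_fp / (cost_fp + cost_fn).

```python
from collections import Counter


def asymmetric_distance(candidate, reference, cost_fp=1.0, cost_fn=1.0):
    """Assumed distance: weighted count of genes in one set but not the other.

    Genes in `candidate` but not in `reference` are treated as false
    positives (type I), genes in `reference` but not in `candidate` as
    false negatives (type II). This weighting is an illustrative assumption.
    """
    candidate, reference = set(candidate), set(reference)
    false_positives = len(candidate - reference)
    false_negatives = len(reference - candidate)
    return cost_fp * false_positives + cost_fn * false_negatives


def average_distance(candidate, gene_lists, cost_fp=1.0, cost_fn=1.0):
    """Average assumed distance from `candidate` to every list in the collection."""
    total = sum(
        asymmetric_distance(candidate, ref, cost_fp, cost_fn) for ref in gene_lists
    )
    return total / len(gene_lists)


def frequency_rule(gene_lists, cost_fp=1.0, cost_fn=1.0):
    """Include a gene when cost_fn * n > cost_fp * (k - n).

    n is the number of lists containing the gene and k the number of lists.
    Ties (equal cost either way) are resolved by excluding the gene.
    """
    k = len(gene_lists)
    # Count each gene at most once per list.
    counts = Counter(gene for genes in gene_lists for gene in set(genes))
    return {gene for gene, n in counts.items() if cost_fn * n > cost_fp * (k - n)}


if __name__ == "__main__":
    # Hypothetical gene lists, e.g. one per subsample.
    lists = [
        {"g1", "g2", "g3"},
        {"g1", "g2"},
        {"g1", "g4"},
        {"g1", "g2", "g5"},
    ]
    print(sorted(frequency_rule(lists)))            # ['g1', 'g2']
    print(sorted(frequency_rule(lists, 3.0, 1.0)))  # ['g1']
```

With equal costs the rule reduces to a majority vote across the k lists; raising cost_fp relative to cost_fn makes inclusion stricter, and lowering it makes inclusion more permissive.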
