Cui Xinping, Wilson Jason
Department of Statistics, University of California, Riverside, CA 92521, USA.
Biom J. 2008 Oct;50(5):870-83. doi: 10.1002/bimj.200710457.
One frontier of modern statistical research concerns data sets with an extremely large number of populations (k > 1000), such as microarray and neuroimaging data. For many such problems, the focus shifts from testing for significance to selecting, filtering, or screening. Classical ranking and selection methodology (RSM) studies the probability of correct selection (PCS): the probability that the "best" (t = 1) of k populations is truly selected, according to a specified criterion of best. This paper extends and adapts two selection goals from the RSM literature that are suitable for large-k problems, d-best and G-best selection. It then shows how estimating PCS for the selection of multiple (t > 1) populations under d-best and G-best selection provides a useful measure of the quality of a given selection. A simulation study and an application to a benchmark microarray data set show that the proposed method is an effective and versatile tool for assessing the probability that a particular gene selection or gene filtering step truly obtains the best genes. Moreover, the method is fully general and may be applied to any such extremely large-k problem.
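The idea of PCS for selecting t > 1 populations can be illustrated with a small Monte Carlo sketch. This is not the estimator proposed in the paper, which works from observed data. Here the true means are assumed known, so that correct selection can be checked directly. The function name `simulate_pcs`, its parameters, and the normal sampling model are all hypothetical. The d-best and G-best criteria are coded under an assumed reading: a selection is d-best if every selected population has a true mean within d of the t-th largest true mean, and G-best if every selected population is among the true top G. The paper's exact definitions may differ.

```python
import numpy as np


def simulate_pcs(true_means, t, n_obs, sigma=1.0, d=0.0, G=None,
                 n_rep=2000, seed=0):
    """Monte Carlo PCS when the top t of k sample means are selected.

    Assumed model (illustrative only): each population's sample mean is
    normal around its true mean with standard error sigma / sqrt(n_obs).
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    k = true_means.size
    G = t if G is None else G
    if not (1 <= t <= G <= k):
        raise ValueError("need 1 <= t <= G <= k")

    order = np.argsort(true_means)[::-1]       # indices, best first
    top_t = set(order[:t].tolist())            # true top-t set
    in_top_G = np.zeros(k, dtype=bool)
    in_top_G[order[:G]] = True                 # membership in true top G
    threshold = true_means[order[t - 1]] - d   # assumed d-best cutoff
    se = sigma / np.sqrt(n_obs)

    hits = {"exact": 0, "d_best": 0, "G_best": 0}
    for _ in range(n_rep):
        xbar = true_means + rng.normal(0.0, se, size=k)
        # Indices of the t largest sample means (order within them is irrelevant).
        selected = np.argpartition(xbar, k - t)[k - t:]
        hits["exact"] += set(selected.tolist()) == top_t
        hits["d_best"] += bool(np.all(true_means[selected] >= threshold))
        hits["G_best"] += bool(np.all(in_top_G[selected]))

    return {name: count / n_rep for name, count in hits.items()}


if __name__ == "__main__":
    # Hypothetical large-k setting: 20 populations with elevated means, 1980 nulls.
    means = np.concatenate([np.linspace(1.0, 2.0, 20), np.zeros(1980)])
    print(simulate_pcs(means, t=20, n_obs=10, d=0.5, G=40))
```

Under these assumed definitions, the relaxed criteria can never score lower than exact selection, which is why they are more informative when k is very large and exact recovery of the top t is rare.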