Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan.
Comput Math Methods Med. 2013;2013:860673. doi: 10.1155/2013/860673. Epub 2013 Apr 4.
Large-p-small-n datasets are commonly encountered in modern biomedical studies. When the goal is to detect differences between two groups, conventional methods often fail: the t-test suffers from unstable variance estimates, and AUC (area under the receiver operating characteristic curve) estimates contain a high proportion of tied values. The significance analysis of microarrays (SAM) may also be unsatisfactory, since its performance is sensitive to a tuning parameter whose selection is not straightforward. In this work, we propose a robust rerank approach to overcome these difficulties. In particular, we obtain a rank-based statistic for each feature based on the concept of "rank-over-variable." The techniques of "random subset" and "rerank" are then applied iteratively to rank the features, and the leading features are selected for further study. The proposed rerank approach is especially suited to large-p-small-n datasets. Moreover, it is insensitive to the selection of tuning parameters, which is an appealing property for practical implementation. Simulation studies and a real data analysis of pooling-based genome-wide association (GWA) studies demonstrate the usefulness of our method.
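The abstract names three ingredients (rank-over-variable, random subset, and rerank) without giving formulas, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that "rank-over-variable" means ranking the features within each sample, that the per-feature statistic is the absolute difference in mean scaled rank between the two groups, and that "rerank" means recomputing those ranks inside each random feature subset and averaging over rounds. The function names and the parameters `subset_size` and `n_rounds` are hypothetical.

```python
import numpy as np

def rank_over_variables(X):
    """Rank the features within each sample (row); 1 = smallest value.

    Ties are broken arbitrarily by argsort, which is adequate for
    continuous data but is an assumption of this sketch.
    """
    return np.argsort(np.argsort(X, axis=1), axis=1) + 1

def rerank_scores(X, y, subset_size, n_rounds, rng):
    """Average a rank-based two-group statistic over random feature subsets.

    X: (n_samples, p) array; y: 0/1 group labels of length n_samples.
    Assumed statistic: |mean scaled rank in group 1 - mean in group 0|.
    """
    n, p = X.shape
    totals = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(n_rounds):
        # Random subset: draw features without replacement.
        idx = rng.choice(p, size=subset_size, replace=False)
        # Rerank: recompute within-sample ranks using only this subset.
        R = rank_over_variables(X[:, idx]) / subset_size
        stat = np.abs(R[y == 1].mean(axis=0) - R[y == 0].mean(axis=0))
        totals[idx] += stat
        counts[idx] += 1
    # Features never drawn keep a score of 0.
    return totals / np.maximum(counts, 1)

# Toy large-p-small-n example: 6 samples per group, 500 features,
# with the first 5 features shifted upward in group 1 (simulated data).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 6)
X = rng.normal(size=(12, 500))
X[y == 1, :5] += 3.0
scores = rerank_scores(X, y, subset_size=100, n_rounds=300, rng=rng)
leading = np.argsort(scores)[::-1][:10]  # indices of the 10 leading features
print(leading)
```

Because the statistic uses only within-sample ranks, it needs no per-feature variance estimate, which is the instability the abstract attributes to the t-test. How the result depends on `subset_size` and `n_rounds` in the authors' actual procedure is not stated in the abstract and is not claimed here.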