Calle M Luz, Urrea Victor, Boulesteix Anne-Laure, Malats Nuria
Systems Biology Department, University of Vic, Spain. malu.calle@uvic.cat
Hum Hered. 2011;72(2):121-32. doi: 10.1159/000330778. Epub 2011 Oct 11.
Genomic profiling, the use of genetic variants at multiple loci simultaneously for the prediction of disease risk, requires the selection of a set of genetic variants that best predicts disease status. The goal of this work was to provide a new selection algorithm for genomic profiling.
We propose a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF). The proposed strategy implements a backward elimination process based on the initial ranking of variables.
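The backward elimination idea can be sketched as follows. This is a minimal illustration, not the AUCRF package's implementation: the function name `backward_elimination`, the `drop_frac` parameter (fraction of lowest-ranked variables removed per step), and the toy scoring function are all assumptions made here. In a real analysis `score_fn` would fit a random forest on the given subset and return its AUC.

```python
import math

def backward_elimination(ranked_vars, score_fn, drop_frac=0.2, min_vars=1):
    """Backward elimination over a fixed initial ranking.

    ranked_vars: variable names, most important first (hypothetical input;
                 in practice taken from an initial random forest ranking).
    score_fn:    callable mapping a list of variables to a score to maximize
                 (assumed here to stand in for the AUC of a random forest).
    drop_frac:   assumed fraction of lowest-ranked variables dropped per step.
    """
    current = list(ranked_vars)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_vars:
        # Drop at least one variable per step, always from the bottom of the ranking.
        n_drop = max(1, math.floor(len(current) * drop_frac))
        keep = max(min_vars, len(current) - n_drop)
        current = current[:keep]
        score = score_fn(current)
        if score >= best_score:  # ties favor the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Toy stand-in for "AUC of a random forest": rewards signal, penalizes noise.
signal = {"snp1", "snp2", "snp3"}

def toy_score(subset):
    hits = len(signal & set(subset))
    noise = len(subset) - hits
    return 0.5 + 0.1 * hits - 0.01 * noise

ranked = ["snp1", "snp2", "snp3"] + [f"noise{i}" for i in range(7)]
best, score = backward_elimination(ranked, toy_score)
print(best, round(score, 2))
```

Because the ranking is computed once and then held fixed, each step only requires refitting and rescoring on a shorter prefix of the ranked list.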
We demonstrate the advantage of using the AUC instead of the classification error as a measure of predictive accuracy of RF. In particular, we show that the use of the classification error is especially inappropriate when dealing with unbalanced data sets. The new procedure for variable selection and prediction, namely AUC-RF, is illustrated with data from a bladder cancer study and also with simulated data. The algorithm is publicly available as an R package, named AUCRF, at http://cran.r-project.org/.
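A small constructed example may help show why the classification error can mislead on unbalanced data. The data below are invented for illustration (5 cases, 95 controls), and the helper names `auc` and `error_rate` are assumptions, not part of the AUCRF package. The AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of case and control scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def error_rate(labels, scores, threshold=0.5):
    """Classification error after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

# Invented unbalanced data set: 5 cases (label 1) and 95 controls (label 0).
labels = [1] * 5 + [0] * 95

# Predictor A: uninformative, assigns the same score to everyone.
constant = [0.0] * 100

# Predictor B: ranks every case above every control, but all scores fall below 0.5.
informative = [0.4] * 5 + [0.1] * 95

for name, scores in [("constant", constant), ("informative", informative)]:
    print(name, "error:", error_rate(labels, scores), "AUC:", auc(labels, scores))
```

Both predictors misclassify only the 5 cases, so their classification error is identical, yet one carries no information about disease status while the other ranks cases and controls perfectly. Only the AUC separates them.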