Azevedo Costa Marcelo, de Souza Rodrigues Thiago, da Costa André Gabriel Fc, Natowicz René, Pádua Braga Antônio
1 Department of Industrial Engineering, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil.
2 Computer Department, Centro Federal de Educação Tecnológica Minas Gerais, Brazil.
Stat Methods Med Res. 2017 Apr;26(2):997-1020. doi: 10.1177/0962280214566262. Epub 2015 Jan 9.
This work proposes a sequential methodology for selecting variables in classification problems in which the number of predictors is much larger than the sample size. The methodology includes a Monte Carlo permutation procedure that conditionally tests the null hypothesis of no association between the outcomes and the available predictors. To improve computational efficiency, we propose a new parametric distribution, the Truncated and Zero Inflated Gumbel Distribution. The final application is to find compact classification models with improved performance for genomic data. Results on real data sets show that the proposed methodology selects compact models with optimized classification performance.
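The Monte Carlo permutation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it is a plain, unconditional permutation test of "no association between a binary outcome and any predictor", and it does not include the sequential selection step or the Truncated and Zero Inflated Gumbel approximation. The function names `max_abs_mean_diff` and `permutation_p_value`, the choice of test statistic (largest absolute difference in class means), and the defaults `n_perm` and `seed` are all assumptions made for illustration.

```python
import random


def max_abs_mean_diff(X, y):
    """Largest absolute difference in class means across all predictors.

    X is a list of samples (each a list of predictor values) and y is a
    list of 0/1 labels. The statistic is an illustrative choice, not the
    one used in the paper.
    """
    idx1 = [i for i, label in enumerate(y) if label == 1]
    idx0 = [i for i, label in enumerate(y) if label == 0]
    best = 0.0
    for j in range(len(X[0])):
        m1 = sum(X[i][j] for i in idx1) / len(idx1)
        m0 = sum(X[i][j] for i in idx0) / len(idx0)
        best = max(best, abs(m1 - m0))
    return best


def permutation_p_value(X, y, n_perm=999, seed=0):
    """Monte Carlo p-value for the null of no outcome/predictor association.

    Labels are shuffled n_perm times; the p-value is the share of
    permuted statistics at least as large as the observed one, with the
    usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = max_abs_mean_diff(X, y)
    labels = list(y)  # copy so the caller's labels are not modified
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if max_abs_mean_diff(X, labels) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

Taking the maximum over all predictors keeps the test valid when predictors far outnumber samples, because each permutation recomputes the maximum over the same set of predictors. The cost grows with `n_perm`, which is the kind of burden a parametric approximation to the permutation distribution is meant to reduce.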