Khalili Abbas, Lin Shili
Department of Mathematics and Statistics, McGill University, Montreal, Quebec, Canada H3A 2K6.
Biometrics. 2013 Jun;69(2):436-46. doi: 10.1111/biom.12020. Epub 2013 Apr 4.
Feature (variable) selection has become a fundamentally important problem in recent statistical literature. In applications, many variables are often introduced to reduce possible modeling biases, but the number of variables a model can accommodate is limited by the amount of data available. In other words, the number of variables considered depends on the sample size, which reflects the estimability of the parametric model. In this article, we consider the problem of feature selection in finite mixture of regression models when the number of parameters in the model can increase with the sample size. We propose a penalized likelihood approach for feature selection in these models. Under certain regularity conditions, our approach leads to consistent variable selection. We carry out extensive simulation studies to evaluate the performance of the proposed approach under controlled settings. We also apply the proposed method to two real data sets. The first concerns telemonitoring of Parkinson's disease (PD), where the question is whether dysphonic features extracted from patients' speech signals recorded at home can be used as surrogates to study PD severity and progression. The second concerns breast cancer prognosis, where the interest lies in assessing whether cell nuclear features offer prognostic value for the long-term survival of breast cancer patients. Our analysis in each application revealed a mixture structure in the study population and uncovered a distinct relationship between the features and the response variable in each mixture component.
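For orientation, a penalized likelihood criterion for a finite mixture of regression models can be sketched generically as follows. The abstract does not state the exact model or penalty, so everything in this sketch is an assumption made for illustration: the Gaussian component densities, the notation ($K$ components, mixing proportions $\pi_k$, coefficient vectors $\beta_k$, variances $\sigma_k^2$, and $d_n$ features with $d_n$ allowed to grow with the sample size $n$), the weighting of the penalty by $\pi_k$, and the generic sparsity penalty $p_{\lambda_{nk}}$.

```latex
% Assumed mixture density: K Gaussian regression components
f(y \mid x; \Psi) = \sum_{k=1}^{K} \pi_k \, \phi\!\left(y;\, x^{\top}\beta_k,\, \sigma_k^{2}\right),
\qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1 .

% Log-likelihood over n observations (x_i, y_i)
\ell_n(\Psi) = \sum_{i=1}^{n} \log f(y_i \mid x_i; \Psi) .

% Penalized log-likelihood: a sparsity penalty on each component's coefficients
\tilde{\ell}_n(\Psi) = \ell_n(\Psi)
  - \sum_{k=1}^{K} \pi_k \sum_{j=1}^{d_n} p_{\lambda_{nk}}\!\left(\lvert \beta_{kj} \rvert\right) .
```

Under a criterion of this kind, maximizing $\tilde{\ell}_n$ shrinks some coefficients $\beta_{kj}$ exactly to zero, so a feature can be selected in one mixture component and dropped in another. This matches the component-specific relationships described in the abstract.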