Huang Grace T, Tsamardinos Ioannis, Raghu Vineet, Kaminski Naftali, Benos Panayiotis V
Department of Computational and Systems Biology, and Joint CMU-Pitt PhD Program in Computational Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA 15260, USA.
Pac Symp Biocomput. 2015;20:431-42.
Feature selection is used extensively in biomedical research for biomarker identification and patient classification, both of which are essential steps in developing personalized medicine strategies. However, the structured nature of biological datasets and the high correlation among variables frequently yield multiple equally optimal signatures, making traditional feature selection methods unstable. Features selected in one cohort of patients may not perform as well in another cohort. In addition, biologically important features may be missed because other co-clustered features are selected in their place. We propose a new method, Tree-guided Recursive Cluster Selection (T-ReCS), for efficient selection of grouped features. T-ReCS significantly improves predictive stability while maintaining the same level of accuracy. Unlike group lasso, T-ReCS does not require a priori knowledge of the clusters, and it can also handle "orphan" features (those not belonging to any cluster). T-ReCS can be used with categorical or survival target variables. Tested on simulated and real expression data from breast cancer and lung diseases, as well as on survival data, T-ReCS selected stable cluster features without significant loss in classification accuracy.
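The instability described above can be illustrated with a small simulation. The sketch below is not T-ReCS and is not taken from the paper. It is a hypothetical toy setup (the function names, cluster sizes, noise levels, and the top-k correlation filter are all assumptions) that draws highly correlated feature clusters, selects the top two features on bootstrap resamples, and compares how consistent the selections are at the level of individual features and at the level of the clusters they belong to. Cluster membership is known here only because the data are simulated.

```python
import numpy as np
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(sets):
    """Average Jaccard similarity over all pairs of selected sets."""
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def simulate(rng, n_samples=100, n_clusters=5, cluster_size=10):
    """Toy data (assumed design): each feature is its cluster's latent factor plus noise."""
    latent = rng.normal(size=(n_samples, n_clusters))
    noise = rng.normal(size=(n_samples, n_clusters * cluster_size))
    X = np.repeat(latent, cluster_size, axis=1) + 0.3 * noise
    # Binary target driven by the latent factors of clusters 0 and 1.
    y = (latent[:, 0] + latent[:, 1] + 0.5 * rng.normal(size=n_samples) > 0).astype(float)
    cluster_of = np.repeat(np.arange(n_clusters), cluster_size)
    return X, y, cluster_of


def top_k_features(X, y, k=2):
    """Simple filter (an assumption, not the paper's method): top-k by |Pearson correlation|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return set(np.argsort(-np.abs(corr))[:k].tolist())


def stability(seed=0, n_boot=30, k=2):
    """Return (feature-level, cluster-level) selection stability over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    X, y, cluster_of = simulate(rng)
    feature_sets, cluster_sets = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample of patients
        feats = top_k_features(X[idx], y[idx], k)
        feature_sets.append(feats)
        cluster_sets.append({int(cluster_of[f]) for f in feats})
    return mean_pairwise_jaccard(feature_sets), mean_pairwise_jaccard(cluster_sets)


feature_stability, cluster_stability = stability()
print(f"feature-level stability: {feature_stability:.2f}")
print(f"cluster-level stability: {cluster_stability:.2f}")
```

With two selected features per resample, the cluster-level Jaccard similarity of any pair of selections cannot fall below the feature-level one, so the second number is never smaller than the first. How large the gap is depends on the assumed correlation structure and noise.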