Department of Epidemiology, Biostatistics and Occupational Health, Montréal, QC, Canada.
BMC Bioinformatics. 2009 Dec 10;10:410. doi: 10.1186/1471-2105-10-410.
In high-density arrays, the identification of relevant genes for disease classification is complicated not only by the curse of dimensionality but also by the highly correlated nature of the array data. In this paper, we address the question of how many and which genes should be selected for disease class prediction. We propose a Bayesian supervised statistical learning approach that refines gene signatures using a regularization that penalizes correlation between the selected variables.
Our simulation results show that our approach recovers the correct subset of class-predictive genes more often than other methods, even when accuracy and subset size remain the same. On real microarray datasets, we show that our approach can refine gene signatures to obtain the same or better predictive performance than existing methods while using fewer genes.
Our novel Bayesian approach includes a prior that penalizes highly correlated features in model selection and is able to extract key genes in the highly correlated context of microarray data. The methodology is described in the context of microarray data, but can be applied to any array data (such as microRNA) as a first step towards predictive modeling of cancer pathways. A user-friendly software implementation of the method is available.
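The abstract does not give the functional form of the correlation-penalizing prior, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a hypothetical unnormalized log-prior equal to minus a strength parameter times the sum of absolute pairwise Pearson correlations among the selected genes. The function names, the `strength` parameter, and the toy expression values are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_log_prior(expression, subset, strength=1.0):
    """Hypothetical unnormalized log-prior for a gene subset.

    Assumed form (not taken from the paper):
        -strength * sum over gene pairs in the subset of |r_ij|
    so subsets made of mutually correlated genes score lower.
    """
    return -strength * sum(
        abs(pearson(expression[g], expression[h]))
        for g, h in combinations(subset, 2)
    )


# Toy expression profiles across five samples (made-up values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # a scaled copy of geneA
    "geneC": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with geneA
}

redundant = correlation_log_prior(expression, ["geneA", "geneB"])
diverse = correlation_log_prior(expression, ["geneA", "geneC"])
print(redundant, diverse)
```

Under this assumed form, the redundant pair (geneA, geneB) receives a lower prior score than the uncorrelated pair (geneA, geneC), which is the behavior the abstract attributes to the prior: among subsets with similar predictive value, those with less internal correlation are favored.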