Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1649-62. doi: 10.1109/TCBB.2012.105.
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, genes that share a biological pathway can be highly correlated. Moreover, gene expression data sets may contain outliers arising from chemical or electrical causes. A good gene selection method should therefore take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, rather than the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is an L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is piecewise linear in the mean of each class, so its optimum can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the model parameters. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. An analysis of the biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the simultaneous selection of highly correlated genes.
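The breakpoint idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact LNB-MS objective, so the form used here is an assumption: for one gene in one class, with the data centered so that shrinkage is toward zero and the Laplace scale fixed at 1, minimize sum_i |x_i - mu| + lam * |mu|. The function name `shrunken_laplace_location` and the parameter `lam` are hypothetical and not taken from the paper. Because this objective is convex and piecewise linear in mu, a minimizer always lies at one of its kinks (the sample values and zero), so it is enough to evaluate the objective there.

```python
import numpy as np


def shrunken_laplace_location(x, lam):
    """Minimize sum_i |x_i - mu| + lam * |mu| over mu by checking breakpoints.

    Illustrative only: the objective is an assumed simplification, not the
    paper's exact LNB-MS formulation.
    """
    x = np.asarray(x, dtype=float)
    # Kinks of the piecewise linear objective: every sample value, plus zero
    # (the kink contributed by the L1 penalty).
    breakpoints = np.unique(np.append(x, 0.0))
    # Objective value at each candidate breakpoint.
    abs_dev = np.abs(x[None, :] - breakpoints[:, None]).sum(axis=1)
    values = abs_dev + lam * np.abs(breakpoints)
    # Among (numerically) tied minimizers, prefer the one closest to zero,
    # so that exact shrinkage to zero wins whenever it is optimal.
    tied = breakpoints[np.isclose(values, values.min())]
    return float(tied[np.argmin(np.abs(tied))])


samples = [1.0, 1.2, 0.9, 1.1, 8.0]  # 8.0 plays the role of an outlier
unpenalized = shrunken_laplace_location(samples, lam=0.0)
shrunk = shrunken_laplace_location(samples, lam=2.0)
removed = shrunken_laplace_location(samples, lam=10.0)
```

With `lam=0.0` the estimate is the sample median (1.1), which the outlier 8.0 barely affects; with `lam=2.0` it moves toward zero (1.0); with `lam=10.0` it is exactly zero. Under the stated assumptions, a gene whose shrunken location is zero in every class would no longer help separate the classes, which is one way an L1 penalty on the mean can yield automatic feature selection.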