Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China.
Bioinformatics. 2022 Mar 4;38(6):1542-1549. doi: 10.1093/bioinformatics/btab848.
Efficiently identifying genes based on gene expression levels has been studied to help classify different cancer types and improve prediction performance. Logistic regression with a regularization penalty is often an effective approach for simultaneously performing prediction and feature (gene) selection in high-dimensional genomic data. However, standard methods ignore biological group structure and generally result in poorer predictive models.
In this article, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection by stacking models based on the sparse group lasso penalty. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a sparse subset of feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing a single ratio, which could help to overcome the poor fit of sparse group lasso to some data. Because stacking weakens feature selection, we perform a post hoc feature selection, which might slightly reduce predictive performance but is superior in feature selection. Simulation results demonstrate that our approach achieves competitive and stable classification performance and a lower false discovery rate in feature selection across varying datasets compared with other regularization methods. In addition, our method achieves better accuracy on three public cancer datasets and identifies more strongly discriminatory genes and potential mutation genes for thyroid carcinoma.
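To illustrate the role of the mixing parameter, the following is a minimal sketch of a sparse group lasso penalty and its proximal operator, not the authors' implementation. It assumes the common parameterization in which `alpha` weights the lasso term and `1 - alpha` weights the group lasso term, with each group scaled by the square root of its size. The function names `sgl_penalty` and `sgl_prox`, the parameters `lam`, `alpha` and `step`, and the toy coefficients are all hypothetical, and the groups are assumed to partition the coefficient indices.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty with mixing parameter alpha in [0, 1].

    Assumed convention: alpha = 1 gives the plain lasso,
    alpha = 0 gives the plain group lasso.
    """
    l1 = sum(abs(b) for b in beta)
    group_term = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * norm
    return lam * (alpha * l1 + (1.0 - alpha) * group_term)


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal operator of step * sgl_penalty.

    Each coefficient is soft-thresholded (within-group sparsity),
    then each group is shrunk toward zero as a block (group sparsity).
    Assumes `groups` is a partition of the indices of `beta`.
    """
    out = [0.0] * len(beta)
    for idx in groups:
        # Lasso part: elementwise soft-thresholding.
        soft = [
            math.copysign(max(abs(beta[j]) - step * lam * alpha, 0.0), beta[j])
            for j in idx
        ]
        # Group lasso part: block shrinkage, possibly zeroing the whole group.
        norm = math.sqrt(sum(v * v for v in soft))
        thresh = step * lam * (1.0 - alpha) * math.sqrt(len(idx))
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        for j, v in zip(idx, soft):
            out[j] = scale * v
    return out


if __name__ == "__main__":
    beta = [3.0, -0.5, 0.2, 0.1]   # toy coefficients
    groups = [[0, 1], [2, 3]]      # two hypothetical gene groups
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, sgl_prox(beta, groups, lam=0.4, alpha=alpha))
```

Under this sketch, different values of `alpha` yield different sparsity patterns for the same data. A stacking approach would train one base classifier per ratio and combine their predictions with a meta-learner instead of committing to a single ratio.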
The real data underlying this article are available from https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2.
Supplementary data are available at Bioinformatics online.