College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China.
College of Information Technology, Henan University of Chinese Medicine, Zhengzhou, 450046, China.
Comput Biol Med. 2018 Sep 1;100:1-9. doi: 10.1016/j.compbiomed.2018.06.014. Epub 2018 Jun 19.
Multi-class classification has attracted much attention in cancer diagnosis and treatment, and many machine learning methods have recently emerged to address it. However, classifying lung cancer data raises two problems: class imbalance and gene selection. In this paper, an adaptive multinomial regression with a sparse overlapping group lasso penalty is proposed to perform classification and grouped gene selection for lung cancer gene expression data. An overlapping grouping strategy with biological interpretability is proposed, which highlights the importance of gene groups from the minority classes. Conditional mutual information is used to evaluate the significance of genes within each group and to construct data-driven weights. Based on the grouping strategy and the constructed weights, a regularized adaptive multinomial regression is presented and a solving algorithm is developed. The resulting method not only selects the important gene groups for each class when performing multi-class classification, but also adaptively selects the important genes within each group. Experimental results show that the proposed method significantly outperforms six other methods in classification accuracy, and that the selected genes are disease-causing genes for lung cancer.
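As an illustration of the kind of penalty named above, the following is a minimal Python sketch of a generic weighted sparse group lasso term evaluated over groups that may overlap. The abstract does not state the exact objective, so everything here is an assumption: the function name `sparse_overlapping_group_penalty`, the mixing parameter `alpha`, the strength `lam`, and the way the weights enter are all hypothetical, and the gene weights are simply passed in rather than computed from conditional mutual information. The overlap is handled by summing group norms directly, which is a simplification and may differ from the formulation used in the paper.

```python
import numpy as np

def sparse_overlapping_group_penalty(B, groups, gene_weights, group_weights,
                                     lam=1.0, alpha=0.5):
    """Generic weighted sparse group lasso penalty (illustrative only).

    B             : (n_genes, n_classes) coefficient matrix of a multinomial model.
    groups        : list of index lists; a gene may appear in several groups.
    gene_weights  : (n_genes,) non-negative per-gene weights (assumed given).
    group_weights : (n_groups,) non-negative per-group weights (assumed given).
    lam, alpha    : hypothetical overall strength and L1 / group mixing parameter.
    """
    B = np.asarray(B, dtype=float)
    gene_weights = np.asarray(gene_weights, dtype=float)

    # Within-group sparsity: weighted L1 norm over every gene and class.
    l1_term = np.sum(gene_weights[:, None] * np.abs(B))

    # Group sparsity: for each group and each class, the L2 norm of the
    # coefficients of the genes in that group. Overlapping genes contribute
    # to every group they belong to.
    group_term = 0.0
    for w_g, idx in zip(group_weights, groups):
        group_term += w_g * np.sum(np.linalg.norm(B[list(idx), :], axis=0))

    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)

# Toy example: 3 genes, 2 classes, two groups sharing gene 1.
B = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0]])
groups = [[0, 1], [1, 2]]
value = sparse_overlapping_group_penalty(
    B, groups, gene_weights=np.ones(3), group_weights=np.ones(2))
print(value)  # 0.5 * 8 + 0.5 * 10 = 9.0
```

In this toy example the L1 term is 3 + 4 + 1 = 8 and the group term is 5 + 5 = 10, so with `alpha = 0.5` the penalty is 9. In a full method, a term of this form would be added to the multinomial negative log-likelihood and minimized by a dedicated solver, which is not shown here.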