IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):312-321. doi: 10.1109/TCBB.2017.2767589. Epub 2017 Oct 30.
In gene expression data analysis, the problems of cancer classification and gene selection are closely related: successfully selecting informative genes significantly improves classification performance. Various methods have been proposed to identify informative genes from a large number of candidate genes. However, gene expression data may contain important correlation structures, and some genes can be divided into groups based on their biological pathways. Many existing methods do not take the exact correlation structure within the data into consideration. From both the knowledge discovery and biological perspectives, an ideal gene selection method should therefore take this structural information into account. Moreover, discovering the correlation structure within the data can yield better generalization performance. To discover structural information in the data and improve learning performance, we propose a structured penalized logistic regression model that simultaneously performs feature selection and model learning for gene expression data analysis. An efficient coordinate descent algorithm has been developed to optimize the model. Numerical simulation studies demonstrate that our method is able to select highly correlated features. In addition, results on real gene expression datasets show that the proposed method performs competitively with previous approaches.
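As a rough illustration of penalized logistic regression fitted by coordinate descent, the sketch below uses an elastic-net penalty. This is an assumption made for illustration only: the abstract does not state the form of the structured penalty, so the elastic net (which tends to keep correlated features together) is swapped in as a stand-in and should not be read as the authors' model or algorithm. The names `soft_threshold` and `fit_enet_logistic_cd`, the parameters `lam` and `alpha`, and the use of a fixed 1/4 curvature bound for each coordinate step are all hypothetical choices.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def fit_enet_logistic_cd(X, y, lam=0.05, alpha=0.5, n_sweeps=200, tol=1e-6):
    """Elastic-net logistic regression by cyclic coordinate descent.

    Illustrative stand-in only: the elastic net replaces the paper's
    structured penalty, which the abstract does not specify.
    Labels y must be coded as 0/1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = 0.0
    eta = np.zeros(n)  # linear predictor X @ beta + intercept
    # Per-coordinate curvature bound: logistic Hessian weights are <= 1/4.
    h = (X ** 2).sum(axis=0) / (4.0 * n)
    for _ in range(n_sweeps):
        # Unpenalized intercept step, using the same 1/4 curvature bound.
        prob = 1.0 / (1.0 + np.exp(-eta))
        delta = -4.0 * np.mean(prob - y)
        intercept += delta
        eta += delta
        max_change = abs(delta)
        for j in range(p):
            if h[j] == 0.0:
                continue  # constant-zero column carries no information
            prob = 1.0 / (1.0 + np.exp(-eta))
            g = X[:, j] @ (prob - y) / n  # partial derivative of the loss
            # Minimize the quadratic upper bound plus the elastic-net term.
            new = soft_threshold(h[j] * beta[j] - g, lam * alpha)
            new /= h[j] + lam * (1.0 - alpha)
            delta = new - beta[j]
            if delta != 0.0:
                eta += delta * X[:, j]  # keep the linear predictor in sync
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return intercept, beta
```

Calling `fit_enet_logistic_cd(X, y)` on a standardized samples-by-genes matrix `X` with 0/1 labels `y` returns an intercept and a coefficient vector; genes with nonzero coefficients are the selected ones. The fixed curvature bound keeps each coordinate step simple and monotone at the cost of slower convergence than a Newton-style update.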