

Enhancing Characteristic Gene Selection and Tumor Classification by the Robust Laplacian Supervised Discriminative Sparse PCA.

Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China.

Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.

Publication information

J Chem Inf Model. 2022 Apr 11;62(7):1794-1807. doi: 10.1021/acs.jcim.1c01403. Epub 2022 Mar 30.

Abstract

Characteristic gene selection and tumor classification based on gene expression data play major roles in genomic research. Because gene expression data typically combine a small sample size with high dimensionality, it is common practice to perform dimensionality reduction before applying machine learning-based methods. In this context, classical principal component analysis (PCA) and its improved versions have been widely used. Recently, methods based on supervised discriminative sparse PCA have been developed to improve the performance of dimensionality reduction. However, such methods still have a limitation: most of them do not address robustness to outliers and noise, label information, sparsity, and the intrinsic geometrical structure of the data together in a single objective function. To address this drawback, we propose a novel PCA-based method, the robust Laplacian supervised discriminative sparse PCA (RLSDSPCA), which enforces the L2,1 norm on the error function and incorporates the graph Laplacian into supervised discriminative sparse PCA. To evaluate the efficacy of RLSDSPCA, we applied it to characteristic gene selection and tumor classification using gene expression data. The results demonstrate that RLSDSPCA, when used in combination with other related methods, can effectively identify new pathogenic genes associated with diseases. In addition, RLSDSPCA achieved the best performance among the compared state-of-the-art methods on tumor classification in terms of major performance metrics. The code and data sets used in the study are freely available at http://csbio.njust.edu.cn/bioinf/rlsdspca/.
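The two ingredients named in the abstract, the L2,1 norm and the graph Laplacian, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is at the URL above). The function names `l21_norm` and `knn_graph_laplacian`, the one-sample-per-column layout, the binary k-nearest-neighbor graph, and the default `k=2` are all assumptions made for illustration; conventions differ on whether the L2,1 norm groups rows or columns.

```python
import numpy as np


def l21_norm(E):
    """L2,1 norm, assuming one sample per column: sum of column-wise Euclidean norms."""
    E = np.asarray(E, dtype=float)
    return float(np.sum(np.linalg.norm(E, axis=0)))


def knn_graph_laplacian(X, k=2):
    """Unnormalized Laplacian L = D - W of a symmetric binary k-NN graph.

    X has shape (n_features, n_samples); the graph is built over samples.
    The binary k-NN weighting is an illustrative choice, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    # Pairwise squared Euclidean distances between samples (columns).
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor
    # Connect each sample to its k nearest neighbors, then symmetrize.
    idx = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))  # degree matrix
    return D - W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(50, 10))      # hypothetical residual matrix
    E[:, 0] *= 100.0                   # turn one sample into an outlier
    # Each sample contributes its norm to the L2,1 norm, but its squared norm
    # to the squared Frobenius norm, so the outlier dominates the latter more.
    print("L2,1 norm:", l21_norm(E))
    print("squared Frobenius norm:", float(np.sum(E ** 2)))
    L = knn_graph_laplacian(rng.normal(size=(50, 10)), k=3)
    print("Laplacian shape:", L.shape)
```

Because each sample enters the L2,1 norm through its unsquared norm, a single outlying sample inflates the error less than it would under a squared Frobenius loss, which is the usual motivation for using this norm for robustness. The Laplacian built from a neighborhood graph is a common way to encode local geometric structure among samples.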

