Differential gene expression detection and sample classification using penalized linear regression models.

Author information

Wu Baolin

Affiliation

Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building, MMC 303, Minneapolis, MN 55455, USA.

Publication information

Bioinformatics. 2006 Feb 15;22(4):472-6. doi: 10.1093/bioinformatics/bti827. Epub 2005 Dec 13.

DOI:10.1093/bioinformatics/bti827

PMID:16352654

Abstract

Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and the small number of samples n (p >> n), microarray data pose major challenges for statistical analysis. An obvious problem with 'large p, small n' data is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple and intuitive, and have proved useful in empirical studies. Recently Wu proposed the penalized t/F-statistics with shrinkage by formally using L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using L1-penalized regression models. We also show that penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
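
The link between an L1 penalty and shrinkage can be illustrated with a minimal sketch. It is not the paper's model: it assumes a single coefficient d fitted to one standardized difference z by minimizing 0.5 * (z - d)^2 + lam * |d|. The closed-form minimizer of that objective is the soft-thresholding rule, which has the same form as the shrinkage applied in the nearest shrunken centroid method. The function names, the `z_scores` values, and `lam` are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, lam):
    """Closed-form minimizer of 0.5 * (z - d)**2 + lam * abs(d) over d."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def l1_objective(d, z, lam):
    """L1-penalized squared-error objective for a single coefficient d."""
    return 0.5 * (z - d) ** 2 + lam * abs(d)


def grid_minimizer(z, lam, lo=-10.0, hi=10.0, steps=20001):
    """Brute-force minimizer on an evenly spaced grid, used only as a check."""
    width = (hi - lo) / (steps - 1)
    grid = [lo + i * width for i in range(steps)]
    return min(grid, key=lambda d: l1_objective(d, z, lam))


# Hypothetical standardized class-vs-overall differences for five genes.
z_scores = [2.5, -0.3, 0.8, -1.9, 0.1]
lam = 1.0  # illustrative shrinkage amount, not a value from the paper

# Differences smaller than lam in absolute value are set exactly to zero;
# larger ones are pulled toward zero by lam.
shrunken = [soft_threshold(z, lam) for z in z_scores]

# Genes whose shrunken difference is non-zero are the ones that would still
# contribute to classification under this toy rule.
selected = [j for j, d in enumerate(shrunken) if d != 0.0]
```

In this toy setting, increasing `lam` zeroes out more genes, which is how an L1 penalty performs gene selection and shrinkage at the same time. `grid_minimizer` is there only to confirm numerically that the closed form matches the penalized objective.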
