Luan Yihui, Li Hongzhe
Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-6021, USA.
Biostatistics. 2008 Jan;9(1):100-13. doi: 10.1093/biostatistics/kxm015. Epub 2007 May 18.
One important problem in genomic research is to identify genomic features, such as gene expression levels or DNA single nucleotide polymorphisms (SNPs), that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups, such as genes belonging to the same pathway or SNPs within the same gene. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that dividing the variables into appropriate groups yields better identification of the group features related to the phenotypes. In addition, the prediction mean square errors are smaller than those of the component-wise boosting procedure. We demonstrate the methods with a pathway-based analysis of microarray gene expression data from breast cancer. The results indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance, are important to breast cancer-specific survival.
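To illustrate the general idea of group-wise boosting, the following is a minimal sketch and not the authors' procedure. It assumes squared-error loss, a linear least-squares base learner per group, centered predictor columns, and a fixed step size; the function name `group_l2_boost` and its parameters are hypothetical. At each step every group is fitted to the current residuals, and only the group that reduces the residual sum of squares the most is updated.

```python
import numpy as np

def group_l2_boost(X, y, groups, n_steps=30, step=0.1):
    """Hypothetical group-wise L2 boosting with linear group learners.

    X      : (n, p) array of predictors, columns assumed centered
    y      : (n,) response
    groups : list of column-index lists, one list per group
    """
    intercept = y.mean()
    coef = np.zeros(X.shape[1])
    resid = y - intercept          # negative gradient of squared-error loss
    selected = []                  # index of the group chosen at each step
    for _ in range(n_steps):
        best = None
        for g, cols in enumerate(groups):
            Xg = X[:, cols]
            # Least-squares fit of this group to the current residuals
            beta_g, *_ = np.linalg.lstsq(Xg, resid, rcond=None)
            rss = np.sum((resid - Xg @ beta_g) ** 2)
            if best is None or rss < best[0]:
                best = (rss, g, beta_g)
        _, g, beta_g = best
        # Shrunken update of the winning group only
        coef[groups[g]] += step * beta_g
        resid = resid - step * (X[:, groups[g]] @ beta_g)
        selected.append(g)
    return intercept, coef, selected

# Simulated data (an assumption for illustration): only group 0 is informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 9))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + 0.1 * rng.standard_normal(200)
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

intercept, coef, selected = group_l2_boost(X, y, groups)
print("groups selected:", sorted(set(selected)))
print("coefficients:", np.round(coef, 2))
```

In this sketch the groups that are never selected keep zero coefficients, which is how the group structure leads to selection of whole groups rather than of individual variables.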