Nguyen Tuan S, Rojo Javier
Rice University.
Stat Appl Genet Mol Biol. 2009;8(1):Article 4. doi: 10.2202/1544-6115.1395. Epub 2009 Jan 21.
An important aspect of microarray studies is the prediction of patient survival from gene expression levels. To cope with the high dimensionality of microarray gene expression data, it is customary to first reduce the dimension of the data via a dimension reduction method, and then use the Cox proportional hazards model to predict patient survival. In this paper, we propose a variant of Partial Least Squares, denoted Rank-based Modified Partial Least Squares (RMPLS), that is insensitive to outlying values of both the response and the gene expressions. We assess the performance of RMPLS and several other dimension reduction methods using a simulation model for gene expression data with a censored response. In particular, Principal Component Analysis (PCA), modified Partial Least Squares (MPLS), RMPLS, Sliced Inverse Regression (SIR), Correlation Principal Component Regression (CPCR), Supervised Principal Component Regression (SPCR) and Univariate Selection (UNIV) are compared in terms of the mean squared error of the estimated survival function and of the estimated coefficients of the covariates, and in terms of the bias of the estimated survival function. In the presence of outliers in the response, RMPLS outperforms all other methods in terms of the mean squared error and the bias of the estimated survival function. In the absence of outliers, RMPLS is comparable to MPLS, and both outperform all other methods considered in this study in terms of mean squared error and bias of the estimated survival function.
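The abstract does not spell out the RMPLS algorithm, so the following is only a minimal sketch of the general idea behind rank-based methods, not the authors' procedure. It assumes that replacing raw values by their ranks is what gives insensitivity to outliers, and it compares a Pearson correlation with a rank-based (Spearman) correlation on hypothetical toy data. The names `ranks`, `pearson`, `spearman`, `expression` and `survival` are illustrative and do not come from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Rank-based correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one gene's expression and a survival-like response.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
survival = [2.1, 2.9, 4.2, 5.1, 5.8, 7.0]
# Same response with the last value replaced by an extreme outlier.
survival_outlier = survival[:-1] + [700.0]

print("Pearson :", pearson(expression, survival),
      pearson(expression, survival_outlier))
print("Spearman:", spearman(expression, survival),
      spearman(expression, survival_outlier))
```

In this toy case the outlier leaves the ordering of the response unchanged, so the rank-based correlation is unaffected while the Pearson correlation drops noticeably. A full analysis along the lines of the abstract would additionally require a dimension reduction step and a Cox proportional hazards fit that handles censoring, which this sketch does not attempt.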