Department of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad, Iran.
Department of Statistics, Faculty of Mathematics, Statistics and Computer Sciences, Semnan University, Semnan, Iran.
PLoS One. 2021 Apr 8;16(4):e0245376. doi: 10.1371/journal.pone.0245376. eCollection 2021.
With advances in technology, the analysis of large-scale gene expression data has become feasible and very popular in the era of machine learning. This paper develops an improved ridge approach for genome regression modeling. When the data set exhibits multicollinearity and contains outliers, we consider a robust ridge estimator, namely the rank ridge regression estimator, for parameter estimation and prediction. However, the efficiency of the rank ridge regression estimator depends strongly on the ridge parameter, and in general it is difficult to give a satisfactory answer on how to select it. Because of the good properties of generalized cross validation (GCV) and its simplicity, we use it to choose the optimum value of the ridge parameter. The GCV function balances the precision of the estimators against the bias introduced by ridge estimation. It behaves like an improved estimator of risk and can be used in high-dimensional problems, where the number of explanatory variables is larger than the sample size. Finally, some numerical illustrations are given to support our findings.
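To make the role of GCV concrete, the following minimal sketch selects a ridge parameter by minimizing the GCV criterion over a grid. It is an illustration under stated assumptions, not the method of the paper: it uses the ordinary least-squares ridge estimator rather than the rank ridge regression estimator, it assumes the columns of X and the response y are already centered, and the function name `gcv_ridge`, the grid of candidate values, and the simulated data are all hypothetical. The criterion used is GCV(k) = (RSS(k)/n) / (1 - tr(H(k))/n)^2, where H(k) = X (X'X + kI)^{-1} X' is the ridge hat matrix.

```python
import numpy as np


def gcv_ridge(X, y, ks):
    """Return (best_k, scores) for ordinary ridge regression using GCV.

    Assumes X (n x p) and y (length n) are centered and every k in ks is > 0.
    This is plain least-squares ridge, not the rank-based estimator.
    """
    n = X.shape[0]
    # Thin SVD: X = U diag(s) V^T, so H(k) = U diag(s^2 / (s^2 + k)) U^T.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for k in ks:
        d = s**2 / (s**2 + k)            # shrinkage factors, each in [0, 1)
        fitted = U @ (d * Uty)           # H(k) y
        rss = np.sum((y - fitted) ** 2)  # residual sum of squares
        df = np.sum(d)                   # effective degrees of freedom, tr(H(k))
        scores.append((rss / n) / (1.0 - df / n) ** 2)
    scores = np.array(scores)
    return ks[int(np.argmin(scores))], scores


# Hypothetical high-dimensional example: more predictors (p) than samples (n).
rng = np.random.default_rng(0)
n, p = 40, 100
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # induce multicollinearity
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

ks = np.logspace(-3, 3, 25)  # candidate ridge parameters (assumed grid)
best_k, scores = gcv_ridge(X, y, ks)
print("selected ridge parameter:", best_k)
```

Because every shrinkage factor is strictly below one when k > 0, the effective degrees of freedom stay below n even when p > n, so the GCV denominator remains positive in the high-dimensional setting described above.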