Nengsih Titin Agustin, Bertrand Frédéric, Maumy-Bertrand Myriam, Meyer Nicolas
IRMA, CNRS UMR 7501, Université de Strasbourg, 67084 Strasbourg, Cedex, France.
iCUBE, CNRS UMR 7357, Université de Strasbourg, 67400 Strasbourg, France.
Stat Appl Genet Mol Biol. 2019 Nov 6;18(6):/j/sagmb.2019.18.issue-6/sagmb-2018-0059/sagmb-2018-0059.xml. doi: 10.1515/sagmb-2018-0059.
Partial least squares (PLS) regression is a multivariate method in which the model parameters are estimated using either the SIMPLS or the NIPALS algorithm. PLS regression has been used extensively in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Notably, the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue, and several criteria have been proposed in the literature, including the Q2 criterion and the AIC and BIC criteria. However, how to deal with missing data when using PLS regression remains a matter of debate. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components for a PLS regression on incomplete data sets and on data sets imputed using three methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested the criteria with proportions of missing data ranging from 5% to 50% under different missingness assumptions. Q2 leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.
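To make concrete how NIPALS can produce estimates on incomplete data, the following is a minimal sketch of a single-response (PLS1) NIPALS fit in which each inner product is taken over the observed entries only. It is an illustration under stated assumptions, not the authors' implementation: the function name `nipals_pls1`, the use of `None` to mark a missing entry of X, and the requirement that X and y are already centered and that y is fully observed are all choices made here for the example.

```python
import math


def nipals_pls1(X, y, n_comp):
    """Illustrative PLS1 fit by NIPALS; None marks a missing entry of X.

    Assumes X (list of rows) and y are already centered and y is complete.
    Returns (scores, y_coefs, y_residual).
    """
    n, p = len(X), len(X[0])
    X = [row[:] for row in X]  # work on copies: both are deflated below
    y = y[:]
    scores, y_coefs = [], []

    for _ in range(n_comp):
        # Weights: regress each column of X on y, using observed rows only.
        w = []
        for j in range(p):
            obs = [i for i in range(n) if X[i][j] is not None]
            den = sum(y[i] ** 2 for i in obs)
            w.append(sum(X[i][j] * y[i] for i in obs) / den if den > 0 else 0.0)
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            break  # nothing left to explain
        w = [v / norm for v in w]

        # Scores: regress each row of X on w, using observed columns only.
        t = []
        for i in range(n):
            obs = [j for j in range(p) if X[i][j] is not None]
            den = sum(w[j] ** 2 for j in obs)
            t.append(sum(X[i][j] * w[j] for j in obs) / den if den > 0 else 0.0)
        tt = sum(v * v for v in t)
        if tt == 0:
            break

        # X loadings (observed rows only) and the y coefficient on this score.
        loadings = []
        for j in range(p):
            obs = [i for i in range(n) if X[i][j] is not None]
            den = sum(t[i] ** 2 for i in obs)
            loadings.append(sum(X[i][j] * t[i] for i in obs) / den if den > 0 else 0.0)
        c = sum(y[i] * t[i] for i in range(n)) / tt

        # Deflate the observed entries of X, and y, before the next component.
        for i in range(n):
            for j in range(p):
                if X[i][j] is not None:
                    X[i][j] -= t[i] * loadings[j]
            y[i] -= c * t[i]

        scores.append(t)
        y_coefs.append(c)

    return scores, y_coefs, y


# Small centered example in which y = 2*x1 - x2 exactly.
X_full = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y_full = [2 * a - b for a, b in X_full]

# Same data with one entry of X treated as missing.
X_miss = [row[:] for row in X_full]
X_miss[1][1] = None
scores_miss, coefs_miss, resid_miss = nipals_pls1(X_miss, y_full, 1)
```

On complete data this reduces to ordinary PLS1, so with as many components as X has independent columns the residual of an exactly linear response is numerically zero. With missing entries the same code still runs, but the scores are no longer exactly orthogonal, which is one reason the behavior of component-selection criteria under missingness needs to be studied empirically.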