School of Pharmaceutical Sciences, Central South University, Changsha 410013, PR China.
Analyst. 2016 Oct 7;141(19):5586-97. doi: 10.1039/c6an00764c. Epub 2016 Jul 20.
Variable selection and outlier detection are important steps in chemical modeling. They usually affect each other, and the order in which they are performed also strongly affects the modeling results. Currently, many studies perform these steps separately and in different orders. In this study, we examined the interaction between outliers and variables and compared modeling procedures that apply variable selection and outlier detection in different orders. Because the order can change the interpretation of the model, it is difficult to decide which order is preferable when the predictive performance (prediction error) of the different orders is relatively close. To address this problem, we developed an approach that performs variable selection and outlier detection simultaneously, called Model Adaptive Space Shrinkage (MASS). The approach is based on model population analysis (MPA). Through weighted binary matrix sampling (WBMS) from the model space, a large number of partial least squares (PLS) regression models were built, and the elite models were selected to statistically reassign the weight of each variable and each sample. The whole process was then repeated until the weights of the variables and samples converged. Finally, MASS adaptively found a high-performance model consisting of an optimized variable subset and an optimized sample subset. The combination of these two subsets can be regarded as the cleaned dataset used for chemical modeling. The proposed approach thereby avoids the problem of ordering variable selection and outlier detection. One near-infrared (NIR) spectroscopy dataset and one quantitative structure-activity relationship (QSAR) dataset were used to test the approach. The results demonstrated that MASS is a useful method for data cleaning before building a predictive model.
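The sampling-and-reweighting loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not specify: ordinary least squares (via NumPy) stands in for PLS regression, models are scored by a random hold-out RMSE computed within the sampled subset, WBMS is realized by giving each column round(weight × number of models) ones in random positions, and the initial weights, elite fraction, minimum sample count and tolerance are arbitrary. The names `wbms`, `holdout_rmse` and `mass_sketch` are hypothetical.

```python
import numpy as np


def wbms(weights, n_models, rng):
    """Weighted binary matrix sampling (assumed form): column j of the
    n_models x len(weights) mask gets round(w_j * n_models) ones, placed
    at random rows."""
    mask = np.zeros((n_models, len(weights)), dtype=bool)
    for j, w in enumerate(weights):
        k = int(round(w * n_models))
        mask[rng.permutation(n_models)[:k], j] = True
    return mask


def holdout_rmse(X, y, rng):
    """Fit least squares (a stand-in for PLS) on a random half of the rows
    and return the RMSE on the other half."""
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    return float(np.sqrt(np.mean((y[test] - A_test @ coef) ** 2)))


def mass_sketch(X, y, n_models=200, elite_frac=0.1, max_iter=30,
                tol=1e-3, min_samples=6, seed=0):
    """Iteratively reweight variables and samples from the elite models.
    All default values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w_var = np.full(p, 0.5)   # assumed initial variable weights
    w_smp = np.full(n, 0.8)   # assumed initial sample weights
    n_elite = max(1, int(elite_frac * n_models))
    for _ in range(max_iter):
        var_mask = wbms(w_var, n_models, rng)
        smp_mask = wbms(w_smp, n_models, rng)
        errors = np.full(n_models, np.inf)
        for m in range(n_models):
            rows, cols = smp_mask[m], var_mask[m]
            # Skip sub-models with too few samples or no variables.
            if rows.sum() >= min_samples and cols.any():
                errors[m] = holdout_rmse(X[rows][:, cols], y[rows], rng)
        elite = np.argsort(errors)[:n_elite]
        # New weight = how often a variable/sample appears in elite models.
        new_var = var_mask[elite].mean(axis=0)
        new_smp = smp_mask[elite].mean(axis=0)
        change = max(np.abs(new_var - w_var).max(),
                     np.abs(new_smp - w_smp).max())
        w_var, w_smp = new_var, new_smp
        if change < tol:
            break
    return w_var, w_smp
```

Under these assumptions, calling `mass_sketch(X, y)` on a data matrix `X` (samples by variables) and response `y` returns one weight per variable and one per sample. Variables and samples whose weights end up near 1 would form the retained subsets, while samples whose weights fall toward 0 would be candidate outliers. The cut-off used to turn weights into subsets is a further choice that the abstract does not state.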