Gerretzen Jan, Szymańska Ewa, Bart Jacob, Davies Antony N, van Manen Henk-Jan, van den Heuvel Edwin R, Jansen Jeroen J, Buydens Lutgarde M C
Radboud University, Institute for Molecules and Materials, Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands; TI-COAST, P.O. Box 18, 6160 MD Geleen, The Netherlands.
AkzoNobel, Supply Chain, Research & Development, Strategic Research Group - Measurement & Analytical Science, Zutphenseweg 10, 7418 AJ Deventer, The Netherlands.
Anal Chim Acta. 2016 Sep 28;938:44-52. doi: 10.1016/j.aca.2016.08.022. Epub 2016 Aug 15.
The aim of data preprocessing is to remove data artifacts (such as a baseline, scatter effects or noise) and to enhance the contextually relevant information. Many preprocessing methods exist to deliver one or more of these benefits, but it is difficult to select which method or combination of methods should be used for the specific data being analyzed. Recently, we have shown that a preprocessing selection approach based on Design of Experiments (DoE) enables correct selection of highly appropriate preprocessing strategies within reasonable time frames. In that approach, the focus was solely on improving the predictive performance of the chemometric model. This is, however, only one of the two relevant criteria in modeling: interpretation of the model results can be just as important. Variable selection is often used to achieve such interpretation. Data artifacts, however, may hamper proper variable selection by masking the truly relevant variables. The choice of preprocessing therefore has a huge impact on the outcome of variable selection methods and may thus hamper an objective interpretation of the final model. To enhance such objective interpretation, we here integrate variable selection into the preprocessing selection approach that is based on DoE. We show that the entanglement of preprocessing selection and variable selection improves not only the interpretation, but also the predictive performance of the model. This is achieved by analyzing several experimental data sets for which the truly relevant variables are available as prior knowledge. We show that the resulting selection of variables agrees more closely with the truly informative variables than the selection obtained when the two model aspects are optimized individually. Importantly, the approach presented in this work is generic. Different types of models (e.g. PCR, PLS, …) can be incorporated into it, as well as different variable selection methods and different preprocessing methods, according to the taste and experience of the user. In this work, the approach is illustrated by using PLS as the model and PPRV-FCAM (Predictive Property Ranked Variable using Final Complexity Adapted Models) for variable selection.
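The idea of searching preprocessing strategies and variable subsets jointly can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: the data are simulated, the three preprocessing steps (minimum-offset subtraction, standard normal variate, 3-point moving average) are stand-ins chosen here for the baseline, scatter and noise artifacts named above, the design is a plain two-level full factorial, the model is a one-component PLS1, and variables are ranked by absolute PLS weight rather than by PPRV-FCAM. All function names are hypothetical, and a single validation split replaces the cross-validation one would normally use.

```python
import itertools
import math
import random

def offset_correct(spectrum):
    """Baseline step (assumed): subtract the spectrum's minimum value."""
    low = min(spectrum)
    return [v - low for v in spectrum]

def snv(spectrum):
    """Scatter step (assumed): standard normal variate per spectrum."""
    mean = sum(spectrum) / len(spectrum)
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (len(spectrum) - 1))
    return [(v - mean) / sd for v in spectrum]

def smooth(spectrum):
    """Noise step (assumed): 3-point moving average, shorter windows at the edges."""
    n = len(spectrum)
    return [sum(spectrum[max(0, i - 1):min(n, i + 2)]) /
            len(spectrum[max(0, i - 1):min(n, i + 2)]) for i in range(n)]

STEPS = [("baseline", offset_correct), ("scatter", snv), ("smoothing", smooth)]

def preprocess(X, design_row):
    """Apply, in order, the steps switched on in one row of the design."""
    for (_, func), on in zip(STEPS, design_row):
        if on:
            X = [func(row) for row in X]
    return X

def fit_pls1(X, y, cols):
    """Fit a one-component PLS1 model on the given columns."""
    x_mean = [sum(row[j] for row in X) / len(X) for j in cols]
    y_mean = sum(y) / len(y)
    Xc = [[row[j] - m for j, m in zip(cols, x_mean)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][k] * yc[i] for i in range(len(X))) for k in range(len(cols))]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]                      # unit-length weight vector
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]   # scores
    tt = sum(v * v for v in t)
    b = sum(ti * yi for ti, yi in zip(t, yc)) / tt if tt else 0.0
    return cols, x_mean, y_mean, w, b

def predict(model, X):
    cols, x_mean, y_mean, w, b = model
    return [y_mean + b * sum((row[j] - m) * wk for j, m, wk in zip(cols, x_mean, w))
            for row in X]

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def evaluate(X_tr, y_tr, X_va, y_va, design_row, n_keep):
    """Preprocess, select variables, refit, and score one strategy."""
    P_tr, P_va = preprocess(X_tr, design_row), preprocess(X_va, design_row)
    all_cols = list(range(len(P_tr[0])))
    weights = fit_pls1(P_tr, y_tr, all_cols)[3]
    # Stand-in for PPRV-FCAM: keep the variables with the largest |PLS weight|.
    ranked = sorted(all_cols, key=lambda j: abs(weights[j]), reverse=True)
    kept = sorted(ranked[:n_keep])
    model = fit_pls1(P_tr, y_tr, kept)
    return rmse(y_va, predict(model, P_va)), kept

def simulate(n, p, band, rng):
    """Simulated spectra: signal in `band`, plus a random offset and noise."""
    X, y = [], []
    for _ in range(n):
        conc, offset = rng.uniform(0.0, 1.0), rng.uniform(0.0, 2.0)
        X.append([offset + rng.gauss(0.0, 0.05) + (conc if j in band else 0.0)
                  for j in range(p)])
        y.append(conc)
    return X, y

rng = random.Random(0)
band = range(10, 15)                               # known informative variables
X_train, y_train = simulate(40, 30, band, rng)
X_valid, y_valid = simulate(20, 30, band, rng)

# Two-level full factorial design: every on/off combination of the steps.
design = list(itertools.product([0, 1], repeat=len(STEPS)))
results = []
for row in design:
    error, kept = evaluate(X_train, y_train, X_valid, y_valid, row, n_keep=5)
    results.append((error, row, kept))

best_error, best_row, best_kept = min(results)
print("best strategy:", dict(zip([name for name, _ in STEPS], best_row)))
print("validation RMSE:", round(best_error, 4))
print("selected variables:", best_kept)
```

Because variable selection runs inside each design row, every preprocessing strategy is judged on the variables it actually exposes. Comparing `best_kept` with the simulated `band` mimics the check against prior knowledge described above.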