A Comparative Investigation of the Combined Effects of Pre-Processing, Wavelength Selection, and Regression Methods on Near-Infrared Calibration Model Performance.

Author information

Wan Jian, Chen Yi-Chieh, Morris A Julian, Thennadil Suresh N

Affiliations

1 School of Marine Science and Engineering, Plymouth University, Plymouth, UK.

2 Department of Chemical and Process Engineering, University of Strathclyde, Glasgow, UK.

Publication information

Appl Spectrosc. 2017 Jul;71(7):1432-1446. doi: 10.1177/0003702817694623. Epub 2017 Mar 30.

DOI:10.1177/0003702817694623

PMID:28357879

Abstract

Near-infrared (NIR) spectroscopy is widely used in fields ranging from pharmaceutics to the food industry for analyzing the chemical and physical properties of substances. Its advantages over other analytical techniques include the availability of a physical interpretation of spectral data, nondestructive and rapid measurement, and little or no need for sample preparation. The successful application of NIR spectroscopy relies on three main aspects: pre-processing of spectral data to eliminate nonlinear variations due to temperature, light scattering, and other effects; selection of the wavelengths that contribute useful information; and identification of suitable calibration models using linear or nonlinear regression. Several methods have been developed for each of these aspects, and many comparative studies exist for an individual aspect or for some combinations. However, comparative studies of the interactions among all three aspects are still lacking; such studies can shed light on the role each aspect plays in the calibration and on how to combine methods from each aspect to obtain the best calibration model. This paper provides such a comparative study based on four benchmark data sets, using three typical pre-processing methods, namely orthogonal signal correction (OSC), extended multiplicative signal correction (EMSC), and optical path-length estimation and correction (OPLEC); two existing wavelength selection methods, namely stepwise forward selection (SFS) and genetic algorithm optimization combined with partial least squares regression for spectral data (GAPLSSP); and four popular regression methods, namely partial least squares (PLS), least absolute shrinkage and selection operator (LASSO), least squares support vector machine (LS-SVM), and Gaussian process regression (GPR). The comparative study indicates that, in general, pre-processing of spectral data can play a significant role in the calibration while wavelength selection plays a marginal role, and that the combination of certain pre-processing, wavelength selection, and nonlinear regression methods can achieve superior performance over traditional linear regression-based calibration.
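
To make the pre-processing step in the abstract concrete, the sketch below implements plain multiplicative scatter correction (MSC), the simpler precursor of the EMSC method named above. It is not the authors' code and not EMSC itself: the function name `msc` and the toy spectra are assumptions made purely for illustration. Each spectrum is regressed on the mean spectrum, and the fitted offset and slope are then removed.

```python
from statistics import fmean


def msc(spectra):
    """Plain multiplicative scatter correction against the mean spectrum.

    Illustrative only: this is basic MSC, not the EMSC variant used in the paper.
    """
    n_wavelengths = len(spectra[0])
    # Reference spectrum: mean absorbance at each wavelength.
    reference = [fmean(s[j] for s in spectra) for j in range(n_wavelengths)]
    ref_mean = fmean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)

    corrected = []
    for spectrum in spectra:
        spec_mean = fmean(spectrum)
        # Least-squares fit: spectrum ~ offset + slope * reference.
        slope = sum(
            (r - ref_mean) * (x - spec_mean)
            for r, x in zip(reference, spectrum)
        ) / ref_ss
        offset = spec_mean - slope * ref_mean
        # Remove the additive and multiplicative scatter terms.
        corrected.append([(x - offset) / slope for x in spectrum])
    return corrected


# Hypothetical toy data: one base spectrum with different offsets and gains.
base = [0.10, 0.35, 0.80, 0.55, 0.20]
raw = [[a + b * v for v in base] for a, b in [(0.0, 1.0), (0.3, 1.5), (-0.1, 0.7)]]
clean = msc(raw)
```

Because the toy spectra differ only by an offset and a gain, the corrected spectra coincide after MSC. Real NIR spectra also carry chemical variation, which the correction is meant to leave in place for the wavelength selection and regression steps.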
