Robust variable selection in the framework of classification with label noise and outliers: Applications to spectroscopic data in agri-food.

Affiliations

Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy.

Univ. Lille, CNRS, UMR 8516, LASIRE-Laboratoire avancé de spectroscopie pour les interactions, la réactivité et l'environnement, F-59000, Lille, France.

Publication information

Anal Chim Acta. 2021 Apr 8;1153:338245. doi: 10.1016/j.aca.2021.338245. Epub 2021 Feb 1.

DOI:10.1016/j.aca.2021.338245

PMID:33714445

Abstract

Classification of high-dimensional spectroscopic data is a common task in analytical chemistry. Well-established procedures such as support vector machines (SVMs) and partial least squares discriminant analysis (PLS-DA) are the most common methods for tackling this supervised learning problem. Nonetheless, these models can be difficult to interpret, and solutions based on feature selection are often adopted because they automatically identify the most informative wavelengths. Unfortunately, in delicate applications such as food authenticity, mislabeled and adulterated spectra can occur in the calibration and/or validation sets, with dramatic effects on model development, prediction accuracy, and robustness. Motivated by these issues, the present paper proposes a robust model-based method that simultaneously performs variable selection, outlier detection, and label noise detection. We demonstrate the effectiveness of our proposal on three agri-food spectroscopic studies, in which several forms of perturbation are considered. Our approach succeeds in reducing problem complexity, identifying anomalous spectra, and attaining competitive predictive accuracy with a very small number of selected wavelengths.
