Lopez Eneko, Etxebarria-Elezgarai Jaione, Amigo Jose Manuel, Seifert Andreas
CIC NanoGUNE BRTA, Tolosa Hiribidea 76, San Sebastián, 20018, Spain; Department of Physics, University of the Basque Country (UPV/EHU), San Sebastián, 20018, Spain.
CIC NanoGUNE BRTA, Tolosa Hiribidea 76, San Sebastián, 20018, Spain.
Anal Chim Acta. 2023 Sep 22;1275:341532. doi: 10.1016/j.aca.2023.341532. Epub 2023 Jun 17.
Machine learning is the art of combining a set of measurement data and predictive variables to forecast future events. New modelling approaches of increasing sophistication appear in the literature every day, yet far less attention is paid to the crucial stage of validation. Validation assesses whether the model reliably links the measurements to the predictive variables. A model can, however, be validated and cross-validated in ways that appear reliable and still misrepresent the real nature of the data, so that it cannot be used to predict external samples. This manuscript shows in a didactic manner how important the data structure is when a model is constructed, and how easy it is to obtain models that look promising when cross-validation and external validation strategies are poorly designed. A comprehensive overview of the main validation strategies is given, illustrated by three different scenarios, all of them focused on classification.
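The point about data structure can be made concrete with a minimal sketch. It is not taken from the manuscript: it assumes, purely for illustration, a dataset in which each specimen is measured several times, and it compares a naive random split with a split that keeps all replicates of a specimen in the same fold. The function names and the 12 x 5 layout are hypothetical.

```python
import random
from collections import defaultdict


def naive_folds(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds, ignoring groups."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grouped_folds(groups, k):
    """Assign whole groups to folds so that no group is split across folds."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds = [[] for _ in range(k)]
    # Largest groups first, each placed into the currently smallest fold.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        min(folds, key=len).extend(by_group[g])
    return folds


def leaked_fraction(folds, groups):
    """Fraction of test samples whose group also appears in the training folds."""
    leaked = total = 0
    for f, test in enumerate(folds):
        train_groups = {
            groups[i] for j, fold in enumerate(folds) if j != f for i in fold
        }
        leaked += sum(groups[i] in train_groups for i in test)
        total += len(test)
    return leaked / total


# Hypothetical data layout: 12 specimens, 5 replicate measurements each.
groups = [specimen for specimen in range(12) for _ in range(5)]

naive = naive_folds(len(groups), k=4)
grouped = grouped_folds(groups, k=4)

print("naive split, leaked fraction:  ", leaked_fraction(naive, groups))
print("grouped split, leaked fraction:", leaked_fraction(grouped, groups))
```

Under this assumption, the naive split places replicates of the same specimen on both sides of the train/test boundary, so a cross-validated score may look promising while saying little about unseen specimens. The grouped split removes that particular source of optimism, although it does not by itself guarantee a sound external validation.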