Roccetti Marco, Delnevo Giovanni, Casini Luca, Mirri Silvia
Department of Computer Science and Engineering, University of Bologna, Via Mura Anteo Zamboni 7, 40127 Bologna, Italy.
J Big Data. 2021;8(1):39. doi: 10.1186/s40537-021-00428-8. Epub 2021 Feb 25.
Deep learning models are tools for data analysis suited to approximating (non-linear) relationships among variables for the best prediction of an outcome. While these models can be used to answer many important questions, their utility is still harshly criticized, because it is extremely challenging to identify which data are most adequate to represent a given phenomenon of interest. From a recent experience developing a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that an appreciable deterioration of the prediction accuracy can occur if one tries to train a deep learning model by adding specific device descriptors based on categorical data. This can happen because of an excessive increase in the dimensionality of the data, with a corresponding loss of statistical significance. After several unsuccessful experiments with alternative methodologies that either reduce the dimensionality of the data space or employ more traditional machine learning algorithms, we changed the training strategy, reconsidering those categorical data in the light of a […]. In essence, we used those categorical descriptors not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the […] rule. With this data adjustment, we trained a better-performing deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87-90%, even in the presence of categorical descriptors.
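The contrast between feeding categorical descriptors to the model and using them to reshape the dataset can be illustrated with a minimal sketch. The rule the authors applied is not legible in this text, so the cumulative-frequency threshold below (`coverage`) is purely an assumption chosen for illustration, as are the field names `meter_model` and `reading` and both function names.

```python
from collections import Counter

def one_hot_width(records, key):
    """Number of input columns a one-hot encoding of `key` would add."""
    return len({r[key] for r in records})

def reshape_by_category(records, key, coverage=0.8):
    """Use a categorical descriptor to select rows instead of feeding it to the model.

    Assumed rule (not taken from the source): keep the most frequent categories
    until they cover at least `coverage` of all rows, then drop the descriptor
    column so the input dimensionality does not grow.
    """
    counts = Counter(r[key] for r in records)
    total = len(records)
    kept, covered = set(), 0
    for category, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.add(category)
        covered += n
    return [
        {k: v for k, v in r.items() if k != key}
        for r in records
        if r[key] in kept
    ]

# Hypothetical toy data: ten readings spread over four device models.
records = (
    [{"meter_model": "A", "reading": i} for i in range(6)]
    + [{"meter_model": "B", "reading": i} for i in range(2)]
    + [{"meter_model": "C", "reading": 0}, {"meter_model": "D", "reading": 1}]
)

extra_columns = one_hot_width(records, "meter_model")
reshaped = reshape_by_category(records, "meter_model", coverage=0.8)
print(extra_columns, len(reshaped))
```

In this toy case, one-hot encoding would add one input column per device model, whereas the reshaped dataset keeps its original feature width and simply contains fewer, more homogeneous rows.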