Adegbenjo Adeyemi O, Ngadi Michael O
Department of Bioresource Engineering, McGill University, 21111 Lakeshore Road, Ste-Anne-de-Bellevue, Montreal, QC H9X 3V9, Canada.
Process Quality Engineering, School of Engineering and Technology, Conestoga College Institute of Technology and Advanced Learning, 299 Doon Valley Drive, Kitchener, ON N2G 4M4, Canada.
Foods. 2024 Oct 17;13(20):3300. doi: 10.3390/foods13203300.
Imbalanced data arise in most fields. The problem has been identified as a major bottleneck in machine learning and data mining and is becoming a serious concern in food processing applications. Inappropriate analysis of agricultural and food processing data has been identified as limiting the robustness of predictive models built for agri-food applications. Because rare cases occur infrequently, classification rules that detect small groups are scarce, so samples belonging to small classes are largely misclassified. Most existing machine learning algorithms, including K-means, decision trees, and support vector machines (SVMs), are not optimal for handling imbalanced data. Consequently, models developed from such data are prone to rejection and non-adoption in real industrial and commercial settings. This paper showcases the reality of the imbalanced data problem in agri-food applications and proposes state-of-the-art artificial intelligence approaches for handling it, including data resampling, one-class learning, ensemble methods, feature selection, and deep learning techniques. It further evaluates existing and newer metrics that are well suited to imbalanced data. Properly analyzing imbalanced data from food processing research will improve the accuracy of results and of model development, which will in turn enhance the acceptability and adoptability of innovations and inventions.
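The point about small classes being misclassified, and about metrics suited to imbalanced data, can be illustrated with a minimal sketch. The abstract does not name the metrics the paper evaluates, so the ones computed here (recall, specificity, F1, balanced accuracy, G-mean, and the Matthews correlation coefficient) are an assumed, commonly used selection. The 95:5 class split, the "always predict the majority class" classifier, and the function names are hypothetical and chosen only for illustration.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def imbalance_aware_metrics(y_true, y_pred, positive=1):
    """Return plain accuracy next to metrics that account for the minority class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn

    accuracy = (tp + tn) / total
    # Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # minority-class hit rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # majority-class hit rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = (recall + specificity) / 2
    g_mean = math.sqrt(recall * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical data: 95 majority-class samples (0) and 5 rare-class samples (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

metrics = imbalance_aware_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this illustrative case, plain accuracy is high even though no rare-class sample is detected, while recall, F1, G-mean, and the Matthews correlation coefficient all fall to zero and balanced accuracy drops to chance level. This is the kind of discrepancy that motivates reporting imbalance-aware metrics.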