Handling the Imbalanced Problem in Agri-Food Data Analysis.

Authors

Adegbenjo Adeyemi O, Ngadi Michael O

Affiliations

Department of Bioresource Engineering, McGill University, 21111 Lakeshore Road, Ste-Anne-de-Bellevue, Montreal, QC H9X 3V9, Canada.

Process Quality Engineering, School of Engineering and Technology, Conestoga College Institute of Technology and Advanced Learning, 299 Doon Valley Drive, Kitchener, ON N2G 4M4, Canada.

Publication information

Foods. 2024 Oct 17;13(20):3300. doi: 10.3390/foods13203300.

DOI:10.3390/foods13203300

PMID:39456362

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11507408/

Abstract

Imbalanced data occur in most fields. The problem has been identified as a major bottleneck in machine learning and data mining, and it is becoming a serious concern in food processing applications. Inappropriate analysis of agricultural and food processing data has been identified as limiting the robustness of predictive models built for agri-food applications. Because rare cases occur infrequently, classification rules that detect small groups are scarce, so samples belonging to small classes are largely misclassified. Most existing machine learning algorithms, including K-means, decision trees, and support vector machines (SVMs), are not optimal for handling imbalanced data. Consequently, models developed from such data are prone to rejection and non-adoption in real industrial and commercial settings. This paper demonstrates the reality of the imbalanced data problem in agri-food applications and proposes state-of-the-art artificial intelligence approaches for handling it, including data resampling, one-class learning, ensemble methods, feature selection, and deep learning techniques. The paper further evaluates existing and newer metrics that are well suited to imbalanced data. Correctly analyzing imbalanced data in food processing research will improve the accuracy of results and of the models developed, which will in turn enhance the acceptability and adoptability of innovations and inventions.
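
To make the abstract's two central ideas concrete, the following minimal Python sketch shows (a) why plain accuracy can mislead on an imbalanced data set and (b) one simple form of data resampling, random oversampling of the minority class. It is an illustration only, not the authors' implementation: the abstract does not say which metrics or resampling variants the paper evaluates, so the choice of recall, F1, balanced accuracy, and the Matthews correlation coefficient (MCC) is an assumption, as are the function names and the 95:5 toy data set.

```python
import math
import random
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def imbalance_aware_metrics(y_true, y_pred, positive=1):
    """Compute accuracy alongside metrics that are less dominated by the majority class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 by convention when undefined
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": mcc,
    }


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

metrics = imbalance_aware_metrics(y_true, y_pred)
print(metrics)

# Placeholder one-dimensional features, used only to show the resampling step.
features = [[float(i)] for i in range(100)]
bal_x, bal_y = random_oversample(features, y_true)
print(Counter(bal_y))
```

On this toy data the majority-only classifier scores high accuracy while its recall on the minority class is zero and its balanced accuracy is no better than chance, which is the failure mode the abstract describes. In practice, resampling should be applied to the training split only, so that duplicated minority samples do not leak into the evaluation data.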

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/91ae/11507408/6d2a2981fced/foods-13-03300-g001.jpg
