Using preprocessed datasets to construct and interpret multiclass identification models.

Author information

Wang Cong, Fu Yufeng, Wan Ran, Zhao Le, Wang Hongbo, Guo Junwei, Liu Qiang, Li Shan, Ma Shengtao, Wang Zhicai, Huang Wei, Liu Huimin, Yang Song, Nie Cong

Affiliations

Key Laboratory of Tobacco Chemistry, Zhengzhou Tobacco Research Institute of China National Tobacco Corporation (CNTC), Zhengzhou, China.

Technology Center, China Tobacco Henan Industrial Co., Ltd., Zhengzhou, China.

Publication information

Front Plant Sci. 2025 Aug 20;16:1597673. doi: 10.3389/fpls.2025.1597673. eCollection 2025.

DOI:10.3389/fpls.2025.1597673

PMID:40909899

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12405167/

Abstract

INTRODUCTION

Image and near-infrared (NIR) spectroscopic data are widely used for constructing analytical models in precision agriculture. While model interpretation can provide valuable insights for quality control and improvement, the inherent ambiguity of individual image pixels or spectral data points often hinders practical interpretability when raw data are used directly. Furthermore, imbalanced datasets can lead to model overfitting and, consequently, poor robustness. Therefore, developing alternative approaches for constructing interpretable and robust models from these data types is crucial.

METHODS

This study proposes using preprocessed data (specifically, morphological features extracted from images and chemical component concentrations predicted from NIR spectra) to build multiclass identification models. Combined-kernel support vector machine (SVM) models were proposed to identify the variety of rice and the cultivation region of tobacco. The kernel parameters and the proportions of the different kernel function types were determined by particle swarm optimization (PSO), which makes the approach self-adaptive. Feature importance and contribution analyses were conducted using Shapley additive explanations (SHAP).
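The idea of letting PSO choose both a kernel parameter and the share of each kernel type can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the abstract does not say which kernels are combined, so an RBF and a polynomial kernel are used; the toy data `X`, `y` are made up; the PSO coefficients are common textbook values; and kernel-target alignment stands in for the objective, which in practice would more likely be the cross-validated accuracy of an SVM trained with the combined kernel.

```python
import math
import random

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0)^degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def combined_kernel(x, y, weight, gamma):
    """Convex combination of both kernels; `weight` is the RBF share in [0, 1]."""
    return weight * rbf_kernel(x, y, gamma) + (1.0 - weight) * poly_kernel(x, y)

def alignment(params, X, y):
    """Kernel-target alignment (stand-in objective, assumed for this sketch)."""
    weight, gamma = params
    num = k_norm = y_norm = 0.0
    for xi, yi in zip(X, y):
        for xj, yj in zip(X, y):
            k = combined_kernel(xi, xj, weight, gamma)
            num += k * yi * yj
            k_norm += k * k
            y_norm += (yi * yj) ** 2
    return num / math.sqrt(k_norm * y_norm)

def pso_maximize(objective, bounds, n_particles=20, n_iters=60, seed=0):
    """Minimal global-best PSO that maximizes `objective` inside the box `bounds`."""
    rng = random.Random(seed)
    inertia, c_personal, c_global = 0.7, 1.5, 1.5  # assumed textbook settings
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to the search box so the weight stays in [0, 1] and gamma > 0.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val

# Hypothetical two-class toy data with labels in {-1, +1}.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y = [-1, -1, -1, 1, 1, 1]

# Search over (RBF share, gamma); the bounds are illustrative choices.
best_params, best_score = pso_maximize(
    lambda p: alignment(p, X, y), bounds=[(0.0, 1.0), (0.01, 10.0)]
)
print("RBF share, gamma:", best_params, "alignment:", best_score)
```

Because the mixing weight is just one more coordinate of each particle, no manual choice between kernel types is needed, which is one way to read the "self-adaptive" claim above.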

RESULTS

The resulting models demonstrated high robustness and accuracy, achieving classification success rates of 97.9% and 97.4% under n-fold cross-validation on the rice and tobacco datasets, respectively, and 97.7% on an independent test set (tobacco dataset 2). The SHAP analysis identified key variables and elucidated their specific contributions to the model predictions.
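To make "specific contributions to the model predictions" concrete, the following minimal sketch computes exact Shapley values, the quantity that SHAP estimates, for a tiny made-up model. The function `toy_model`, the input `x`, and the all-zero `baseline` are hypothetical; exhaustive enumeration of feature subsets is only feasible for a handful of features, so for realistic models an approximation would be needed.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value; the rest stay at baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(features):
    """Hypothetical model: one additive term plus one pairwise interaction."""
    return 2.0 * features[0] + features[1] * features[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print("contributions:", phi)
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction term is split equally between the two features involved. These two properties are what allow per-feature contributions to be read off a model that is otherwise opaque.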

DISCUSSION

This study expands the applicability of image and NIR spectroscopic data, offering researchers an effective methodology for investigating factors crucial to the quality control and improvement of agricultural products.
