Discrimination of Gentiana and Its Related Species Using IR Spectroscopy Combined with Feature Selection and Stacked Generalization.

Affiliations

Yunnan Herbal Laboratory, Institute of Herb Biotic Resources, School of Life and Sciences, Yunnan University, Kunming 650091, China.

The International Joint Research Center for Sustainable Utilization of Cordyceps Bioresources in China (Yunnan) and Southeast Asia, Yunnan University, Kunming 650091, China.

Publication information

Molecules. 2020 Mar 23;25(6):1442. doi: 10.3390/molecules25061442.

Abstract

Gentiana is one of the largest genera of Gentianoideae, and most of its species have potential pharmaceutical value and are used in local traditional medicine. Because phytochemical composition and bioactive compounds differ among species, accurate identification of authentic species is crucial. This paper studies the feasibility of using infrared spectroscopy combined with chemometric analysis to identify Gentiana and its related species. A total of 180 batches of raw spectral fingerprints were obtained from 18 species of Gentiana and a related genus by near-infrared (NIR: 10,000-4000 cm⁻¹) and Fourier transform mid-infrared (FT-MIR: 4000-600 cm⁻¹) spectroscopy. First, principal component analysis (PCA) was used to explore the natural grouping of the 180 samples. Second, random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN) models were built on the full spectra (1487 NIR variables and 1214 FT-MIR variables, respectively). On both the calibration and prediction sets, the MIR-SVM model achieved a higher classification accuracy than the other models. Five feature selection strategies, VIP (variable importance in the projection), Boruta, GARF (genetic algorithm combined with random forest), GASVM (genetic algorithm combined with support vector machine), and Venn diagram calculation, were then used to reduce the number of variables for modeling; 101 NIR and 73 FT-MIR bands were ultimately selected as feature variables. Third, stacking models were built on the optimal spectral dataset. Most of the stacking models performed better than the full-spectra models. RF and SVM as base learners, combined with an SVM meta-classifier, formed the optimal stacked generalization strategy. For the SG-Ven-MIR-SVM model, the accuracy (ACC) on both the calibration set and the validation set was 100%, and sensitivity (SE), specificity (SP), efficiency (EFF), Matthews correlation coefficient (MCC), and Cohen's kappa coefficient (K) were all 1, showing that this model had the best authenticity identification performance. These results indicate that stacked generalization combined with feature selection is probably an important technique for improving the predictive accuracy of classification models and avoiding overfitting. The study can provide a valuable reference for the safety and effectiveness of the clinical application of medicinal Gentiana.
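
To make the reported evaluation parameters concrete, the following minimal Python sketch computes SE, SP, EFF, MCC, and Cohen's kappa from the confusion counts of one species against all others. It is an illustration only, not the authors' code: the function name `binary_metrics` and the counts are hypothetical, and EFF is assumed here to be the geometric mean of SE and SP, which the abstract does not define.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Return SE, SP, EFF, MCC and Cohen's kappa for one class vs. the rest.

    Assumes both the positive and the negative class contain samples.
    """
    n = tp + fn + fp + tn
    se = tp / (tp + fn)        # sensitivity: true-positive rate
    sp = tn / (tn + fp)        # specificity: true-negative rate
    eff = math.sqrt(se * sp)   # ASSUMPTION: EFF as geometric mean of SE and SP

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 1.0

    return {"SE": se, "SP": sp, "EFF": eff, "MCC": mcc, "K": kappa}


# Hypothetical counts: a class with 10 samples among 180, classified perfectly.
perfect = binary_metrics(tp=10, fn=0, fp=0, tn=170)

# Hypothetical counts: the same class with two misses and three false alarms.
imperfect = binary_metrics(tp=8, fn=2, fp=3, tn=167)

print(perfect)
print(imperfect)
```

With no misclassified samples every parameter equals 1, which is the situation the abstract reports for the SG-Ven-MIR-SVM model; any false negative or false positive pulls MCC and kappa below 1.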

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/72a9/7144467/3c596f35026c/molecules-25-01442-g001.jpg
