Hong Xue-Zhen, Fu Xian-Shu, Wang Zheng-Liang, Zhang Li, Yu Xiao-Ping, Ye Zi-Hong
College of Quality & Safety Engineering, China Jiliang University, Xueyuan Street, Xiasha Higher Education District, Hangzhou 310018, China.
BioCircuits Institute, University of California, La Jolla, San Diego, CA 92093, USA.
J Anal Methods Chem. 2019 Jan 3;2019:1537568. doi: 10.1155/2019/1537568. eCollection 2019.
This work presents a reliable approach for tracing the geographical origins of teas despite the changes in teas caused by different harvest years. A total of 1447 tea samples collected from various areas in 2014 (660 samples) and 2015 (787 samples) were measured by Fourier transform near-infrared (FT-NIR) spectroscopy. Seven classifiers trained on the 2014 dataset all succeeded in tracing the origins of samples collected in 2014; however, they all failed to predict the origins of the 2015 samples because of differing data distributions and an imbalanced dataset. Three undersampling approaches based on outlier detection were then proposed: one-class SVM (OC-SVM), isolation forest, and elliptic envelope. As a result, the highest macro average recall (MAR) for the 2015 dataset improved from 56.86% to 73.95% (obtained with an SVM classifier). A model updating approach was also applied, and the prediction MAR improved significantly as the updating rate increased. The best MAR (90.31%) was first achieved by OC-SVM undersampling combined with an SVM classifier at a 50% updating rate.
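The undersampling idea and the MAR metric described in the abstract can be illustrated with a minimal sketch. It assumes that "outlier detection based undersampling" means fitting a one-class SVM to each over-represented class and discarding the training samples it flags as outliers; the abstract does not spell out the exact procedure, so this is one plausible reading and not the authors' implementation. The function names (`undersample_with_ocsvm`, `macro_average_recall`), the `nu` value, and the synthetic two-class data standing in for FT-NIR spectra are all hypothetical. MAR is computed as the unweighted mean of per-class recalls, which matches scikit-learn's `recall_score(..., average="macro")`.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.svm import SVC, OneClassSVM


def undersample_with_ocsvm(X, y, classes_to_trim, nu=0.2):
    """Drop samples of the listed classes that a one-class SVM flags as outliers.

    Assumption: this is one plausible form of outlier-detection-based
    undersampling; nu=0.2 is an arbitrary illustrative value.
    """
    keep = np.ones(len(y), dtype=bool)
    for label in classes_to_trim:
        idx = np.flatnonzero(y == label)
        detector = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
        flags = detector.fit_predict(X[idx])  # +1 = inlier, -1 = outlier
        keep[idx[flags == -1]] = False
    return X[keep], y[keep]


def macro_average_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (MAR)."""
    labels = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in labels]))


# Synthetic stand-in data (NOT the tea spectra): an imbalanced training set
# and a slightly shifted test set, mimicking a change between harvest years.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (300, 5)),
                     rng.normal(2.0, 1.0, (40, 5))])
y_train = np.array([0] * 300 + [1] * 40)
X_test = np.vstack([rng.normal(0.3, 1.0, (100, 5)),
                    rng.normal(2.3, 1.0, (100, 5))])
y_test = np.array([0] * 100 + [1] * 100)

# Trim only the majority class, then train an SVM on the reduced set.
X_bal, y_bal = undersample_with_ocsvm(X_train, y_train, classes_to_trim=[0])
clf = SVC(kernel="rbf", gamma="scale").fit(X_bal, y_bal)
y_pred = clf.predict(X_test)

mar = macro_average_recall(y_test, y_pred)
# Cross-check against scikit-learn's built-in macro-averaged recall.
assert np.isclose(mar, recall_score(y_test, y_pred, average="macro"))
print(f"kept {len(y_bal)} of {len(y_train)} training samples, MAR = {mar:.3f}")
```

Swapping `OneClassSVM` for scikit-learn's `IsolationForest` or `EllipticEnvelope` would give analogues of the other two detectors named in the abstract, since all three expose `fit_predict` with the same +1/-1 convention. MAR is a natural choice here because it weights every origin class equally regardless of how many samples each class has.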