Shi Yan, Liu Wei, Wei Feng, Ma Shuang-Cheng
National Institutes for Food and Drug Control, Beijing 102629, China.
Zhongguo Zhong Yao Za Zhi. 2023 Aug;48(16):4370-4380. doi: 10.19540/j.cnki.cjcmm.20230427.302.
This study aimed to establish a machine learning method for accurately predicting the commodity specifications of Fritillariae Cirrhosae Bulbus and to explore the application of data augmentation in drug analysis. The correlation optimized warping (COW) algorithm was used to align the peaks in the UPLC-QDA multi-channel superimposed data of 30 batches of samples, and the data were then normalized. Unsupervised learning methods, including cluster analysis, principal component analysis (PCA), and correlation analysis, were used to explore the general characteristics of the data. The logistic regression algorithm was then used for supervised learning on the data, and a conditional tabular generative adversarial network (CTGAN) was used to generate a large amount of data. Logistic regression classification models were trained separately using the real data and the data generated by CTGAN, and these models were evaluated. The logistic regression model trained with real data achieved cross-validation and test set accuracies of 0.95 and 1.00, respectively, while the logistic regression model trained with both real and CTGAN-generated data achieved cross-validation and test set accuracies of 0.99 and 1.00, respectively. The results indicate that machine learning can accurately classify Songbei, Qingbei, and Lubei based on UPLC-QDA detection data. CTGAN-generated data can partially compensate for the scarcity of data in drug analysis, improving the accuracy and predictive ability of machine learning models.
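The normalization, logistic regression, and cross-validation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic stand-ins for aligned chromatograms (30 samples, three classes), COW alignment and CTGAN augmentation are omitted, and a plain gradient-descent softmax regression is used in place of whatever logistic regression implementation the study used. All function names and parameters (`make_synthetic`, `normalize_rows`, `fit_softmax`, `cross_val_accuracy`, the learning rate, the fold count) are hypothetical choices made for illustration.

```python
import numpy as np


def make_synthetic(n_per_class=10, n_points=60, seed=0):
    """Synthetic 'chromatograms': one Gaussian peak per class plus noise (assumed data)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    X, y = [], []
    for label, center in enumerate([15, 30, 45]):
        for _ in range(n_per_class):
            peak = np.exp(-0.5 * ((t - center) / 3.0) ** 2)
            noise = rng.uniform(0.0, 0.05, n_points)
            X.append(rng.uniform(0.5, 2.0) * peak + noise)
            y.append(label)
    return np.array(X), np.array(y)


def normalize_rows(X):
    """Scale each sample (row) to unit total intensity."""
    return X / X.sum(axis=1, keepdims=True)


def fit_softmax(X, y, n_classes, lr=0.05, epochs=300, l2=1e-3):
    """Multinomial logistic regression fitted by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of the mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the most probable class index for each row of X."""
    return np.argmax(X @ W + b, axis=1)


def cross_val_accuracy(X, y, n_classes, k=5, seed=0):
    """Mean accuracy over k shuffled folds; features are standardized on each training fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        mu = X[train].mean(axis=0)
        sd = X[train].std(axis=0) + 1e-12
        W, b = fit_softmax((X[train] - mu) / sd, y[train], n_classes)
        y_hat = predict((X[test] - mu) / sd, W, b)
        scores.append(np.mean(y_hat == y[test]))
    return float(np.mean(scores))


if __name__ == "__main__":
    X, y = make_synthetic()
    X = normalize_rows(X)
    print("cross-validation accuracy:", cross_val_accuracy(X, y, n_classes=3))
```

To mimic the augmentation experiment, synthetic rows from a tabular generator could be appended to the training folds only, keeping the held-out fold purely real so that the reported accuracy still reflects performance on measured samples.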