Department of Biotechnology, Sangmyung University, Seoul 03016, the Republic of Korea.
Institute of Intelligence Informatics Technology, Sangmyung University, Seoul 03016, the Republic of Korea.
Forensic Sci Int Genet. 2024 Jul;71:103061. doi: 10.1016/j.fsigen.2024.103061. Epub 2024 May 22.
Poppies are useful plants with medicinal, edible, ornamental, and industrial applications. Some Papaver species are forensically significant because they contain opium, a narcotic substance. Because of this forensic value, internationally trafficked illegal poppy species are being identified by DNA barcoding with multiple markers. However, which markers allow precise species identification of legal and illegal poppies is still under discussion: research on illegal poppies has focused on Papaver somniferum L., and species identification studies of Papaver bracteatum and Papaver setigerum DC. are still lacking. To evaluate the performance of genetic markers and classify DNA sequences in the genus Papaver, this study therefore developed the first machine learning-based two-layer model, in which the first layer classifies a given sequence as a legal or illegal poppy and the second layer identifies the species of illegal poppies from their sequences. We constructed the dataset and investigated biological features from four markers, internal transcribed spacer 1 (ITS1), internal transcribed spacer 2 (ITS2), transfer RNA leucine (trnL), and the transfer RNA leucine-transfer RNA phenylalanine intergenic spacer (trnL-trnF intergenic spacer), and their combination, using four machine learning algorithms: K-nearest neighbor (KNN), Naïve Bayes (NB), extreme gradient boosting (XGBoost), and Random Forest (RF). For Layer 1, which classifies legal and illegal poppies, KNN-based models using the combined ITS region achieved the highest accuracy, 0.846 and 0.889 on the training and test sets, respectively. For Layer 2, which identifies illegal poppy species, KNN-based models using the combined ITS region achieved the best performance, 0.833 and 1.000 on the training and test sets, respectively.
To validate the model, the combined ITS region (ITS1 and ITS2 sequences) from blind poppy samples was used as a case study, in which Layer 1 correctly classified legal and illegal poppies with an accuracy above 0.830. Layer 2 correctly identified P. setigerum DC.; however, only one of the three P. somniferum L. samples was accurately identified. Nevertheless, our research shows that machine learning can classify and identify legal and illegal poppy species from DNA barcodes, and that this approach can serve as an efficient and effective forensic tool for improved law enforcement and a safer society.
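The two-layer design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify which biological features were extracted, so normalized 2-mer frequencies are assumed here, the KNN is a plain Euclidean-distance vote written from scratch, and every sequence in `TOY_RECORDS` is synthetic (not real ITS data). The names `kmer_profile`, `knn_predict`, `TwoLayerModel`, and the `legal_species_*` labels are hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 2  # k-mer length; an assumption made for this illustration
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return normalized k-mer frequencies (assumed feature set) for a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS) or 1  # ignore k-mers with ambiguous bases
    return [counts[m] / total for m in KMERS]


def knn_predict(train, query, n_neighbors=1):
    """Majority vote among the nearest (features, label) pairs by Euclidean distance."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


class TwoLayerModel:
    """Layer 1: legal vs. illegal. Layer 2: species, trained on illegal records only."""

    def __init__(self, records, illegal_species, n_neighbors=1):
        self.n_neighbors = n_neighbors
        illegal = set(illegal_species)
        feats = [(kmer_profile(seq), species) for seq, species in records]
        self.layer1 = [(f, "illegal" if sp in illegal else "legal") for f, sp in feats]
        self.layer2 = [(f, sp) for f, sp in feats if sp in illegal]

    def predict(self, seq):
        features = kmer_profile(seq)
        status = knn_predict(self.layer1, features, self.n_neighbors)
        if status == "legal":
            return status, None  # Layer 2 is only consulted for illegal poppies
        return status, knn_predict(self.layer2, features, self.n_neighbors)


# Synthetic placeholder sequences, invented purely to exercise the code.
TOY_RECORDS = [
    ("ACACACACACACACAC", "P. somniferum"),
    ("ACACACACACACACAT", "P. somniferum"),
    ("GTGTGTGTGTGTGTGT", "P. setigerum"),
    ("GTGTGTGTGTGTGTGA", "P. setigerum"),
    ("AATTAATTAATTAATT", "legal_species_A"),
    ("CCGGCCGGCCGGCCGG", "legal_species_B"),
]

model = TwoLayerModel(TOY_RECORDS, illegal_species={"P. somniferum", "P. setigerum"})
print(model.predict("ACACACACACACACAA"))  # ('illegal', 'P. somniferum')
print(model.predict("AATTAATTAATTAATA"))  # ('legal', None)
```

Training Layer 2 only on illegal records mirrors the description of the second layer as a species identifier for illegal poppies; a sequence that Layer 1 labels legal never reaches it.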