International Centre of Insect Physiology and Ecology (icipe), P.O. Box 30772-00100, Nairobi, Kenya.
Department of Statistics, Jomo Kenyatta University of Agriculture and Technology, P.O. Box 62000-00200, Nairobi, Kenya.
Sci Rep. 2022 May 3;12(1):7208. doi: 10.1038/s41598-022-11258-w.
Analysis of landmark-based morphometric measurements taken on body parts of insects has been a useful taxonomic approach alongside DNA barcoding in insect identification. Statistical analysis of morphometrics has largely been dominated by traditional methods such as principal component analysis (PCA), canonical variate analysis (CVA) and discriminant analysis (DA). However, advances in computing power have created a paradigm shift towards modern tools such as machine learning. Herein, we assess the predictive performance of four machine learning classifiers: K-nearest neighbor (KNN), random forest (RF), support vector machine (SVM; linear, polynomial and radial kernels) and artificial neural network (ANN), on fruit fly morphometrics that were previously analysed using PCA and CVA. KNN and RF performed poorly, with overall model accuracy lower than the "no-information rate" (NIR) (p value > 0.1). The SVM models had a predictive accuracy of > 95%, significantly higher than NIR (p < 0.001), Kappa > 0.78 and an area under the receiver operating characteristic curve (AUC) > 0.91, while the ANN model had a predictive accuracy of 96%, significantly higher than NIR, a Kappa of 0.83 and an AUC of 0.98. Wing veins 2, 3, 8, 10 and 14 and tibia length were of higher importance than the other variables in both the SVM and ANN models. We conclude that SVM and ANN models could be used to discriminate fruit fly species, or other morphologically similar pest taxa, based on wing vein and tibia length measurements. These algorithms are candidates for developing integrated, smart application software for insect discrimination and identification. The variable importance results from this study would help future studies decide what must be measured.
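The evaluation metrics named in the abstract (accuracy, NIR and Kappa) can be illustrated with a minimal Python sketch. The labels below are toy data invented for illustration, not the study's measurements, and the function names are hypothetical. The comparison of accuracy against NIR is shown as a one-sided exact binomial test, which is one common choice; the abstract does not state which test the authors used.

```python
from collections import Counter
from math import comb


def no_information_rate(y_true):
    """Share of the most frequent class, i.e. the accuracy of always guessing it."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = len(y_true)
    p_obs = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if predictions were independent of the true labels.
    p_exp = sum(true_counts[c] * pred_counts.get(c, 0) for c in true_counts) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


def p_value_acc_gt_nir(y_true, y_pred):
    """One-sided exact binomial p-value for H1: accuracy > NIR (assumed test)."""
    n = len(y_true)
    k = sum(t == p for t, p in zip(y_true, y_pred))
    nir = no_information_rate(y_true)
    # Probability of k or more correct predictions if the true accuracy were NIR.
    return sum(comb(n, i) * nir**i * (1 - nir) ** (n - i) for i in range(k, n + 1))


# Toy two-species example (hypothetical labels, not data from the study).
y_true = ["A"] * 6 + ["B"] * 4
y_pred = ["A"] * 5 + ["B"] + ["B"] * 3 + ["A"]

print("accuracy:", accuracy(y_true, y_pred))
print("NIR:", no_information_rate(y_true))
print("kappa:", cohen_kappa(y_true, y_pred))
print("p-value (accuracy > NIR):", p_value_acc_gt_nir(y_true, y_pred))
```

A classifier whose accuracy does not exceed NIR, as reported for KNN and RF, does no better than always predicting the most common species, which is why accuracy alone is a weak summary when classes are imbalanced.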