Riza Lala Septem, Zain Muhammad Iqbal, Izzuddin Ahmad, Prasetyo Yudi, Hidayat Topik, Abu Samah Khyrina Airin Fariza
Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia.
Department of Biology Education, Universitas Pendidikan Indonesia, Bandung, Indonesia.
Heliyon. 2023 Sep 21;9(10):e20161. doi: 10.1016/j.heliyon.2023.e20161. eCollection 2023 Oct.
DNA barcoding has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can distinguish organisms and help classify them into taxa, which makes the approach useful in taxonomic disputes where morphology alone is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to resolve disputes in plant family taxonomy. As a case study we take Leguminosae, which some authors have historically split into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) and others treat as a single family (Leguminosae). The study has four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. The data come from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The validation data were collected from NCBI and were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant internal transcribed spacer (ITS) sequences, in both accuracy and computational cost. We then used the Pearson method to examine the disputed families within Leguminosae. Our results indicate that the Leguminosae in the case study should be grouped into a single family.
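To make the clustering step concrete, the following is a minimal sketch of hierarchical clustering with a Pearson-based distance, not the authors' implementation. The abstract does not state how sequences were converted to numeric vectors or which linkage rule was used, so the sketch assumes overlapping 2-mer counts as the encoding and average linkage for merging. The function names (`kmer_profile`, `pearson_distance`, `average_linkage`) are hypothetical, and the four sequences are invented toy strings, not ITS data from the study.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Count overlapping k-mers over ACGT (an assumed encoding, not from the paper)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip k-mers containing ambiguous bases such as N
            counts[km] += 1
    return [counts[km] for km in kmers]


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # correlation is undefined for a constant vector; treat as uncorrelated
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    n = len(profiles)
    dist = {(i, j): pearson_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def cluster_dist(a, b):
        # Average linkage (assumed): mean pairwise distance between members.
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_dist(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Invented toy sequences: two AT-rich and two GC-rich strings.
toy_sequences = [
    "ATATATTATAATATTA",
    "ATTATATAATATATAT",
    "GCGCGGCCGCGCCGGC",
    "GGCCGCGCGCCGGCGC",
]
profiles = [kmer_profile(s) for s in toy_sequences]
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

On this toy input the two AT-rich strings and the two GC-rich strings fall into separate clusters. For the question posed in the abstract, the analogous check would be whether cutting the tree groups sequences labelled Fabaceae, Caesalpiniaceae, and Mimosaceae into three separate clusters or leaves them intermixed.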