Centre for Genetic Resources, Wageningen University and Research, P.O. Box 16, 6700 AA, Wageningen, The Netherlands.
BMC Bioinformatics. 2021 Mar 31;22(1):173. doi: 10.1186/s12859-021-04018-6.
To address the need for easy and reliable species classification in plant genetic resources collections, we assessed the potential of five classifiers (Random Forest, Neighbour-Joining, 1-Nearest Neighbour, a conservative variety of 3-Nearest Neighbours and Naive Bayes). We investigated the effects of the number of accessions per species and of the misclassification rate on classification success, and validated the generic value of the results with three complete datasets.
We found the conservative variety of 3-Nearest Neighbours to be the most reliable classifier when varying species representation and misclassification rate. Analysis of the three complete datasets showed that this finding has generic value. Additionally, we present various options for marker selection for classification tasks such as these.
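As an illustration of the idea behind a conservative nearest-neighbour classifier, the following minimal Python sketch assigns a species label only when all three nearest reference accessions agree, and otherwise leaves the query unclassified. The abstract does not define the conservative rule, so the unanimity criterion, the Hamming distance on binary marker scores, the function names and the toy data are all assumptions made for this example and are not the authors' implementation.

```python
from collections import Counter


def hamming(a, b):
    """Fraction of marker positions at which two genotype vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def conservative_3nn(query, references, labels, k=3):
    """Return a label only if the k nearest references all share it.

    Assumed rule (not taken from the source): any disagreement among the
    k nearest neighbours yields None, i.e. 'unclassified'.
    """
    # Rank reference accessions by distance to the query (stable sort).
    ranked = sorted(range(len(references)),
                    key=lambda i: hamming(query, references[i]))
    nearest = [labels[i] for i in ranked[:k]]
    label, count = Counter(nearest).most_common(1)[0]
    return label if count == k else None


# Hypothetical toy data: six accessions scored at six binary markers.
references = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
]
labels = ["sp_A", "sp_A", "sp_A", "sp_B", "sp_B", "sp_B"]

clear_query = [0, 0, 1, 1, 0, 0]      # sits inside the sp_A cluster
ambiguous_query = [0, 1, 1, 0, 1, 0]  # lies between the two clusters

print(conservative_3nn(clear_query, references, labels))      # sp_A
print(conservative_3nn(ambiguous_query, references, labels))  # None
```

Under this assumed rule, a single mislabelled reference among the three neighbours blocks the assignment rather than being outvoted or followed, which is one plausible reason such a classifier could remain reliable when the reference labels themselves contain errors.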
Large-scale genomic data are increasingly being produced for genetic resources collections. These data are useful for addressing species classification issues regarding crop wild relatives and for improving genebank documentation. Implementing a classification method that can improve the quality of error-prone datasets without gold standard training data is considered an innovative and efficient way to improve genebank documentation.