College of Animal Science and Technology, China Agricultural University, Beijing 100193, China.
Yi Chuan. 2024 Jul;46(7):530-539. doi: 10.16288/j.yczz.24-059.
Accurate breed classification is required for the conservation and utilization of farm animal genetic resources. Traditional classification methods rely mainly on phenotypic characterization. However, highly similar breeds are difficult to distinguish because their phenotypic traits are hard to characterize. Machine learning algorithms offer distinct advantages for breed classification based on genomic information. To evaluate classification methods for Chinese cattle breeds, this study used genomic SNP data from 213 individuals across seven Chinese local breeds and compared the classification accuracies of three feature selection methods (F value sorting and screening, mRMR, and Relief-F) and three machine learning algorithms (Random Forest, Support Vector Machine (SVM), and Naive Bayes (NB)). The results showed that: 1) when more than 1500 SNPs were selected with the F method, or more than 1000 SNPs with the mRMR algorithm, the SVM classifier achieved a classification accuracy above 99.47%; 2) the most effective algorithm was SVM, followed by NB, while the best SNP selection methods were F and mRMR, followed by Relief-F; 3) misclassification often occurred between highly similar breeds. This study demonstrates that machine learning classification models combined with genomic data are effective for classifying local cattle breeds, providing a technical basis for the rapid and accurate classification of cattle breeds in China.
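The abstract does not say how the F value for each SNP was computed. As a minimal sketch, assuming it is the one-way ANOVA F statistic of genotype (coded 0/1/2) across breed groups, the sorting and screening step could look like the following. The function names `anova_f` and `top_snps_by_f`, the genotype coding, and the toy data are hypothetical illustrations, not the study's pipeline or data.

```python
from statistics import mean

def anova_f(genotypes, breeds):
    """One-way ANOVA F statistic for a single SNP across breed groups.

    Assumes at least two breeds and more individuals than breeds.
    """
    groups = {}
    for g, b in zip(genotypes, breeds):
        groups.setdefault(b, []).append(g)
    n, k = len(genotypes), len(groups)
    grand = mean(genotypes)
    # Variation of breed means around the overall mean.
    ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    # Variation of individuals around their own breed mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    if ss_within == 0:
        # SNP is fixed within every breed: perfectly separating or uninformative.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_snps_by_f(matrix, breeds, n_keep):
    """Rank SNP columns by F value (descending) and keep the first n_keep.

    matrix: one row per individual, one column per SNP (hypothetical 0/1/2 coding).
    Returns the kept column indices and the F value of every SNP.
    """
    n_snps = len(matrix[0])
    scores = [anova_f([row[j] for row in matrix], breeds) for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep], scores

# Toy example: 6 individuals, 2 breeds, 3 SNPs (made-up values).
breeds = ["A", "A", "A", "B", "B", "B"]
matrix = [
    [0, 1, 2],
    [0, 0, 1],
    [1, 2, 0],
    [2, 1, 1],
    [2, 0, 2],
    [1, 2, 0],
]
selected, scores = top_snps_by_f(matrix, breeds, n_keep=1)
print(selected, scores)
```

In this toy matrix only the first SNP differs in mean genotype between the two breeds, so it is the one retained. In a real analysis the retained columns would then be passed to a classifier such as an SVM, with the selection repeated inside each cross-validation fold so that test individuals do not influence which SNPs are kept.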