Li Weinan, Zhang Mingjun, Fan Jingchao, Yang Zhaoen, Peng Jun, Zhang Jianhua, Lan Yubin, Chai Mao
Nanfan Research Institute, Chinese Academy of Agricultural Sciences, Sanya, 572024, Hainan, China.
College of Electronic Engineering (College of Artificial Intelligence), South China Agricultural University, Guangzhou, 510642, Guangdong, China.
Theor Appl Genet. 2025 Jan 24;138(1):36. doi: 10.1007/s00122-025-04821-2.
Cotton is an important crop for fiber production, but the genetic basis of key agronomic traits, such as fiber quality and flowering days, remains complex. While machine learning (ML) has shown great potential in uncovering the genetic architecture of complex traits in other crops, its application in cotton has been limited. Here, we applied five machine learning models (AdaBoost, Gradient Boosting Regressor, LightGBM, Random Forest, and XGBoost) to identify loci associated with fiber quality and flowering days in cotton. We compared two methods of down-sampling the SNP dataset for model training and found that selecting SNPs with an Fscale value greater than 0 yielded higher model accuracy than selecting SNPs at random. We then performed machine learning quantitative trait locus (mlQTL) analysis for 13 traits related to fiber quality and flowering days. Comparing these mlQTLs with loci identified through genome-wide association studies (GWAS) showed that the machine learning approach not only confirmed known loci but also identified novel QTLs. Additionally, we evaluated the effect of population size on model accuracy and found that larger populations gave better predictive performance. Finally, we proposed candidate genes for the identified mlQTLs, including two argonaute 5 proteins, Gh_A09G104100 and Gh_A09G104400, for the FL3/FS2 locus, as well as GhFLA17 and Syntaxin-121 (Gh_D09G143700) for the FSD09_2/FED09_2 locus. Our findings demonstrate the efficacy of machine learning in improving the identification of genetic loci in cotton and provide valuable insights for cotton breeding strategies.
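The comparison of the two SNP down-sampling strategies can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not define how Fscale is computed, so the per-SNP scores below are random placeholders, and the function names (`select_by_score`, `select_random`, `subset_genotypes`) and the toy genotype matrix are hypothetical. The sketch only shows how a score-threshold subset and a size-matched random subset might be built so that models trained on each can be compared fairly.

```python
import random


def select_by_score(scores, threshold=0.0):
    """Return indices of SNPs whose precomputed score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def select_random(n_snps, k, seed=0):
    """Return k SNP indices drawn uniformly without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_snps), k))


def subset_genotypes(genotypes, keep):
    """Keep only the chosen SNP columns of a samples x SNPs matrix."""
    return [[row[j] for j in keep] for row in genotypes]


# Toy data (assumption): 6 accessions, 20 SNPs coded as 0/1/2 allele counts.
rng = random.Random(42)
n_samples, n_snps = 6, 20
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_snps)]
             for _ in range(n_samples)]

# Placeholder per-SNP scores standing in for Fscale (assumption: the real
# values would come from the study's own scoring step, not random numbers).
scores = [rng.uniform(-1.0, 1.0) for _ in range(n_snps)]

# Strategy 1: keep SNPs with score > 0.
kept_by_score = select_by_score(scores, threshold=0.0)

# Strategy 2: random subset of the same size, so both training sets
# contain an equal number of SNPs.
kept_random = select_random(n_snps, len(kept_by_score), seed=1)

X_score = subset_genotypes(genotypes, kept_by_score)
X_random = subset_genotypes(genotypes, kept_random)

print(len(kept_by_score), len(kept_random))
```

In a fuller workflow, each subset would be passed to the same regression model and the held-out accuracy compared. Matching the subset sizes keeps the number of SNPs from confounding that comparison.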