Wu Zefeng, Sun Yali, Zhao Xiaoqiang, Liu Zigang, Zhou Wenqi, Niu Yining
State Key Laboratory of Aridland Crop Science, Gansu Agricultural University, No. 1 Yingmen Village, Anning District, Lanzhou 730070, Gansu Province, China.
Crop Research Institute, Gansu Academy of Agricultural Sciences, No. 1, New Village, Anning District, Lanzhou 730070, Gansu Province, China.
NAR Genom Bioinform. 2024 Dec 27;6(4):lqae184. doi: 10.1093/nargab/lqae184. eCollection 2024 Dec.
Research on the dynamic expression of genes in plants is important for understanding different biological processes. We used the large amounts of publicly available transcriptomic data from various plant sample sources to investigate whether the expression levels of a subset of highly variable genes (HVGs) can be used to accurately identify plant phenotypes. Using maize (Zea mays L.) as an example, we built machine learning (ML) models to predict phenotypes from a gene expression dataset of 21 612 bulk RNA sequencing samples. We showed that the ML models achieved excellent prediction accuracy using only the HVGs to identify different phenotypes, including tissue types, developmental stages, cultivars and stress conditions. The ML models also identified several important functional genes associated with different phenotypes. We performed a similar analysis in rice (Oryza sativa L.) and found that the ML approach could be generalized across species. However, the models trained on maize did not perform well when applied to rice, probably because of the expression divergence of the conserved HVGs between the two species. Overall, our results provide an ML framework for phenotype prediction using gene expression profiles, which may contribute to precision management of crops in agricultural practices.
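The general idea of predicting a phenotype from HVG expression alone can be illustrated with a minimal sketch. The abstract does not specify how HVGs were defined or which ML models were used, so everything here is an assumption made for illustration: HVGs are taken as the genes with the highest variance across training samples, a nearest-centroid classifier stands in for the ML model, and the data are synthetic (two hypothetical tissue labels, two informative genes, eight low-variance genes).

```python
import random
from statistics import mean, pvariance


def select_hvgs(expr, n_top):
    """Return sorted indices of the n_top genes with the highest variance.

    expr is a list of samples; each sample is a list of per-gene values.
    Ranking by plain variance is an assumption, not the paper's criterion.
    """
    n_genes = len(expr[0])
    variances = [pvariance([row[g] for row in expr]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


def fit_centroids(expr, labels, genes):
    """Compute the mean expression vector per label, restricted to `genes`."""
    groups = {}
    for row, lab in zip(expr, labels):
        groups.setdefault(lab, []).append([row[g] for g in genes])
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in groups.items()}


def predict(centroids, row, genes):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    x = [row[g] for g in genes]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def make_toy_data(n_per_class=30, seed=0):
    """Synthetic stand-in for an expression matrix (not real maize data)."""
    rng = random.Random(seed)
    expr, labels = [], []
    for lab, shift in (("leaf", 0.0), ("root", 4.0)):
        for _ in range(n_per_class):
            # Genes 0 and 1 differ between the two labels.
            informative = [rng.gauss(5.0 + shift, 0.5), rng.gauss(9.0 - shift, 0.5)]
            # Genes 2..9 barely vary and carry no label information.
            background = [rng.gauss(3.0, 0.1) for _ in range(8)]
            expr.append(informative + background)
            labels.append(lab)
    return expr, labels


expr, labels = make_toy_data()

# Hold out every third sample for evaluation.
train_idx = [i for i in range(len(expr)) if i % 3 != 0]
test_idx = [i for i in range(len(expr)) if i % 3 == 0]
train_expr = [expr[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]

# Select HVGs on the training samples only, so held-out data stay unseen.
hvgs = select_hvgs(train_expr, n_top=2)
centroids = fit_centroids(train_expr, train_labels, hvgs)

predictions = [predict(centroids, expr[i], hvgs) for i in test_idx]
accuracy = mean(1.0 if p == labels[i] else 0.0 for p, i in zip(predictions, test_idx))
print("HVG indices:", hvgs)
print("held-out accuracy:", accuracy)
```

In this toy setting the variance ranking recovers the two informative genes, and the classifier uses only those. The cross-species result in the abstract corresponds to fitting `centroids` on one species and calling `predict` on samples from another, which can fail when the shared HVGs are expressed differently between the two.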