College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, 030024, China.
Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518000, China.
BMC Genomics. 2024 Apr 29;25(1):418. doi: 10.1186/s12864-024-10258-6.
Plant specialized (or secondary) metabolites (PSM), also known as phytochemicals, natural products, or plant constituents, play essential roles in interactions between plants and their environment. Although many research efforts have focused on discovering novel metabolites and their biosynthetic genes, the resolution of metabolic pathways and the identification of biosynthetic genes have been limited by rudimentary analysis approaches and the enormous number of candidate genes.
Here we integrated the state-of-the-art automated machine learning (ML) framework AutoGluon-Tabular with multi-omics data from Arabidopsis to predict genes encoding enzymes involved in the biosynthesis of PSM, focusing on the three main PSM categories: terpenoids, alkaloids, and phenolics. We found that genomic and proteomic features were the two feature categories contributing most to model performance. Using only these key features, we built a new model in Arabidopsis that performed better than models built with more features, including those related to transcriptomics and epigenomics. Finally, the models were validated in maize and tomato; models tested on maize and trained with data from the two other species performed as well as or better than intraspecies predictions.
Our external validation results in grape and poppy imply that our model is applicable to other species, and also show enormous potential for improving the prediction of PSM-synthesizing enzymes by including valid data from a wider range of species.
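The comparison of feature categories reported above can be illustrated with a small sketch. This is not the authors' pipeline and does not use AutoGluon-Tabular: it assumes a toy nearest-centroid classifier, synthetic data, and a grouped permutation importance in which all columns of one category are shuffled together and the resulting drop in accuracy is recorded. The feature names (`genomic_0`, `proteomic_0`, `transcriptomic_0`), the effect sizes, and all function names are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def make_toy_genes(n, rng):
    """Synthetic genes (assumption): genomic and proteomic features carry
    signal about the label, the transcriptomic feature is pure noise."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = hypothetical PSM enzyme gene
        rows.append({
            "genomic_0": rng.gauss(2.0 * y, 1.0),
            "proteomic_0": rng.gauss(1.5 * y, 1.0),
            "transcriptomic_0": rng.gauss(0.0, 1.0),
        })
        labels.append(y)
    return rows, labels


def fit_centroids(rows, labels):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = {k: mean(r[k] for r in members) for k in rows[0]}
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((row[k] - c[k]) ** 2 for k in row)
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def accuracy(centroids, rows, labels):
    return mean(1.0 if predict(centroids, r) == y else 0.0
                for r, y in zip(rows, labels))


def category_importance(centroids, rows, labels, rng, repeats=5):
    """Mean accuracy drop when all columns of one category are permuted
    jointly across rows (grouped permutation importance)."""
    baseline = accuracy(centroids, rows, labels)
    categories = sorted({k.split("_")[0] for k in rows[0]})
    result = {}
    for cat in categories:
        cols = [k for k in rows[0] if k.split("_")[0] == cat]
        drops = []
        for _ in range(repeats):
            order = list(range(len(rows)))
            rng.shuffle(order)
            # Copy each row, replacing the category's columns with another row's values.
            shuffled = [dict(r, **{k: rows[j][k] for k in cols})
                        for r, j in zip(rows, order)]
            drops.append(baseline - accuracy(centroids, shuffled, labels))
        result[cat] = mean(drops)
    return result


rng = random.Random(0)
train_rows, train_labels = make_toy_genes(400, rng)
test_rows, test_labels = make_toy_genes(300, rng)
model = fit_centroids(train_rows, train_labels)
importance = category_importance(model, test_rows, test_labels, rng)
for category, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{category}: mean accuracy drop {drop:.3f}")
```

On this synthetic data the categories built to carry signal should show a larger accuracy drop than the noise category. With real multi-omics tables, the same grouping idea could be applied on top of whatever model the ML framework selects, and a reduced model could then be retrained on the top-ranked categories only.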