Moore Bethany M, Wang Peipei, Fan Pengxiang, Leong Bryan, Schenck Craig A, Lloyd John P, Lehti-Shiu Melissa D, Last Robert L, Pichersky Eran, Shiu Shin-Han
Department of Plant Biology, Michigan State University, East Lansing, MI 48824.
Ecology, Evolutionary Biology, and Behavior Program, Michigan State University, East Lansing, MI 48824.
Proc Natl Acad Sci U S A. 2019 Feb 5;116(6):2344-2353. doi: 10.1073/pnas.1817074116. Epub 2019 Jan 23.
Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
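The true positive and true negative rates quoted above are per-class accuracies: the share of SM genes correctly called SM, and the share of GM genes correctly called GM. A minimal sketch of that calculation follows. It assumes binary labels (1 = SM, 0 = GM) and uses invented toy labels, not data from the study; the function name is hypothetical and the snippet says nothing about how the authors' model itself was built.

```python
def true_positive_and_negative_rates(y_true, y_pred):
    """Return (TPR, TNR) for binary labels, where 1 = SM gene and 0 = GM gene."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SM predicted as SM
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SM predicted as GM
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GM predicted as GM
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GM predicted as SM
    # Guard against an empty class so the ratio is never a division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy labels invented for illustration only (assumption, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tpr, tnr = true_positive_and_negative_rates(y_true, y_pred)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```

Reporting the two rates separately, rather than a single accuracy figure, shows how the model performs on each class even when SM and GM benchmark sets differ in size.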