Department of Biology, University of Waterloo, Waterloo, ON, Canada.
Genome Biol Evol. 2023 Jul 3;15(7). doi: 10.1093/gbe/evad129.
C4 photosynthesis is known to have at least 61 independent origins across plant lineages, making it one of the most notable examples of convergent evolution. Of these origins, a predicted 22-24, encompassing more than 50% of all known C4 species, lie within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade therefore offers species well suited to the study of genomic changes associated with the acquisition of the C4 photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes that carry information distinguishing C3 from C4 species in the PACMAD clade. We demonstrate that certain plastid-encoded protein sequences carry enough distinguishing information to train accurate ML C3/C4 classification models. A classifier trained on RbcL sequences, for example, achieves greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant, and potentially differing, roles for these sequence products in C3 versus C4 metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C3/C4 status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within them, further highlighting their potential significance in the evolution and/or maintenance of C4 photosynthetic machinery. This general approach can be applied to uncover intricate associations in other, similar genotype-phenotype relationships.
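The abstract does not say which model or feature encoding was used, so the following is only a minimal sketch of the general idea of asking which aligned sites carry C3/C4 information. It swaps in a simple per-site mutual information score rather than a trained classifier, and it should not be read as the authors' method. The function names `site_mutual_information` and `rank_sites` are hypothetical, and the five-residue sequences are invented toy data, not real RbcL alignments.

```python
from collections import Counter
from math import log2


def site_mutual_information(aligned_seqs, labels):
    """Return, for each alignment column, the mutual information (in bits)
    between the residue at that column and the photosynthetic-type label."""
    if len(aligned_seqs) != len(labels):
        raise ValueError("need exactly one label per sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    n = len(aligned_seqs)
    label_counts = Counter(labels)
    scores = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned_seqs]
        residue_counts = Counter(column)
        joint_counts = Counter(zip(column, labels))
        mi = 0.0
        for (residue, label), count in joint_counts.items():
            p_joint = count / n
            p_residue = residue_counts[residue] / n
            p_label = label_counts[label] / n
            mi += p_joint * log2(p_joint / (p_residue * p_label))
        scores.append(mi)
    return scores


def rank_sites(scores):
    """Return 0-based column indices, most informative first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Invented toy alignment (assumption): column 1 separates the two classes
# perfectly, columns 0 and 3 are invariant, columns 2 and 4 are partly informative.
toy_seqs = ["MSAKT", "MSAKT", "MSGKT", "MAGKS", "MAGKT", "MAGKS"]
toy_labels = ["C3", "C3", "C3", "C4", "C4", "C4"]

scores = site_mutual_information(toy_seqs, toy_labels)
ranking = rank_sites(scores)
print("sites ranked by C3/C4 information:", ranking)
```

A score like this only flags columns whose residues co-vary with the label. It does not correct for shared ancestry among species, which matters in a clade with many related C4 lineages, so it is a screening aid at best.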