Yocca Alan E, Edger Patrick P
Dep. of Plant Biology, Michigan State Univ., East Lansing, MI, 48824, USA.
Dep. of Horticulture, Michigan State Univ., East Lansing, MI, 48824, USA.
Plant Genome. 2022 Mar;15(1):e20135. doi: 10.1002/tpg2.20135. Epub 2021 Sep 17.
A gene in a given taxonomic group is either present in every individual (core) or absent from at least one individual (dispensable). Previous pangenomic studies have identified functional differences between core and dispensable genes. However, determining whether a gene belongs to the core or dispensable portion of the genome requires constructing a pangenome, which involves sequencing the genomes of many individuals. Here we aim to leverage the previously characterized core and dispensable gene content of two grass species [Brachypodium distachyon (L.) P. Beauv. and Oryza sativa L.] to construct a machine learning model capable of accurately classifying genes as core or dispensable using only a single annotated reference genome. Such a model may mitigate the need for pangenome construction, an expensive hurdle especially in orphan crops, which often lack adequate genomic resources.
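The core/dispensable definition given in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline or their machine learning model. It only shows how labels would be derived once a presence/absence table across individuals is available, which is the step that requires a pangenome. The function name `classify_genes`, the accession names, and the gene names are hypothetical and chosen for illustration.

```python
from typing import Dict, Set


def classify_genes(presence: Dict[str, Set[str]], individuals: Set[str]) -> Dict[str, str]:
    """Label a gene 'core' if every individual carries it, else 'dispensable'.

    presence maps each gene to the set of individuals in which it was found.
    individuals is the full set of sequenced individuals in the pangenome.
    """
    labels = {}
    for gene, carriers in presence.items():
        # A gene is core only when the full set of individuals is a subset of its carriers.
        labels[gene] = "core" if individuals <= carriers else "dispensable"
    return labels


# Hypothetical toy data: three accessions and three genes.
individuals = {"acc1", "acc2", "acc3"}
presence = {
    "geneA": {"acc1", "acc2", "acc3"},  # present in every accession
    "geneB": {"acc1", "acc3"},          # absent from acc2
    "geneC": {"acc2"},                  # absent from acc1 and acc3
}

labels = classify_genes(presence, individuals)
print(labels)
```

Labels produced this way are the kind of ground truth a classifier would be trained on. The model proposed in the abstract aims to predict them from a single annotated reference genome, without the presence/absence table.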