Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois, USA.
Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA.
mSystems. 2023 Aug 31;8(4):e0005823. doi: 10.1128/msystems.00058-23. Epub 2023 Jun 14.
Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%-90% of all genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including "hypothetical proteins," was accurately predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1 = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.
IMPORTANCE
Having the ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%-90% of all publicly available genomes. Overall, the results show that a large portion of the variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
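The three ingredients named in the abstract (k-mer features from conserved genes, a 10%-90% prevalence filter on protein families, and a per-genome macro F1 score) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the choice of `k`, the inclusive prevalence bounds, and the reading of "macro F1" as the unweighted mean of the F1 scores for the present and absent classes are all assumptions made for illustration. Training of the gradient boosting classifiers is omitted.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping nucleotide k-mers; k-mers with non-ACGT symbols are skipped.

    The abstract does not state the k-mer length, so `k` is left to the caller.
    """
    seq = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def variable_families(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    low: float = 0.10,
    high: float = 0.90,
) -> List[str]:
    """Return families found in between `low` and `high` of all genomes.

    `presence` maps a (hypothetical) family ID to the set of genome IDs that
    carry it. Inclusive bounds are an assumption.
    """
    return sorted(
        family
        for family, genomes in presence.items()
        if low <= len(genomes) / n_genomes <= high
    )


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the F1 scores for the present (1) and absent (0) classes."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


if __name__ == "__main__":
    # Toy data only; none of these values come from the study.
    print(kmer_counts("ACGTAC", 3))
    toy_presence = {
        "fam_core": {f"g{i}" for i in range(10)},  # in 10/10 genomes, dropped
        "fam_variable": {"g0", "g1", "g2", "g3", "g4"},  # in 5/10 genomes, kept
        "fam_rare": set(),  # in 0/10 genomes, dropped
    }
    print(variable_families(toy_presence, n_genomes=10))
    # One genome: true vs. predicted presence across four families.
    print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

Under these assumptions, each genome would contribute one feature vector built from the k-mer counts of its conserved genes, each retained family would get its own binary classifier, and `macro_f1` would be computed once per genome across all family predictions before averaging over genomes.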