Predicting variable gene content in Escherichia coli using conserved genes.

Affiliations

Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois, USA.

Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA.

Publication information

mSystems. 2023 Aug 31;8(4):e0005823. doi: 10.1128/msystems.00058-23. Epub 2023 Jun 14.

DOI:10.1128/msystems.00058-23

PMID:37314210

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10469788/

Abstract

The ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and one classifier was built per protein family to predict the presence or absence of each family occurring in 10%-90% of all genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including "hypothetical proteins," was accurately predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1 = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.

Importance

The ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%-90% of all publicly available genomes. Overall, the results show that a large portion of the variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
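The two ingredients the abstract names, nucleotide k-mer features from conserved genes and a per-genome macro F1 score, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the value of k is left to the caller because the abstract does not state it, the classifier training step is omitted entirely, and scoring an empty class as 0.0 is an assumed convention.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping nucleotide k-mers, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(conserved_gene_seqs, k):
    """Pool k-mer counts over a genome's conserved genes into a fixed-order vector.

    The vector has one entry per possible k-mer (4**k entries), in
    lexicographic order over the alphabet A, C, G, T.
    """
    total = Counter()
    for seq in conserved_gene_seqs:
        total.update(kmer_counts(seq, k))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [total[kmer] for kmer in vocab]


def class_f1(y_true, y_pred, cls):
    """F1 for one class; returns 0.0 when the class never occurs (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for the 'absent' (0) and 'present' (1) classes."""
    return (class_f1(y_true, y_pred, 0) + class_f1(y_true, y_pred, 1)) / 2


if __name__ == "__main__":
    # Toy genome with two short "conserved gene" fragments (made-up sequences).
    print(feature_vector(["ACGTAC", "GGCA"], k=1))
    # One genome: true vs. predicted presence/absence across four protein families.
    print(round(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]), 4))
```

In a setup like the one described, each genome would contribute one feature vector, each protein family would get its own binary classifier trained on those vectors, and `macro_f1` would be applied to the vector of per-family predictions for a single genome before averaging across genomes.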


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/696b/10469788/887e7dbc191a/msystems.00058-23.f001.jpg
