Accurate and fast graph-based pangenome annotation and clustering with ggCaller.

Affiliations

MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London W12 0BZ, United Kingdom;

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom.

Publication information

Genome Res. 2023 Sep;33(9):1622-1637. doi: 10.1101/gr.277733.123. Epub 2023 Aug 24.

Abstract

Bacterial genomes differ in both gene content and sequence mutations, which underlie extensive phenotypic diversity, including variation in susceptibility to antimicrobials or vaccine-induced immunity. To identify and quantify important variants, all genes within a population must be predicted, functionally annotated, and clustered, representing the "pangenome." Despite the volume of genome data available, gene prediction and annotation are currently conducted in isolation on individual genomes, which is computationally inefficient and frequently inconsistent across genomes. Here, we introduce the open-source software graph-gene-caller (ggCaller). ggCaller combines gene prediction, functional annotation, and clustering into a single workflow using population-wide de Bruijn graphs, removing redundancy in gene annotation and resulting in more accurate gene predictions and orthologue clustering. We applied ggCaller to simulated and real-world bacterial data sets containing hundreds or thousands of genomes, comparing it to current state-of-the-art tools. ggCaller has considerable speed-ups with equivalent or greater accuracy, particularly with data sets containing complex sources of error, such as assembly contamination or fragmentation. ggCaller is also an important extension to bacterial genome-wide association studies, enabling querying of annotated graphs for functional analyses. We highlight this application by functionally annotating DNA sequences with significant associations to tetracycline and macrolide resistance in Streptococcus pneumoniae, identifying key resistance determinants that were missed when using only a single reference genome. ggCaller is a novel bacterial genome analysis tool with applications in bacterial evolution and epidemiology.
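To make the idea of a population-wide de Bruijn graph concrete, the following is a minimal toy sketch and not ggCaller's implementation. It assumes plain-string genomes, ignores reverse complements, and uses hypothetical names (`build_colored_dbg`, `edges`, `genome_A`, `genome_B`). Each k-mer is stored once and tagged with the genomes that contain it, so sequence shared across the population is not duplicated.

```python
from collections import defaultdict


def build_colored_dbg(genomes, k):
    """Map each k-mer to the set of genome names ("colours") containing it.

    Toy simplification: forward strand only, no reverse complements.
    """
    kmer_colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer_colors[seq[i:i + k]].add(name)
    return dict(kmer_colors)


def edges(kmer_colors):
    """Return (kmer, successor) pairs whose sequences overlap by k-1 bases."""
    found = []
    for kmer in kmer_colors:
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in kmer_colors:
                found.append((kmer, successor))
    return found


# Two hypothetical genomes that differ by a single base near the end.
genomes = {
    "genome_A": "ATGGCGTAA",
    "genome_B": "ATGGCGTGA",
}
K = 4

graph = build_colored_dbg(genomes, K)

# k-mers present in every genome are stored once but carry both colours.
shared = sorted(kmer for kmer, colors in graph.items() if len(colors) == len(genomes))

# Number of k-mers if each genome were processed in isolation.
total_kmers = sum(len(seq) - K + 1 for seq in genomes.values())

print(f"{len(graph)} unique k-mers in the graph versus {total_kmers} across separate genomes")
print("shared k-mers:", shared)
print("edges:", len(edges(graph)))
```

In this toy case the two genomes share their first four k-mers, and the graph branches only where the sequences diverge. That is the property that lets a graph-based workflow handle a shared region once rather than once per genome.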

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70e4/10620059/e773cf92b20f/1622f01.jpg
