Marini Simone, Boucher Christina, Noyes Noelle, Prosperi Mattia
Department of Epidemiology, University of Florida, Gainesville, FL, United States.
Department of Pathology, University of Florida, Gainesville, FL, United States.
Front Microbiol. 2023 Mar 7;14:1060891. doi: 10.3389/fmicb.2023.1060891. eCollection 2023.
Characterization of antibiotic resistance genes (ARGs) from high-throughput sequencing data of metagenomics and cultured bacterial samples is a challenging task that must account for both computational (e.g., string algorithms) and biological (e.g., gene transfers, rearrangements) aspects. Curated ARG databases exist, together with assorted ARG classification approaches (e.g., database alignment, machine learning). Besides ARGs that occur naturally in bacterial strains or are acquired through mobile elements, there are chromosomal genes that can render a bacterium resistant to antibiotics through point mutations, i.e., ARG variants (ARGVs). While ARG repositories also collect ARGVs, only a few tools can identify ARGVs from metagenomics and high-throughput sequencing data, and they have a number of limitations (e.g., pre-assembly, verification of mutations, or specification of species). In this work we present the k-mer (i.e., strings of fixed length k) ARGV analyzer, KARGVA, an open-source, multi-platform tool that provides: (i) an ad hoc, large ARGV database derived from multiple sources; (ii) input capability for various types of high-throughput sequencing data; (iii) a three-way, hash-based k-mer search setup to process data efficiently, linking k-mers to ARGVs, k-mers to point mutations, and ARGVs to k-mers, respectively; (iv) a statistical filter on sequence classification to reduce type I and II errors. On semi-synthetic data, KARGVA provides very high accuracy even in the presence of high sequencing error or mutation rates (99.2% and 86.6% accuracy at 1% and 5% base change rates, respectively) and genome rearrangements (98.2% accuracy), with robust performance on false positive sets. On data from the worldwide MetaSUB consortium, comprising 3,700+ metagenomics experiments, KARGVA identifies more ARGVs than Resistance Gene Identifier (4.8x) and PointFinder (6.8x), yet all predictions are below the expected false positive estimates. The prevalence of ARGVs is correlated with that of ARGs, but ecological characteristics do not explain ARGV variance well. KARGVA is publicly available at https://github.com/DataIntellSystLab/KARGVA under the MIT license.
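The three-way, hash-based lookup in point (iii) can be illustrated with a minimal Python sketch. It is not KARGVA's implementation: the function names (`build_index`, `classify`), the `min_hits` threshold, the toy sequences, and the choice of k = 5 are all assumptions made for illustration, and the statistical filter in point (iv) is not modeled. The sketch builds three hash maps (k-mers to ARGVs, k-mers to the point mutation they span, and ARGVs to their k-mers) and reports a read only if at least one of its k-mers covers the resistance-conferring mutation.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(argvs, k):
    """Build the three hash maps.

    argvs maps a (hypothetical) ARGV id to (sequence, 0-based mutation position).
    """
    kmer_to_argvs = defaultdict(set)      # k-mer -> ARGVs containing it
    kmer_to_mutations = defaultdict(set)  # k-mer -> (ARGV, position) it spans
    argv_to_kmers = defaultdict(set)      # ARGV -> all of its k-mers
    for argv_id, (seq, mut_pos) in argvs.items():
        for start, kmer in kmers(seq, k):
            kmer_to_argvs[kmer].add(argv_id)
            argv_to_kmers[argv_id].add(kmer)
            if start <= mut_pos < start + k:
                kmer_to_mutations[kmer].add((argv_id, mut_pos))
    return kmer_to_argvs, kmer_to_mutations, argv_to_kmers


def classify(read, index, k, min_hits=2):
    """Return (ARGV id, fraction of that ARGV's k-mers seen in the read), or None.

    A read is reported only if it shares at least min_hits k-mers with an ARGV
    and at least one shared k-mer spans that ARGV's point mutation.
    """
    kmer_to_argvs, kmer_to_mutations, argv_to_kmers = index
    hits = defaultdict(int)
    mutation_seen = set()
    read_kmers = set()
    for _, kmer in kmers(read, k):
        read_kmers.add(kmer)
        for argv_id in kmer_to_argvs.get(kmer, ()):
            hits[argv_id] += 1
        for argv_id, _pos in kmer_to_mutations.get(kmer, ()):
            mutation_seen.add(argv_id)
    candidates = [a for a in hits if hits[a] >= min_hits and a in mutation_seen]
    if not candidates:
        return None
    best = max(candidates, key=lambda a: (hits[a], a))  # deterministic tie-break
    coverage = len(read_kmers & argv_to_kmers[best]) / len(argv_to_kmers[best])
    return best, coverage


# Toy data (invented): a wild-type sequence and a variant with S -> L at index 6.
WILD_TYPE = "MKTALLSDAVHGQ"
VARIANT = "MKTALLLDAVHGQ"
K = 5
INDEX = build_index({"toy_S7L": (VARIANT, 6)}, K)

if __name__ == "__main__":
    print(classify("TALLLDAVH", INDEX, K))  # read spanning the mutated residue
    print(classify(WILD_TYPE, INDEX, K))    # wild-type read: flanks match only
```

In this toy setup the wild-type read still shares flanking k-mers with the variant, but none of them spans the mutated position, so it is not reported. Requiring mutation-spanning evidence is what separates ARGV detection from plain gene-level ARG matching.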