Alipanahi Bahar, Muggli Martin D, Jundi Musa, Noyes Noelle R, Boucher Christina
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA.
Bioinformatics. 2021 Apr 1;36(22-23):5275-5281. doi: 10.1093/bioinformatics/btaa081.
Metagenomics refers to the study of complex samples containing the genetic content of multiple individual organisms and has therefore been used to elucidate the microbiome and resistome of such samples. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be used to 'fingerprint' specific organisms and genes within the microbiome and resistome and to trace their movement across samples. However, using SNPs for this kind of traceability requires a scalable and accurate metagenomics SNP caller. Moreover, such an SNP caller should not rely on reference genomes, since 95% of microbial species are unculturable, which makes determining a reference genome extremely challenging. In this article, we address this need.
We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari identifies SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%), whereas the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation that span up to 97.8% of genes in the datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets.
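To illustrate the general idea behind reference-free SNP calling on a de Bruijn graph, the following is a minimal sketch, not LueVari's implementation. It uses a plain (uncolored) de Bruijn graph with (k-1)-mers as nodes and assumes error-free reads with an isolated SNP, which shows up as a "bubble": two non-branching paths that leave one node and rejoin after k-1 edges. The function names `build_graph`, `walk` and `find_snp_bubbles` are hypothetical, and the read coloring that LueVari uses to resolve repeats is omitted.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Map each (k-1)-mer node to the set of successor nodes seen in the reads."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
    return succ


def walk(succ, node, steps):
    """Follow a non-branching path for `steps` edges; return the nodes or None."""
    path = [node]
    for _ in range(steps):
        nxt = succ.get(path[-1], ())
        if len(nxt) != 1:
            return None  # dead end or another branch: not a simple bubble
        path.append(next(iter(nxt)))
    return path


def find_snp_bubbles(reads, k):
    """Return pairs of sequences that differ by one base and form a simple bubble."""
    succ = build_graph(reads, k)
    bubbles = []
    for node, children in list(succ.items()):
        if len(children) != 2:
            continue
        a, b = sorted(children)
        # An isolated SNP alters k-1 consecutive nodes, so both branches
        # should reach the same node after k-1 further edges.
        path_a = walk(succ, a, k - 1)
        path_b = walk(succ, b, k - 1)
        if path_a and path_b and path_a[-1] == path_b[-1]:
            seq_a = node + "".join(n[-1] for n in path_a)
            seq_b = node + "".join(n[-1] for n in path_b)
            bubbles.append((seq_a, seq_b))
    return bubbles
```

Calling `find_snp_bubbles(reads, k)` on reads from two strains that differ at a single position returns the two flanked alleles, with the variant base at index k-1 of each sequence. Real metagenomic data contain sequencing errors, uneven coverage and repeats, which a practical caller has to handle and this sketch does not.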
Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari.
Supplementary data are available at Bioinformatics online.