Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada.
Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i30-i38. doi: 10.1093/bioinformatics/btae252.
Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remain a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short- and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, or as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry), while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses.
Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.
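As a reading aid for the minimum error correction (MEC) objective named in the abstract, the following is a minimal sketch of how an MEC score can be computed for a given partition of reads. It is not Floria's implementation: the read representation (a dict from variant position to observed allele), the function name `mec_score`, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def mec_score(clusters):
    """Return the MEC score of a read partition.

    Each cluster is a list of reads; each read is a dict mapping a variant
    position to the allele observed there (a hypothetical toy representation).
    """
    total = 0
    for reads in clusters:
        # Count the alleles seen at each position within this cluster.
        counts = {}
        for read in reads:
            for pos, allele in read.items():
                counts.setdefault(pos, Counter())[allele] += 1
        # Each observation that disagrees with the majority allele at its
        # position costs one correction.
        for allele_counts in counts.values():
            total += sum(allele_counts.values()) - max(allele_counts.values())
    return total


# Toy reads (illustrative only): two from one haplotype, two from another.
reads = [
    {0: "A", 1: "C", 2: "G"},
    {0: "A", 1: "C"},
    {1: "T", 2: "T"},
    {0: "G", 1: "T", 2: "T"},
]
good = [[reads[0], reads[1]], [reads[2], reads[3]]]
bad = [[reads[0], reads[3]], [reads[1], reads[2]]]

print(mec_score(good), mec_score(bad))
```

A partition that groups reads from the same haplotype needs fewer corrections than one that mixes haplotypes, which is why a lower MEC score is the usual clustering target in this kind of read phasing.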