Shaw Jim, Yu Yun William
Department of Mathematics, University of Toronto, Toronto, Ontario, Canada.
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Nat Biotechnol. 2024 Oct 8. doi: 10.1038/s41587-024-02412-y.
Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
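The abstract names zero-inflated Poisson k-mer statistics as the basis for sylph's containment ANI estimate. The following is a minimal sketch of that general idea, not sylph's actual implementation. It assumes that each genome k-mer is either absent from the sample because of sequence divergence (probability 1 - ANI^k) or sampled with a Poisson-distributed multiplicity of mean lambda. The function name, the k = 31 default, the use of the ratio of count-2 to count-1 k-mers to estimate lambda, and the fallback when that ratio is unavailable are all illustrative assumptions.

```python
import math
from collections import Counter


def estimate_containment_ani(multiplicities, k=31):
    """Sketch of a coverage-adjusted containment ANI estimate.

    multiplicities: one integer per genome k-mer, giving how often that
    k-mer occurs in the metagenome reads (0 if it is never observed).
    k: k-mer length (31 is an assumption made for this example).
    """
    n_total = len(multiplicities)
    if n_total == 0:
        raise ValueError("need at least one genome k-mer")

    hist = Counter(multiplicities)
    n_present = n_total - hist[0]
    naive = n_present / n_total  # fraction of genome k-mers seen at all

    n1, n2 = hist[1], hist[2]
    if n1 > 0 and n2 > 0:
        # For Poisson counts, P(2) / P(1) = lambda / 2.
        lam = 2.0 * n2 / n1
        # Under the zero-inflated model, P(count > 0) = ANI^k * (1 - e^-lambda),
        # so dividing by (1 - e^-lambda) removes the zeros caused by low coverage.
        corrected = min(1.0, naive / (1.0 - math.exp(-lam)))
    else:
        # Simplification for this sketch: no coverage correction is applied
        # when lambda cannot be estimated from the count-1 and count-2 bins.
        lam = None
        corrected = naive

    # Containment is modeled as ANI^k, so ANI is its k-th root.
    ani = corrected ** (1.0 / k) if corrected > 0 else 0.0
    return {"lambda": lam, "naive": naive, "corrected": corrected, "ani": ani}
```

For a low-coverage genome, many k-mers are missing only because too few reads were sampled, so the uncorrected containment understates ANI; dividing by 1 - e^(-lambda) compensates for that under the stated model.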