School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, Pennsylvania 16802, United States.
Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii165-ii173. doi: 10.1093/bioinformatics/btae397.
Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. The traditional and most widely used functional profilers in metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. k-mer-based sketching techniques have been used successfully in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe how FracMinHash, a k-mer-sketching algorithm implemented in the publicly available software sourmash, can be used to obtain functional profiles of metagenome samples.
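To make the sketching idea concrete, the following is a minimal, illustrative Python sketch of FracMinHash and of a containment estimate between two sketches. It is not the sourmash implementation: the function names, the BLAKE2b hash (sourmash uses its own hash function), and the default parameters are assumptions chosen for illustration, and reverse-complement (canonical) k-mer handling is omitted.

```python
import hashlib

# Size of the hash space for a 64-bit hash (assumption for this sketch).
MAX_HASH = 2**64


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash_sketch(seq, k=21, scaled=1000):
    """Keep only the k-mer hashes that fall below MAX_HASH / scaled.

    On average this retains about a 1/scaled fraction of distinct k-mers.
    """
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query_sketch, ref_sketch):
    """Estimate the fraction of the query's k-mers present in the reference."""
    if not query_sketch:
        return 0.0
    return len(query_sketch & ref_sketch) / len(query_sketch)
```

Because the same hash threshold is applied to every sequence, sketches built with the same `k` and `scaled` values can be compared directly by set intersection, which is what makes comparison against a large reference database cheap relative to alignment.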
We show how components of the sourmash software (and the resulting FracMinHash sketches) can be combined into a pipeline that functionally profiles a metagenomic sample. We named our pipeline fmh-funprofiler. On simulated metagenomic data, the functional profiles obtained with this pipeline show comparable completeness and better purity than the profiles obtained with alignment-based methods. We also report that fmh-funprofiler is 39-99× faster in wall-clock time and consumes 40-55× less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets.
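As a hedged illustration of the two accuracy measures mentioned above, the sketch below assumes that completeness and purity carry their common meaning of recall and precision over the set of functional units (for example, KEGG orthologs) present in a sample. The function name and the KO identifiers in the usage example are placeholders, not values from the study.

```python
def completeness_and_purity(predicted, truth):
    """Return (completeness, purity) for a predicted functional profile.

    completeness = true positives / size of the ground truth (recall)
    purity       = true positives / size of the prediction   (precision)
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    completeness = true_positives / len(truth) if truth else 0.0
    purity = true_positives / len(predicted) if predicted else 0.0
    return completeness, purity


# Usage with placeholder identifiers: two of three predictions are correct,
# and two of the four true functions are recovered.
example = completeness_and_purity(
    predicted={"K00001", "K00002", "K00003"},
    truth={"K00001", "K00002", "K00004", "K00005"},
)
```

A profiler with high purity reports few functions that are absent from the sample, while high completeness means few true functions are missed.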
This fast and lightweight metagenomic functional profiler is freely available at https://github.com/KoslickiLab/fmh-funprofiler. All scripts for the analyses presented in this manuscript can be found on GitHub.