Streaming histogram sketching for rapid microbiome analytics.

Affiliations

Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK.

IBM Research, The Hartree Centre, Warrington, UK.

Publication information

Microbiome. 2019 Mar 16;7(1):40. doi: 10.1186/s40168-019-0653-2.

DOI:10.1186/s40168-019-0653-2

PMID:30878035

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6420756/

Abstract

BACKGROUND

The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.

RESULTS

We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed 'histosketch' that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a 'real life' example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.
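As a rough illustration of the idea of sketching a k-mer spectrum, the following Python sketch counts k-mers from a stream of reads, computes the exact weighted Jaccard similarity between two spectra, and then estimates that similarity from fixed-size sketches. It is not the histosketch algorithm or the HULK implementation: it swaps in a naive MinHash over count-expanded k-mers, which only works for integer counts and is far slower than a real streaming sketch. All function names, the value of k and the sketch size are assumptions chosen for the example.

```python
import hashlib
from collections import Counter


def kmer_spectrum(reads, k=7):
    """Count every k-mer (length-k substring) seen across a stream of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weighted_jaccard(a, b):
    """Exact weighted Jaccard: sum of minimum counts over sum of maximum counts."""
    keys = set(a) | set(b)
    num = sum(min(a[x], b[x]) for x in keys)
    den = sum(max(a[x], b[x]) for x in keys)
    return num / den if den else 1.0


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(spectrum, size=64):
    """Naive stand-in sketch (assumption, not histosketch).

    Each k-mer with count c is expanded into c distinct elements, so the
    plain Jaccard similarity of the expanded sets equals the weighted
    Jaccard similarity of the spectra. One minimum hash is kept per seed.
    """
    elements = [f"{kmer}#{i}" for kmer, count in spectrum.items() for i in range(count)]
    if not elements:
        raise ValueError("cannot sketch an empty spectrum")
    return [min(_hash(seed, e) for e in elements) for seed in range(size)]


def estimate_similarity(s1, s2):
    """Fraction of sketch slots that agree, an estimate of the Jaccard similarity."""
    if len(s1) != len(s2):
        raise ValueError("sketches must have the same size")
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```

Comparing `estimate_similarity(sketch(a), sketch(b))` with `weighted_jaccard(a, b)` on small inputs shows the trade-off: the sketch has a fixed size regardless of how many k-mers were seen, at the cost of an approximation error that shrinks as the sketch size grows.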

CONCLUSIONS

Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).
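To make the indexing step more concrete, here is a minimal Python sketch of one common locality sensitive hashing scheme, banding, applied to fixed-length sketches. The abstract does not say which LSH scheme is used, so the banding approach, the class name `BandedLSHIndex` and its parameters are assumptions for illustration only. The sketch is split into bands, and two samples become search candidates if any band matches exactly.

```python
from collections import defaultdict


class BandedLSHIndex:
    """Hypothetical banded LSH index over fixed-length integer sketches."""

    def __init__(self, bands, rows):
        self.bands = bands
        self.rows = rows
        # One hash table per band, mapping a band's values to sample names.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _keys(self, sketch):
        """Yield (band number, band contents) for each band of the sketch."""
        if len(sketch) != self.bands * self.rows:
            raise ValueError("sketch length must equal bands * rows")
        for b in range(self.bands):
            yield b, tuple(sketch[b * self.rows:(b + 1) * self.rows])

    def add(self, name, sketch):
        """Index a named sample under every band of its sketch."""
        for b, key in self._keys(sketch):
            self.tables[b][key].add(name)

    def query(self, sketch):
        """Return names of indexed samples sharing at least one full band."""
        hits = set()
        for b, key in self._keys(sketch):
            hits |= self.tables[b].get(key, set())
        return hits
```

With more rows per band, a match requires closer agreement between sketches, so fewer but more similar candidates are returned; with more bands, recall rises. Candidates returned by `query` would normally be re-scored with a full sketch comparison.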
