Streaming histogram sketching for rapid microbiome analytics.

Affiliations

Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK.

IBM Research, The Hartree Centre, Warrington, UK.

Publication information

Microbiome. 2019 Mar 16;7(1):40. doi: 10.1186/s40168-019-0653-2.

DOI:10.1186/s40168-019-0653-2
PMID:30878035
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6420756/
Abstract

BACKGROUND

The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.
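
As a rough illustration of what a streaming k-mer spectrum is, the sketch below counts k-mers from reads as they arrive and folds them into a fixed number of histogram bins. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the choice of `k=4`, the 64 bins and the BLAKE2b hash are all hypothetical and chosen only to keep the example small.

```python
import hashlib
from collections import Counter


def kmers(read, k):
    """Yield every overlapping k-mer of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def update_spectrum(spectrum, read, k=4, bins=64):
    """Add one read's k-mers to a fixed-size k-mer spectrum (hypothetical helper).

    Each k-mer is hashed to one of `bins` histogram bins, so memory use
    stays constant no matter how many reads are streamed in.
    """
    for kmer in kmers(read, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        spectrum[int.from_bytes(digest, "big") % bins] += 1
    return spectrum


# Process reads one at a time, as they would arrive from a data stream.
spectrum = Counter()
for read in ["ACGTACGTAC", "TTGACGTACG"]:
    update_spectrum(spectrum, read)

print(sum(spectrum.values()))  # total number of k-mers counted
```

Each read is discarded once its k-mers have been counted, so only the binned spectrum has to be held in memory.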

RESULTS

We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed 'histosketch' that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a 'real life' example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.

CONCLUSIONS

Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK; https://github.com/will-rowe/hulk), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space.


