Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons.

Affiliations

Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA.

Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA.

Publication information

Gigascience. 2019 Feb 1;8(2):giy165. doi: 10.1093/gigascience/giy165.

DOI:10.1093/gigascience/giy165

PMID:30597002

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6354030/

Abstract

BACKGROUND

Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.

RESULTS

We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.
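To make the distance calculation concrete, the following is a minimal single-machine sketch of cosine similarity over k-mer abundance profiles. It is not Libra's Hadoop implementation: the function names `kmer_counts` and `cosine_similarity`, the toy reads, and the choice of `k = 4` are assumptions made for illustration, and the sketch applies no abundance weighting or reverse-complement handling.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer across all reads (hypothetical helper, not Libra's API)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two k-mer abundance vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


k = 4  # illustrative value only
sample_a = ["ACGTACGTAC", "TTGACCA"]
sample_b = sample_a * 10               # same reads at 10x the depth
sample_c = ["GGGCCCGGGC", "TTGACCA"]   # partly different content

sim_ab = cosine_similarity(kmer_counts(sample_a, k), kmer_counts(sample_b, k))
sim_ac = cosine_similarity(kmer_counts(sample_a, k), kmer_counts(sample_c, k))
print(sim_ab, sim_ac)
```

Because cosine similarity depends only on the direction of the abundance vector, multiplying every count by the same factor leaves the score unchanged. In this toy case `sim_ab` equals 1 up to floating-point rounding, which is one way to read "normalizing for sequencing depth", while `sim_ac` is lower because the two samples share only part of their k-mer content.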

CONCLUSIONS

A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
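The loss of resolution from presence/absence metrics can be illustrated with a small sketch. The abstract does not name a specific presence/absence metric, so Jaccard distance is used here as an assumed, commonly used example. The two k-mer profiles are invented toy data, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def jaccard_distance(a, b):
    """Presence/absence distance: ignores how often each k-mer occurs."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return 1.0 - len(keys_a & keys_b) / len(union)


def cosine_distance(a, b):
    """Abundance-aware distance: 1 minus the cosine similarity of the counts."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


# Toy profiles (invented): the same two k-mers, with opposite dominance.
profile_x = Counter({"ACGT": 90, "TTGA": 10})
profile_y = Counter({"ACGT": 10, "TTGA": 90})

d_jaccard = jaccard_distance(profile_x, profile_y)
d_cosine = cosine_distance(profile_x, profile_y)
print(d_jaccard, d_cosine)
```

Both profiles contain the same set of k-mers, so the presence/absence distance is zero and the samples look identical. The abundance-based distance is clearly non-zero because the dominant k-mer differs between the two.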
