Movi: A fast and cache-efficient full-text pangenome index

Authors

Zakeri Mohsen, Brown Nathaniel K, Ahmed Omar Y, Gagie Travis, Langmead Ben

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, US.

Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada.

Publication information

iScience. 2024 Nov 27;27(12):111464. doi: 10.1016/j.isci.2024.111464. eCollection 2024 Dec 20.

DOI:10.1016/j.isci.2024.111464

PMID:39758981

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11696632/

Abstract

Pangenome indexes are promising tools for many applications, including classification of nanopore sequencing reads. The move structure is a compressed-index data structure based on the Burrows-Wheeler Transform (BWT). It offers O(1)-time queries in O(r) space, where r is the number of BWT runs (maximal sequences of consecutive identical characters). We developed Movi, an index based on the move structure, for indexing and querying pangenomes. Movi scales well on repetitive text because its size grows only with r. By minimizing the number of cache misses and using memory prefetching to hide some latency, Movi computes sophisticated matching queries for classification, such as pseudo-matching lengths and backward search, up to 30 times faster than existing methods. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
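
To make the idea of a run-based index concrete, the following is a minimal Python sketch of a move-structure-style table: one row per BWT run, plus an LF-mapping step that jumps by table lookup and then "fast-forwards" to the run containing the destination. It is an illustration under stated assumptions, not Movi's implementation: the function names (`bwt`, `build_move_table`, `lf_step`) and the row layout are hypothetical, the BWT is built naively from sorted rotations, and the run splitting that bounds the fast-forward loop by a constant is omitted, as are Movi's cache-aware layout and prefetching.

```python
from bisect import bisect_right
from itertools import groupby


def bwt(text):
    """Naive BWT via sorted rotations (suitable for toy inputs only)."""
    s = text + "$"  # assumption: '$' is absent from text and sorts before letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_move_table(L):
    """Return one row per BWT run: (char, start, length, lf_of_start, dest_run)."""
    # Split the BWT into maximal runs of identical characters.
    runs, pos = [], 0
    for ch, group in groupby(L):
        n = sum(1 for _ in group)
        runs.append((ch, pos, n))
        pos += n

    # C[c] = number of characters in L that are smaller than c.
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)

    starts = [start for _, start, _ in runs]
    seen = {}  # occurrences of each character before the current run
    table = []
    for ch, start, n in runs:
        lf_head = C[ch] + seen.get(ch, 0)  # LF of the first position in the run
        seen[ch] = seen.get(ch, 0) + n
        dest = bisect_right(starts, lf_head) - 1  # run that contains lf_head
        table.append((ch, start, n, lf_head, dest))
    return table


def lf_step(table, pos, run):
    """Map BWT position `pos` (inside run `run`) to (LF(pos), run containing it)."""
    _, start, _, lf_head, dest = table[run]
    new_pos = lf_head + (pos - start)  # LF preserves offsets within a run
    # Fast-forward: without run splitting this loop is not constant-bounded.
    while new_pos >= table[dest][1] + table[dest][2]:
        dest += 1
    return new_pos, dest
```

For the toy text "banana", `bwt` returns "annb$aa", which has r = 5 runs, so `build_move_table` produces five rows regardless of how the seven positions are distributed. The table size depends on r rather than on the text length, which is the property the abstract relies on for repetitive pangenomes.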


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ae1/11696632/5940f8658d16/fx1.jpg



