
MEM-based pangenome indexing for k-mer queries.

Authors

Hwang Stephen, Brown Nathaniel K, Ahmed Omar Y, Jenike Katharine M, Kovaka Sam, Schatz Michael C, Langmead Ben

Affiliations

XDBio Program, Johns Hopkins University, Baltimore MD, USA.

Department of Computer Science, Johns Hopkins University, Baltimore MD, USA.

Publication information

bioRxiv. 2024 May 22:2024.05.20.595044. doi: 10.1101/2024.05.20.595044.

DOI:10.1101/2024.05.20.595044
PMID:38826299
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11142109/
Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
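The link between MEMs and k-mer queries can be illustrated with a toy sketch. It rests on one observation: a k-mer at a given position of a pivot sequence occurs in another genome exactly when some maximal exact match between the two sequences fully covers that k-mer, so one set of MEM intervals can serve any value of k. This is not MEMO's implementation. The function names (`find_mems`, `kmer_covered`, `membership`, `conservation`), the pivot-sequence framing, and the naive quadratic MEM search are all assumptions made for illustration, and a real index would compute and store MEMs far more compactly.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the pivot sequence


def find_mems(pivot: str, other: str) -> List[Interval]:
    """Naive MEM search between pivot and other (illustration only).

    Returns the pivot-side intervals of all maximal exact matches.
    """
    intervals = set()
    for i in range(len(pivot)):
        for j in range(len(other)):
            if pivot[i] != other[j]:
                continue
            # A match that can be extended to the left is not maximal here.
            if i > 0 and j > 0 and pivot[i - 1] == other[j - 1]:
                continue
            length = 0
            while (i + length < len(pivot) and j + length < len(other)
                   and pivot[i + length] == other[j + length]):
                length += 1
            intervals.add((i, i + length))
    return sorted(intervals)


def kmer_covered(mems: List[Interval], pos: int, k: int) -> bool:
    """True if some MEM interval fully contains pivot[pos:pos + k]."""
    return any(s <= pos and pos + k <= e for s, e in mems)


def membership(pivot: str, other: str, start: int, end: int, k: int) -> List[bool]:
    """Presence/absence in `other` of each k-mer in pivot[start:end]."""
    mems = find_mems(pivot, other)
    return [kmer_covered(mems, p, k) for p in range(start, end - k + 1)]


def conservation(pivot: str, others: List[str], start: int, end: int, k: int) -> List[int]:
    """Number of genomes containing each k-mer in pivot[start:end]."""
    all_mems = [find_mems(pivot, g) for g in others]  # built once, reused for any k
    return [sum(kmer_covered(m, p, k) for m in all_mems)
            for p in range(start, end - k + 1)]


pivot = "ACGTACGT"                    # hypothetical toy sequences
others = ["ACGTTCGT", "GGGTACGA"]
print(membership(pivot, others[0], 0, 8, 3))  # [True, True, False, False, True, True]
print(conservation(pivot, others, 0, 8, 3))   # [2, 1, 1, 1, 2, 1]
```

In this sketch the MEM intervals do not depend on k, so the same intervals answer a query for any k-mer length. That mirrors the abstract's point about the lack of k-mer length dependence.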


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7bd5/11142109/dd7ba9a83e46/nihpp-2024.05.20.595044v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7bd5/11142109/cf5a51e1771d/nihpp-2024.05.20.595044v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7bd5/11142109/698fd7679fe5/nihpp-2024.05.20.595044v1-f0002.jpg

Similar articles

1
MEM-based pangenome indexing for k-mer queries.
bioRxiv. 2024 May 22:2024.05.20.595044. doi: 10.1101/2024.05.20.595044.
2
MEM-based pangenome indexing for k-mer queries.
Algorithms Mol Biol. 2025 Mar 1;20(1):3. doi: 10.1186/s13015-025-00272-y.
3
Lossless indexing with counting de Bruijn graphs.
Genome Res. 2022 Sep 27;32(9):1754-1764. doi: 10.1101/gr.276607.122.
4
PanKmer: k-mer-based and reference-free pangenome analysis.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad621.
5
Prokrustean Graph: A substring index for rapid k-mer size analysis.
bioRxiv. 2024 Dec 20:2023.11.21.568151. doi: 10.1101/2023.11.21.568151.
6
Compression Algorithm for Colored de Bruijn Graphs.
Leibniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.17. Epub 2023 Aug 29.
7
Fulgor: A fast and compact k-mer index for large-scale matching and color queries.
bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.
8
Squeakr: an exact and approximate k-mer counting system.
Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.
9
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185. doi: 10.1093/bioinformatics/btaa487.
10
Haplotype Matching with GBWT for Pangenome Graphs.
bioRxiv. 2025 Feb 7:2025.02.03.634410. doi: 10.1101/2025.02.03.634410.

References cited in this article

1
A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range.
Nat Genet. 2024 May;56(5):982-991. doi: 10.1038/s41588-024-01715-9. Epub 2024 Apr 11.
2
PanKmer: k-mer-based and reference-free pangenome analysis.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad621.
3
The complete sequence of a human Y chromosome.
Nature. 2023 Sep;621(7978):344-354. doi: 10.1038/s41586-023-06457-y. Epub 2023 Aug 23.
4
k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean.
Plant Genome. 2023 Dec;16(4):e20374. doi: 10.1002/tpg2.20374. Epub 2023 Aug 18.
5
Efficient taxa identification using a pangenome index.
Genome Res. 2023 Jul;33(7):1069-1077. doi: 10.1101/gr.277642.123. Epub 2023 May 31.
6
SPUMONI 2: improved classification using a pangenome index of minimizer digests.
Genome Biol. 2023 May 18;24(1):122. doi: 10.1186/s13059-023-02958-1.
7
Human leukocyte antigen super-locus: nexus of genomic supergenes, SNPs, indels, transcripts, and haplotypes.
Hum Genome Var. 2022 Dec 21;9(1):49. doi: 10.1038/s41439-022-00226-5.
8
The Human Pangenome Project: a global resource to map genomic diversity.
Nature. 2022 Apr;604(7906):437-446. doi: 10.1038/s41586-022-04601-8. Epub 2022 Apr 20.
9
The complete sequence of a human genome.
Science. 2022 Apr;376(6588):44-53. doi: 10.1126/science.abj6987. Epub 2022 Mar 31.
10
MONI: A Pangenomic Index for Finding Maximal Exact Matches.
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.