Improved pangenomic classification accuracy with chain statistics.

Authors

Brown Nathaniel K, Shivakumar Vikram S, Langmead Ben

Affiliation

Department of Computer Science, Johns Hopkins University, Baltimore MD 21218.

Publication

bioRxiv. 2024 Nov 2:2024.10.29.620953. doi: 10.1101/2024.10.29.620953.

DOI: 10.1101/2024.10.29.620953
PMID: 39554056
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11565826/

Abstract

Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity ("chain") statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an O(r + n/d)-space index for a collection of d sequences having a length-n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.
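
To make two of the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' index or algorithm. It computes matching statistics by brute force (for each query position, the length of the longest substring starting there that occurs in the reference) and counts the maximal equal-character runs of a BWT built from sorted rotations. The function names `matching_statistics`, `bwt`, and `count_runs`, the single-string reference, and the `$` terminator are assumptions made for illustration. The quadratic-time loops stand in for what a compressed index would compute in time linear in the query length, and the coarse-grained chain statistics are not shown.

```python
def matching_statistics(query: str, reference: str) -> list[int]:
    """Naive matching statistics (illustrative only, not the paper's method).

    ms[i] is the length of the longest prefix of query[i:] that occurs
    somewhere in reference.
    """
    ms = []
    for i in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Assumes '$' does not occur in text and sorts before every other character.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-character runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, len(transformed), count_runs(transformed))
    print(matching_statistics("anab", "banana"))
```

For the toy reference `banana`, the BWT is `annb$aa`, which has length 7 and 5 runs, and the query `anab` has matching statistics `[3, 2, 1, 1]`. On repetitive collections such as pangenomes the number of runs is typically far smaller than the BWT length, which is why run-based space bounds are attractive.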

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cffe/11565826/24d9d2c996c7/nihpp-2024.10.29.620953v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cffe/11565826/27f9e5c37188/nihpp-2024.10.29.620953v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cffe/11565826/3876241e6e23/nihpp-2024.10.29.620953v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cffe/11565826/ce320c7080cb/nihpp-2024.10.29.620953v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cffe/11565826/5b9004406331/nihpp-2024.10.29.620953v1-f0003.jpg

Similar articles

1
Improved pangenomic classification accuracy with chain statistics.
bioRxiv. 2024 Nov 2:2024.10.29.620953. doi: 10.1101/2024.10.29.620953.
2
Haplotype Matching with GBWT for Pangenome Graphs.
bioRxiv. 2025 Feb 7:2025.02.03.634410. doi: 10.1101/2025.02.03.634410.
3
Movi: a fast and cache-efficient full-text pangenome index.
bioRxiv. 2024 Feb 15:2023.11.04.565615. doi: 10.1101/2023.11.04.565615.
4
MEM-based pangenome indexing for k-mer queries.
bioRxiv. 2024 May 22:2024.05.20.595044. doi: 10.1101/2024.05.20.595044.
5
Finding maximal exact matches in graphs.
Algorithms Mol Biol. 2024 Mar 11;19(1):10. doi: 10.1186/s13015-024-00255-5.
6
Mem-based pangenome indexing for k-mer queries.
Algorithms Mol Biol. 2025 Mar 1;20(1):3. doi: 10.1186/s13015-025-00272-y.
7
Cliffy: robust 16S rRNA classification based on a compressed LCA index.
bioRxiv. 2024 May 30:2024.05.25.595899. doi: 10.1101/2024.05.25.595899.
8
Movi: A fast and cache-efficient full-text pangenome index.
iScience. 2024 Nov 27;27(12):111464. doi: 10.1016/j.isci.2024.111464. eCollection 2024 Dec 20.
9
Compressed indexing and local alignment of DNA.
Bioinformatics. 2008 Mar 15;24(6):791-7. doi: 10.1093/bioinformatics/btn032. Epub 2008 Jan 28.
10
Run-length compressed metagenomic read classification with SMEM-finding and tagging.
bioRxiv. 2025 Mar 24:2025.02.25.640119. doi: 10.1101/2025.02.25.640119.
