
Efficient taxa identification using a pangenome index.

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA;

Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA.

Publication information

Genome Res. 2023 Jul;33(7):1069-1077. doi: 10.1101/gr.277642.123. Epub 2023 May 31.

DOI:10.1101/gr.277642.123
PMID:37258301
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10538492/
Abstract

Tools that classify sequencing reads against a database of reference sequences require efficient index data structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy than other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
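To make the two central notions in the abstract concrete, the following Python sketch shows what a document listing query returns and how the number of Burrows-Wheeler runs r can be counted. It is a naive baseline under stated assumptions, not the method of the paper: it uses a plain, uncompressed suffix array built by sorting suffixes, it locates every occurrence and then maps each one to its document, and all names (`build_index`, `bwt_runs`, `document_listing`) and the separator characters are hypothetical choices made for illustration.

```python
from bisect import bisect_right


def build_index(docs):
    """Concatenate documents and build a plain suffix array (naive construction)."""
    # Assumption: '\x01' and '\x00' never occur inside a document.
    # '\x01' separates documents; '\x00' terminates the whole text.
    text = "".join(d + "\x01" for d in docs) + "\x00"
    starts = []  # starts[i] = offset in text where document i begins
    pos = 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + 1
    # Sorting all suffixes is quadratic or worse; fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts


def bwt_runs(text, sa):
    """Count r, the number of maximal equal-character runs in the BWT."""
    # BWT[k] is the character preceding suffix sa[k]; index -1 wraps to the terminator.
    bwt = [text[i - 1] for i in sa]
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


def document_listing(pattern, text, sa, starts):
    """Return the sorted ids of documents that contain a non-empty pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    right = lo
    # Map each occurrence position to the document whose range contains it.
    return sorted({bisect_right(starts, sa[k]) - 1 for k in range(left, right)})


docs = ["ACGTACGT", "TTACGA", "GGGCCC"]
text, sa, starts = build_index(docs)
hits = document_listing("ACG", text, sa, starts)
r = bwt_runs(text, sa)
```

The cost of this baseline grows with the number of occurrences of the pattern, because every occurrence is visited before duplicates are discarded. The abstract describes an index whose query time depends on the number of distinct documents containing the pattern instead, which is the property this sketch does not have.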

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8b4/10538492/ca693ca582ac/1069f01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8b4/10538492/71c88efc1894/1069f02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8b4/10538492/18cff8777ba7/1069f03.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b8b4/10538492/5428070c0931/1069f04.jpg
