Efficient counting of k-mers in DNA sequences using a bloom filter.

Affiliation

Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA.

Publication information

BMC Bioinformatics. 2011 Aug 10;12:333. doi: 10.1186/1471-2105-12-333.

DOI:10.1186/1471-2105-12-333
PMID:21831268
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3166945/
Abstract

BACKGROUND

Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.
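To make the terms above concrete, here is a minimal Python sketch of the straightforward approach: every k-mer goes into an in-memory hash table, and the share of distinct k-mers seen exactly once is then measured. The function names `count_kmers` and `singleton_fraction` are illustrative assumptions, not taken from the paper or from any existing tool.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (naive baseline)."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def singleton_fraction(counts):
    """Fraction of distinct k-mers that were observed exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "TTACGTAA"]  # made-up reads, not real data
    table = count_kmers(toy_reads, k=4)
    print(dict(table))
    print(singleton_fraction(table))
```

In this baseline every distinct k-mer, singleton or not, occupies a table entry, which is the memory cost the abstract describes.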

RESULTS

We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.
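The two-sweep idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the BFCounter implementation: the `BloomFilter` class, its BLAKE2-based hashing, the default `num_bits` and `num_hashes` values, and the function names are all chosen here for illustration, and details such as treating a k-mer and its reverse complement as one are omitted.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive each hash function by salting BLAKE2b differently.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_nonunique_kmers(reads, k, num_bits=1 << 20, num_hashes=4):
    """Return exact counts of k-mers seen more than once.

    `reads` must be re-iterable (for example a list), since it is swept twice.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    counts = {}
    # Sweep 1: a k-mer enters the table only if the filter says it was seen before.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in bloom:
                counts.setdefault(kmer, 0)
            else:
                bloom.add(kmer)
    # Sweep 2: count exactly, but only for k-mers already in the table.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1
    # Bloom-filter false positives show up with a count of 1; discard them.
    return {kmer: c for kmer, c in counts.items() if c > 1}


if __name__ == "__main__":
    print(count_nonunique_kmers(["ACGTACGT", "TTACGTAA"], k=4))
```

Because a Bloom filter has no false negatives, every k-mer that truly repeats reaches the table, and the second sweep removes any singleton admitted by a false positive, so the final counts are exact. The memory saving in this sketch comes from most singleton k-mers occupying only a few bits in the filter instead of a full table entry.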

CONCLUSIONS

A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html.
