Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates.

Author

Press William H

Affiliation

Department of Computer Science and Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA.

Publication

PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.

DOI:10.1093/pnasnexus/pgac252
PMID:36712375
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9802387/
Abstract

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at ∼10% or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's Graphics Processing Units (GPUs), even commodity-grade ones. With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
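
The two-stage idea in the abstract (cheap trimer triage first, slow edit-distance comparison only on a shortlist) can be illustrated with a minimal pure-Python sketch. The abstract does not specify how trimer statistics are compared, so the L1 distance between trimer count vectors used here is an assumption, as are the helper names (`trimer_counts`, `triage_distance`, `decode`, `random_barcodes`) and the `n_candidates` and `max_dist` thresholds. It runs on the CPU and makes no attempt to reproduce the GPU parallelism or the accuracy figures reported above.

```python
import random
from itertools import product

ALPHABET = "ACGT"
# Map each of the 64 possible trimers to a fixed slot in the count vector.
TRIMER_INDEX = {"".join(t): i for i, t in enumerate(product(ALPHABET, repeat=3))}


def random_barcodes(n, length, seed=0):
    """Generate n random barcodes of the given length (illustrative helper)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]


def trimer_counts(seq):
    """Count occurrences of every trimer in one linear pass over seq."""
    counts = [0] * 64
    for i in range(len(seq) - 2):
        counts[TRIMER_INDEX[seq[i:i + 3]]] += 1
    return counts


def triage_distance(a, b):
    """L1 distance between two trimer count vectors (assumed triage statistic)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def levenshtein(a, b):
    """Classic O(len(a) * len(b)) edit distance over substitutions and indels."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def decode(read, barcodes, barcode_counts, n_candidates=16, max_dist=8):
    """Return the index of the best-matching barcode, or None to reject the read."""
    read_counts = trimer_counts(read)
    # Stage 1: rank all barcodes by the cheap trimer statistic and keep a shortlist.
    ranked = sorted(range(len(barcodes)),
                    key=lambda k: triage_distance(read_counts, barcode_counts[k]))
    shortlist = ranked[:n_candidates]
    # Stage 2: run the slow edit-distance comparison only on the shortlist.
    dist, best = min((levenshtein(read, barcodes[k]), k) for k in shortlist)
    # Rejecting distant reads trades recall (acceptance rate) for precision.
    return best if dist <= max_dist else None
```

In use, the barcode count vectors would be computed once, for example `counts = [trimer_counts(b) for b in random_barcodes(200, 34)]`, and then reused for every read passed to `decode`. Raising `max_dist` accepts more reads at the risk of more wrong decodes, which is the precision and recall trade-off the abstract describes.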

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/8b71257ea6ac/pgac252fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/166927c4c342/pgac252fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/e1c714269dff/pgac252fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/a296708beeed/pgac252fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/8ccf464d816e/pgac252fig5.jpg

Similar articles

1. Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates.
   PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.
2. Indel-correcting DNA barcodes for high-throughput sequencing.
   Proc Natl Acad Sci U S A. 2018 Jul 3;115(27):E6217-E6226. doi: 10.1073/pnas.1802640115. Epub 2018 Jun 20.
3. Levenshtein error-correcting barcodes for multiplexed DNA sequencing.
   BMC Bioinformatics. 2013 Sep 11;14:272. doi: 10.1186/1471-2105-14-272.
4. Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing.
   BMC Bioinformatics. 2021 Jul 2;22(1):359. doi: 10.1186/s12859-021-04267-5.
5. Low-complexity and highly robust barcodes for error-rich single molecular sequencing.
   3 Biotech. 2021 Feb;11(2):78. doi: 10.1007/s13205-020-02607-5. Epub 2021 Jan 16.
6. Sequencing barcode construction and identification methods based on block error-correction codes.
   Sci China Life Sci. 2020 Oct;63(10):1580-1592. doi: 10.1007/s11427-019-1651-3. Epub 2020 Apr 14.
7. Insertion and deletion correcting DNA barcodes based on watermarks.
   BMC Bioinformatics. 2015 Feb 18;16:50. doi: 10.1186/s12859-015-0482-7.
8. DNA Barcoding through Quaternary LDPC Codes.
   PLoS One. 2015 Oct 22;10(10):e0140459. doi: 10.1371/journal.pone.0140459. eCollection 2015.
9. A MinION™-based pipeline for fast and cost-effective DNA barcoding.
   Mol Ecol Resour. 2018 Apr 19. doi: 10.1111/1755-0998.12890.
10. Designing robust watermark barcodes for multiplex long-read sequencing.
    Bioinformatics. 2017 Mar 15;33(6):807-813. doi: 10.1093/bioinformatics/btw322.
