Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates.

Author information

Press William H

Affiliation

Department of Computer Science and Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA.

Publication information

PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.

DOI:10.1093/pnasnexus/pgac252

PMID:36712375

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9802387/

Abstract

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10^4 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at ∼10% or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10^6 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurrence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available even on today's commodity-grade Graphics Processing Units (GPUs). With 10^6 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
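The two-stage idea in the abstract (cheap trimer triage first, exact edit distance only on the survivors) can be illustrated with a minimal serial sketch. This is not the paper's implementation: the L1 distance between 64-bin trimer count vectors, the shortlist size `keep`, the acceptance threshold `max_dist`, and all function names are assumptions chosen for illustration, and the GPU parallelism the abstract relies on is omitted.

```python
import heapq
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def make_random_barcodes(n, length, seed=0):
    """Draw n random barcodes of the given length (hypothetical helper)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def trimer_counts(seq):
    """64-bin vector of trimer occurrence counts, built in one linear pass."""
    counts = [0] * 64
    if len(seq) < 3:
        return counts
    idx = CODE[seq[0]] * 4 + CODE[seq[1]]
    for ch in seq[2:]:
        # Rolling update: shift in the new base, drop the oldest one.
        idx = (idx * 4 + CODE[ch]) % 64
        counts[idx] += 1
    return counts

def trimer_distance(u, v):
    """L1 distance between two trimer count vectors (assumed triage metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def levenshtein(s, t):
    """Standard dynamic-programming edit distance (substitutions and indels)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def decode(read, barcodes, barcode_counts, keep=8, max_dist=8):
    """Return the index of the best-matching barcode, or None if rejected."""
    read_counts = trimer_counts(read)
    # Triage: keep only the barcodes whose trimer statistics are closest.
    shortlist = heapq.nsmallest(
        keep,
        range(len(barcodes)),
        key=lambda i: trimer_distance(read_counts, barcode_counts[i]),
    )
    # Exact (slow) comparison is run only on the shortlist.
    dist, best = min((levenshtein(read, barcodes[i]), i) for i in shortlist)
    return best if dist <= max_dist else None
```

In use, the barcode trimer vectors would be precomputed once (`barcode_counts = [trimer_counts(b) for b in barcodes]`) and reused for every read. Rejecting reads whose best edit distance exceeds `max_dist` is one simple way to trade recall for precision; the threshold value here is arbitrary.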

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/8b71257ea6ac/pgac252fig1.jpg

Similar articles

Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates.

PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.

Indel-correcting DNA barcodes for high-throughput sequencing.

Proc Natl Acad Sci U S A. 2018 Jul 3;115(27):E6217-E6226. doi: 10.1073/pnas.1802640115. Epub 2018 Jun 20.

Levenshtein error-correcting barcodes for multiplexed DNA sequencing.

BMC Bioinformatics. 2013 Sep 11;14:272. doi: 10.1186/1471-2105-14-272.

Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing.

BMC Bioinformatics. 2021 Jul 2;22(1):359. doi: 10.1186/s12859-021-04267-5.

Low-complexity and highly robust barcodes for error-rich single molecular sequencing.

3 Biotech. 2021 Feb;11(2):78. doi: 10.1007/s13205-020-02607-5. Epub 2021 Jan 16.

Sequencing barcode construction and identification methods based on block error-correction codes.

Sci China Life Sci. 2020 Oct;63(10):1580-1592. doi: 10.1007/s11427-019-1651-3. Epub 2020 Apr 14.

Insertion and deletion correcting DNA barcodes based on watermarks.

BMC Bioinformatics. 2015 Feb 18;16:50. doi: 10.1186/s12859-015-0482-7.

DNA Barcoding through Quaternary LDPC Codes.

PLoS One. 2015 Oct 22;10(10):e0140459. doi: 10.1371/journal.pone.0140459. eCollection 2015.

A MinION™-based pipeline for fast and cost-effective DNA barcoding.

Mol Ecol Resour. 2018 Apr 19. doi: 10.1111/1755-0998.12890.

Designing robust watermark barcodes for multiplex long-read sequencing.

Bioinformatics. 2017 Mar 15;33(6):807-813. doi: 10.1093/bioinformatics/btw322.

References cited in this article

Robust and scalable barcoding for massively parallel long-read sequencing.

Sci Rep. 2022 May 10;12(1):7619. doi: 10.1038/s41598-022-11656-0.

Synthetic DNA applications in information technology.

Nat Commun. 2022 Jan 17;13(1):352. doi: 10.1038/s41467-021-27846-9.

Sequencing DNA with nanopores: Troubles and biases.

PLoS One. 2021 Oct 1;16(10):e0257521. doi: 10.1371/journal.pone.0257521. eCollection 2021.

Chemical and photochemical error rates in light-directed synthesis of complex DNA libraries.

Nucleic Acids Res. 2021 Jul 9;49(12):6687-6701. doi: 10.1093/nar/gkab505.

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints.

Proc Natl Acad Sci U S A. 2020 Aug 4;117(31):18489-18496. doi: 10.1073/pnas.2004821117. Epub 2020 Jul 16.

Indel-correcting DNA barcodes for high-throughput sequencing.

Proc Natl Acad Sci U S A. 2018 Jul 3;115(27):E6217-E6226. doi: 10.1073/pnas.1802640115. Epub 2018 Jun 20.

Multiplexed gene synthesis in emulsions for exploring protein functional landscapes.

Science. 2018 Jan 19;359(6373):343-347. doi: 10.1126/science.aao5167. Epub 2018 Jan 4.

Large-scale DNA Barcode Library Generation for Biomolecule Identification in High-throughput Screens.

Sci Rep. 2017 Oct 24;7(1):13899. doi: 10.1038/s41598-017-12825-2.

Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis.

F1000Res. 2017 Feb 3;6:100. doi: 10.12688/f1000research.10571.2. eCollection 2017.

A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.

Genome Med. 2017 Aug 18;9(1):75. doi: 10.1186/s13073-017-0467-4.
