Suppr超能文献

快速三聚体统计即使在测序错误率很高的情况下也有助于对大型随机DNA条形码集进行准确解码。

Fast trimer statistics facilitate accurate decoding of large random DNA barcode sets even at large sequencing error rates.

作者信息

Press William H

机构信息

Department of Computer Science and Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA.

出版信息

PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.

Abstract

Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.

摘要

预定义的短DNA序列集通常用作条形码,以识别混合群体中的单个生物分子。这种用途需要足够低的DNA错误率,或者一种错误校正方法。大多数现有的DNA纠错码(ECC)在通常≲10个条形码的集合中,每个条形码只能纠正一两个错误。我们在此考虑使用足够长的随机条形码,即使有≳6个错误,甚至在[公式:见正文]或20%的核苷酸错误率下,它们仍能被准确解码。我们表明,即使有≳10个条形码,长度约为34 nt也足够了。对该方案的一个明显反对意见是,它需要通过缓慢的莱文斯坦或尼德曼-翁施比较,将每个读数与每个可能的条形码进行比较。我们表明,可以通过以下方法实现几个数量级的加速:(i)一种快速分类方法,该方法仅比较三聚体(三个连续核苷酸)出现统计量,该统计量已针对读数和条形码在线性时间内预先计算;(ii)当今即使是消费级图形处理单元(GPU)也具备的大规模并行性。对于10个长度为34且DNA错误率为10%(替换和插入缺失)的条形码,我们在模拟中实现了99.9%的精度(解码准确率)和98.8%的召回率(读数接受率)。即使DNA错误率为20%,也可以实现类似的高精度,但召回率略低。在配备两个GPU(2022年的性能和价格)的消费级工作站上,每百万次解码读数的分摊计算成本估计在0.15美元至0.60美元之间。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/017f/9802387/8b71257ea6ac/pgac252fig1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验