Press William H
Department of Computer Science and Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA.
PNAS Nexus. 2022 Nov 4;1(5):pgac252. doi: 10.1093/pnasnexus/pgac252. eCollection 2022 Nov.
Predefined sets of short DNA sequences are commonly used as barcodes to identify individual biomolecules in pooled populations. Such use requires either sufficiently small DNA error rates, or else an error-correction methodology. Most existing DNA error-correcting codes (ECCs) correct only one or two errors per barcode in sets of typically ≲10 barcodes. We here consider the use of random barcodes of sufficient length that they remain accurately decodable even with ≳6 errors and even at [Formula: see text] or 20% nucleotide error rates. We show that length ∼34 nt is sufficient even with ≳10 barcodes. The obvious objection to this scheme is that it requires comparing every read to every possible barcode by a slow Levenshtein or Needleman-Wunsch comparison. We show that several orders of magnitude speedup can be achieved by (i) a fast triage method that compares only trimer (three consecutive nucleotide) occurence statistics, precomputed in linear time for both reads and barcodes, and (ii) the massive parallelism available on today's even commodity-grade Graphics Processing Units (GPUs). With 10 barcodes of length 34 and 10% DNA errors (substitutions and indels), we achieve in simulation 99.9% precision (decode accuracy) with 98.8% recall (read acceptance rate). Similarly high precision with somewhat smaller recall is achievable even with 20% DNA errors. The amortized computation cost on a commodity workstation with two GPUs (2022 capability and price) is estimated as between US$ 0.15 and US$ 0.60 per million decoded reads.
预定义的短DNA序列集通常用作条形码,以识别混合群体中的单个生物分子。这种用途需要足够低的DNA错误率,或者一种错误校正方法。大多数现有的DNA纠错码(ECC)在通常≲10个条形码的集合中,每个条形码只能纠正一两个错误。我们在此考虑使用足够长的随机条形码,即使有≳6个错误,甚至在[公式:见正文]或20%的核苷酸错误率下,它们仍能被准确解码。我们表明,即使有≳10个条形码,长度约为34 nt也足够了。对该方案的一个明显反对意见是,它需要通过缓慢的莱文斯坦或尼德曼-翁施比较,将每个读数与每个可能的条形码进行比较。我们表明,可以通过以下方法实现几个数量级的加速:(i)一种快速分类方法,该方法仅比较三聚体(三个连续核苷酸)出现统计量,该统计量已针对读数和条形码在线性时间内预先计算;(ii)当今即使是消费级图形处理单元(GPU)也具备的大规模并行性。对于10个长度为34且DNA错误率为10%(替换和插入缺失)的条形码,我们在模拟中实现了99.9%的精度(解码准确率)和98.8%的召回率(读数接受率)。即使DNA错误率为20%,也可以实现类似的高精度,但召回率略低。在配备两个GPU(2022年的性能和价格)的消费级工作站上,每百万次解码读数的分摊计算成本估计在0.15美元至0.60美元之间。