Department of Computer Science, The University of Texas at Austin, Austin, TX 78712;
Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712.
Proc Natl Acad Sci U S A. 2020 Aug 4;117(31):18489-18496. doi: 10.1073/pnas.2004821117. Epub 2020 Jul 16.
Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
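The sequence constraints mentioned above (limits on repeats and on windowed GC content) can be illustrated with a small checker. This is a minimal sketch and not the HEDGES implementation: the function names, the window length of 12, the GC bounds of 0.25 to 0.75, and the maximum run length of 4 are all hypothetical values chosen for illustration. It only tests whether a finished strand meets the constraints and does not reproduce how HEDGES enforces them during encoding.

```python
def gc_fraction(window: str) -> float:
    """Return the fraction of bases in `window` that are G or C."""
    return sum(base in "GC" for base in window) / len(window)


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of a single repeated base."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def satisfies_constraints(
    seq: str,
    window: int = 12,      # hypothetical window length
    gc_min: float = 0.25,  # hypothetical lower GC bound
    gc_max: float = 0.75,  # hypothetical upper GC bound
    max_run: int = 4,      # hypothetical longest allowed repeat
) -> bool:
    """Check a strand against run-length and windowed GC constraints.

    Only full-length windows are checked, so a strand shorter than
    `window` is judged on run length alone.
    """
    if max_homopolymer_run(seq) > max_run:
        return False
    for start in range(len(seq) - window + 1):
        gc = gc_fraction(seq[start:start + window])
        if not gc_min <= gc <= gc_max:
            return False
    return True
```

For example, `satisfies_constraints("ACGTACGTACGTACGT")` passes both checks, while a strand beginning with `AAAAA` fails the run-length check and a strand consisting only of A and T fails the GC bound.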