Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland.
Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA.
Nat Protoc. 2020 Jan;15(1):86-101. doi: 10.1038/s41596-019-0244-5. Epub 2019 Nov 29.
Because of its longevity and enormous information density, DNA is considered a promising data storage medium. In this work, we provide instructions for archiving digital information in DNA and for subsequently retrieving it. In principle, information can be stored by simply mapping the digital data to a nucleotide sequence and synthesizing the corresponding DNA. However, imperfections in synthesis, sequencing, storage and handling of the DNA introduce errors into the molecules, which makes error-free information storage challenging. The procedure described here enables error-free storage by protecting the information with error-correcting codes. Specifically, this protocol provides the technical details and precise instructions for translating digital information into DNA sequences, physically handling the biomolecules, storing them and subsequently recovering the information by sequencing the DNA. Along with the protocol, we provide computer code that automatically encodes digital information into DNA sequences and decodes sequenced DNA back into a digital file. The required software is available in a GitHub repository. The protocol relies on commercial DNA synthesis and on DNA sequencing via Illumina dye sequencing, and it requires 1-2 h of preparation time, 0.5 d for sequencing preparation and 2-4 h for data analysis. It focuses on storage scales of ~100 kB to 15 MB, offering an ideal starting point for small experiments. It can be extended to support higher data volumes and random access to the data, and it can accommodate future sequencing and synthesis technologies by changing the encoder/decoder parameters to match the corresponding error rates.
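The "simple mapping" mentioned in the abstract can be illustrated with a minimal sketch. This is not the encoder distributed with the protocol: the 2-bits-per-nucleotide assignment (00 to A, 01 to C, 10 to G, 11 to T) and the function names `bytes_to_dna` and `dna_to_bytes` are assumptions chosen for illustration only, and the sketch deliberately contains no error correction.

```python
# Illustrative sketch only: a naive 2-bits-per-nucleotide mapping.
# The base assignment below is an assumption, not the protocol's scheme,
# and no error-correcting code is applied.
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of 4."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    encoded = bytes_to_dna(b"A")        # 0x41 = 01 00 00 01
    print(encoded)                      # CAAC
    print(dna_to_bytes(encoded))        # b'A'
    # A single substitution (C -> G in the last position) silently
    # changes the decoded byte, which is why unprotected storage fails.
    print(dna_to_bytes("CAAG"))         # b'B'
```

The last call shows why such a mapping alone is insufficient: one substituted nucleotide yields a different, equally valid byte, with no indication that an error occurred. Detecting and correcting such events is the role of the error-correcting codes that the protocol adds.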