Yan Zihui, Zhang Haoran, Lu Boyuan, Han Tong, Tong Xiaoguang, Yuan Yingjin
Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China.
Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
Natl Sci Rev. 2024 Sep 10;12(1):nwae321. doi: 10.1093/nsr/nwae321. eCollection 2025 Jan.
The long-term preservation of large volumes of infrequently accessed cold data poses challenges to the storage community. Deoxyribonucleic acid (DNA) is considered a promising solution because of its inherent physical stability and high storage density. Information density and decoding sequence coverage are two important metrics that influence the efficiency of DNA data storage. In this study, we propose a novel coding scheme, the DNA palette code, that is suited to cold data, especially time-series archival datasets. Such datasets are rarely accessed but require reliable long-term storage for retrospective research. The DNA palette code represents binary information with unordered combinations of index-free oligonucleotides. It achieves encoding at high net information density and lossless decoding at low sequencing coverage. When sequencing reads are corrupted, it can still recover partial information, which prevents complete failure of file retrieval. Tests on the storage of clinical brain magnetic resonance imaging (MRI) data, together with simulation validations on large-scale public MRI datasets (10 GB), planetary science datasets and meteorological datasets, demonstrate the advantages of the coding scheme: high net information density, low decoding sequence coverage and wide applicability.
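The abstract does not describe how an unordered combination of oligonucleotides maps to bits, so the following is only a toy illustration of the general idea, not the DNA palette code itself. It assumes a hypothetical pool of six short sequences (POOL), picks three of them per block, and uses the standard combinatorial number system to convert between a bit string and a subset. With C(6, 3) = 20 possible subsets, each block carries floor(log2(20)) = 4 bits, and because the result is a set, no index or ordering is needed to decode it.

```python
from math import comb

# Hypothetical pool of candidate oligonucleotides (illustrative only,
# far shorter than real oligos and not taken from the paper).
POOL = ["ACGTAC", "CATGGA", "GTTCAG", "TGACCT", "AGGTCA", "CTAGTG"]
K = 3  # assumed number of oligos selected per block

# Bits that fit in one combination: floor(log2(C(n, k))).
BITS = comb(len(POOL), K).bit_length() - 1


def unrank_combination(value, n, k):
    """Map an integer in [0, C(n, k)) to a k-subset of range(n)."""
    if not 0 <= value < comb(n, k):
        raise ValueError("value out of range for C(n, k)")
    chosen = []
    for slots in range(k, 0, -1):
        # Find the largest c such that C(c, slots) <= value.
        c = slots - 1
        while comb(c + 1, slots) <= value:
            c += 1
        chosen.append(c)
        value -= comb(c, slots)
    return frozenset(chosen)


def rank_combination(indices):
    """Inverse of unrank_combination: subset of indices back to an integer."""
    return sum(comb(c, i) for i, c in enumerate(sorted(indices), start=1))


def encode(bits):
    """Encode a BITS-long bit string as an unordered set of oligos."""
    if len(bits) != BITS:
        raise ValueError(f"expected {BITS} bits")
    indices = unrank_combination(int(bits, 2), len(POOL), K)
    return frozenset(POOL[i] for i in indices)


def decode(oligos):
    """Recover the bit string from a set of oligos, in any order."""
    value = rank_combination(POOL.index(o) for o in oligos)
    return format(value, f"0{BITS}b")
```

For example, decode(encode("1011")) returns "1011" regardless of the order in which the three oligos are read back. A practical scheme would also need error handling for missing or corrupted oligos, which this sketch omits.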