Wang Penghao, Mu Ziniu, Sun Lijun, Si Shuqing, Wang Bin
The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, China.
Front Bioeng Biotechnol. 2022 Jul 19;10:916615. doi: 10.3389/fbioe.2022.916615. eCollection 2022.
DNA is a natural storage medium that offers higher storage density and a longer service life than traditional media, and DNA storage can meet current storage requirements for massive data. Owing to the limitations of DNA storage technology, data must be converted into short DNA sequences for storage. Indexing these short sequences, however, generates a large amount of physical redundancy. To reduce this redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with hidden addressing. The overall self-similarity of the encoded sequence indexes is first analyzed with the Dottup dot plot generator and the Jaccard similarity coefficient, and the GC content of sequence fragments is then used to verify the performance of the scheme. The results show that the indexes produced by the encoding scheme have lower overall self-similarity and that the local thermodynamic properties of the sequences are better. The proposed hidden addressing encoding scheme can not only improve base utilization but also ensure the accuracy of DNA storage during sequencing and decoding.
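The two sequence-level measures named in the abstract, the Jaccard similarity coefficient and the GC content of sequence fragments, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how sequences are turned into sets or how fragments are cut, so comparing k-mer sets, the k-mer length `k`, the non-overlapping `window` size, and all function names below are assumptions chosen for illustration.

```python
from __future__ import annotations


def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard coefficient |A & B| / |A | B| of the k-mer sets of a and b.

    Assumption: sequences are compared through their k-mer sets; the
    abstract does not specify how the sets are built.
    """
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:  # both sequences shorter than k
        return 0.0
    return len(set_a & set_b) / len(union)


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC content of consecutive non-overlapping fragments of length window.

    Assumption: fragments are non-overlapping and a trailing partial
    fragment is ignored.
    """
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    # Hypothetical index sequences, for illustration only.
    index_a = "ACGTACGTTGCA"
    index_b = "TGCATGCAACGT"
    print(jaccard(index_a, index_b, k=3))
    print(windowed_gc(index_a, window=4))
```

A lower Jaccard value between two indexes means they share fewer k-mers, and per-fragment GC values that stay close to 0.5 indicate balanced local base composition.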