Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China.
Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae363.
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by >10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
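To make the idea of coding on a De Bruijn graph under local biochemical constraints concrete, the following is a minimal sketch. It is not the Explorer algorithm itself, whose details the abstract does not give. All names (`is_admissible`, `build_graph`, `encode`, `decode`) and all parameter values (k = 4, maximum homopolymer run of 2, GC content between 25% and 75%, the motif `GGC`) are assumptions chosen for illustration. Nodes are k-mers that pass the constraint check, edges join k-mers that overlap in k-1 bases, and bits are encoded by the choice of successor at each step.

```python
from itertools import product

BASES = "ACGT"

def is_admissible(seq, max_run=3, gc_range=(0.25, 0.75), motifs=()):
    """Check one window against homopolymer, GC-content and motif constraints."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:          # homopolymer run too long
            return False
    gc = sum(c in "GC" for c in seq) / len(seq)
    if not gc_range[0] <= gc <= gc_range[1]:
        return False
    return not any(m in seq for m in motifs)

def build_graph(k, max_run=3, gc_range=(0.25, 0.75), motifs=()):
    """De Bruijn graph over admissible k-mers; every kept node has >= 2 successors."""
    nodes = {"".join(p) for p in product(BASES, repeat=k)}
    nodes = {n for n in nodes if is_admissible(n, max_run, gc_range, motifs)}
    graph = {n: [n[1:] + b for b in BASES if n[1:] + b in nodes] for n in nodes}
    while True:
        # Repeatedly drop nodes that cannot carry at least one bit per step.
        weak = {n for n, succ in graph.items() if len(succ) < 2}
        if not weak:
            return graph
        graph = {n: [s for s in succ if s not in weak]
                 for n, succ in graph.items() if n not in weak}

def encode(bits, graph, start):
    """Walk the graph; each step consumes floor(log2(out-degree)) bits."""
    node, out, i = start, [start], 0
    while i < len(bits):
        succ = graph[node]
        nb = len(succ).bit_length() - 1
        chunk = bits[i:i + nb].ljust(nb, "0")   # zero-pad the final chunk
        node = succ[int(chunk, 2)]
        out.append(node[-1])
        i += nb
    return "".join(out)

def decode(dna, graph, k, n_bits):
    """Replay the walk and read the bits back from the successor indices."""
    node, bits = dna[:k], []
    for base in dna[k:]:
        succ = graph[node]
        nb = len(succ).bit_length() - 1
        nxt = node[1:] + base
        bits.append(format(succ.index(nxt), f"0{nb}b"))
        node = nxt
    return "".join(bits)[:n_bits]

# Illustrative parameters only; none of these values come from the paper.
K = 4
graph = build_graph(K, max_run=2, gc_range=(0.25, 0.75), motifs=("GGC",))
start = min(graph)                 # fixed start k-mer shared by both sides
payload = "1011001110001"
dna = encode(payload, graph, start)
print(dna, decode(dna, graph, K, len(payload)) == payload)
```

Because every k-base window of the output is a node of the graph, any constraint that can be checked inside a single window (a run length below k, a motif of length at most k, windowed GC content) holds along the whole strand. The decoder here is a plain graph walk; a learned decoder such as Codeformer would replace that step.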