Limit and screen sequences with high degree of secondary structures in DNA storage by deep learning method.

Author information

Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China.

Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China; School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, Guizhou, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China.

Publication information

Comput Biol Med. 2023 Nov;166:107548. doi: 10.1016/j.compbiomed.2023.107548. Epub 2023 Oct 2.

DOI:10.1016/j.compbiomed.2023.107548

PMID:37801922

Abstract

BACKGROUND

In single-stranded DNA and RNA, secondary structures are very common, especially in long sequences. It is recognized that a high degree of secondary structure in DNA sequences can interfere with the correct writing and reading of information in DNA storage. However, how to circumvent these side effects has seldom been studied.

METHOD

As the degree of secondary structure of a DNA sequence is closely related to the magnitude of the free energy released during the complicated folding process, we first investigate the free-energy distribution at different encoding lengths using randomly generated DNA sequences. We then construct a bidirectional long short-term memory (BiLSTM)-attention deep learning model to predict the free energy of sequences.
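The first step of the method, sampling random sequences at each encoding length and summarizing the resulting free-energy distribution, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the abstract does not say which folding tool was used, so `energy_fn` is a pluggable placeholder, and `toy_energy_magnitude` is a purely hypothetical stand-in that counts short reverse-complement matches. It has no thermodynamic meaning and would need to be replaced by a real free-energy calculator.

```python
import random
import statistics

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def random_dna(length, rng):
    """Return a uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def toy_energy_magnitude(seq, stem=4):
    """HYPOTHETICAL stand-in for a folding free-energy calculator.

    Counts windows of `stem` bases whose reverse complement occurs
    further downstream, i.e. places where a short hairpin stem could
    form. This is NOT a thermodynamic model; swap in a real tool.
    """
    count = 0
    for i in range(len(seq) - stem + 1):
        window = seq[i:i + stem]
        rev_comp = "".join(COMPLEMENT[b] for b in reversed(window))
        # Search only after the window so the two arms do not overlap.
        if seq.find(rev_comp, i + stem) != -1:
            count += 1
    return float(count)


def energy_summary(length, n_samples, energy_fn, seed=0):
    """Sample random sequences of one length and summarize energy_fn on them."""
    rng = random.Random(seed)
    values = [energy_fn(random_dna(length, rng)) for _ in range(n_samples)]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # Population skewness: the third standardized moment.
    skew = 0.0 if sd == 0 else statistics.fmean(
        ((v - mean) / sd) ** 3 for v in values
    )
    return {"length": length, "mean": mean, "sd": sd, "skewness": skew}


if __name__ == "__main__":
    for n in (50, 100, 150, 200):
        print(energy_summary(n, n_samples=500, energy_fn=toy_energy_magnitude))
```

Keeping the scoring function as a parameter means the same sampling loop can be reused unchanged once a real free-energy calculator is available; only the values fed into the summary change.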

RESULTS

Our simulation results indicate that the free energy of DNA sequences of a given length follows a right-skewed distribution and that the mean increases with length. Given a tolerable free-energy threshold of 20 kcal/mol, we could keep the ratio of sequences with serious secondary structures within the 1% significance level by selecting a feasible encoding length of 100 nt. Compared with traditional deep learning models, the proposed model achieved better prediction performance in both the mean relative error (MRE) and the coefficient of determination (R²), reaching MRE = 0.109 and R² = 0.918 in the simulation experiment. The combination of the BiLSTM and the attention module can handle long-term dependencies and capture the features of base pairing. Furthermore, prediction has linear time complexity, which makes it suitable for detecting sequences with severe secondary structures in future large-scale applications. Finally, on a real dataset, 70 of 94 sequences could be screened out based on their predicted free energy. This demonstrates that the proposed model can screen out highly suspicious sequences that are prone to more errors and low sequencing copy numbers.
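The two reported metrics and the threshold-based screening can be illustrated with a short sketch. The abstract does not give the exact formulas, so the definitions below are the conventional ones and should be read as assumptions: MRE as the mean of |predicted - true| / |true|, and the coefficient of determination as 1 - SS_res / SS_tot. The function names and the example numbers are hypothetical; only the 20 kcal/mol threshold comes from the text, and it is applied here to the magnitude of the predicted free energy.

```python
def mean_relative_error(y_true, y_pred):
    """Mean of |pred - true| / |true|; assumes no true value is zero."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def flag_high_structure(predicted, threshold=20.0):
    """Indices of sequences whose predicted free-energy magnitude exceeds threshold."""
    return [i for i, e in enumerate(predicted) if abs(e) > threshold]


if __name__ == "__main__":
    # Made-up free-energy magnitudes in kcal/mol, for illustration only.
    true_energy = [10.0, 20.0, 40.0]
    pred_energy = [11.0, 18.0, 40.0]
    print("MRE:", mean_relative_error(true_energy, pred_energy))
    print("R^2:", r_squared(true_energy, pred_energy))
    print("flagged:", flag_high_structure(pred_energy))
```

Because each sequence is scored once by the model and compared against a fixed threshold, the screening step itself adds only a single pass over the predictions.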
