Qin Yun, Zhu Fei, Xi Bo, Song Lifu
Center for Applied Mathematics, Tianjin University, Tianjin, China.
Systems Biology Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.
Comput Struct Biotechnol J. 2024 Mar 1;23:1076-1087. doi: 10.1016/j.csbj.2024.02.019. eCollection 2024 Dec.
DNA holds immense potential as an emerging data storage medium. However, information recovery in DNA storage systems is hampered by errors that are inevitably introduced during synthesis, amplification, sequencing, and storage, including insertion, deletion, and substitution (IDS) errors, strand breaks, and rearrangements. Sequence reconstruction, a crucial step in decoding, infers the DNA reference sequence from a cluster of erroneous copies. Most methods assume that all reads within a cluster contribute equally as noisy copies of the same reference, and so they overlook contaminated sequences caused by DNA breaks, rearrangements, or mis-clustered reads. To address this issue, we propose RobuSeqNet, a multi-read reconstruction neural network designed to remain robust on noisy clusters that contain strand breakage, rearrangements, and mis-clustered strands. Leveraging the attention mechanism and an elaborate network design, RobuSeqNet is resilient to highly noisy clusters and deals effectively with in-strand IDS errors. The effectiveness and robustness of the proposed method are validated on three representative next-generation sequencing datasets. Even when clusters contain up to 20% contaminated sequences, RobuSeqNet maintains sequence reconstruction success rates of 99.74%, 99.58%, and 96.44% on the three datasets, respectively, outperforming known sequence reconstruction models. In scenarios without contaminated sequences, it performs comparably to existing models, achieving success rates of 99.88%, 99.82%, and 97.68% on the same three datasets.
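The idea of not treating every read in a cluster as an equal vote can be illustrated with a toy example. The sketch below is not RobuSeqNet and not the authors' method: it replaces the learned attention with a hand-written softmax weighting, and it assumes the reads are already the same length and aligned (no insertions or deletions). The names `weighted_consensus`, `one_hot`, and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(reads):
    """Encode equal-length reads as an array of shape (n_reads, length, 4)."""
    idx = np.array([[BASES.index(b) for b in r] for r in reads])
    return np.eye(len(BASES))[idx]

def weighted_consensus(reads, temperature=1.0):
    """Toy stand-in for attention: down-weight reads that disagree with a draft.

    Assumption: reads are pre-aligned and substitution-only; a real
    reconstruction model must also handle insertions and deletions.
    """
    x = one_hot(reads)
    # Draft consensus from an unweighted per-position majority vote.
    draft = x.sum(axis=0).argmax(axis=1)
    # Score each read by its number of mismatches against the draft.
    mismatches = (x.argmax(axis=2) != draft).sum(axis=1)
    scores = -mismatches / temperature
    # Softmax over reads, shifted by the maximum for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted per-position vote across the cluster.
    votes = (weights[:, None, None] * x).sum(axis=0)
    consensus = "".join(BASES[i] for i in votes.argmax(axis=1))
    return consensus, weights

# A cluster of five reads in which one (20%) is a contaminating sequence.
cluster = ["ACGTACGT"] * 4 + ["TTTTTTTT"]
consensus, weights = weighted_consensus(cluster)
print(consensus, weights.round(4))
```

In this toy cluster the contaminating read receives a much smaller weight than the four clean reads, so it has little influence on the final vote. A learned attention mechanism would derive such weights from the data rather than from a fixed mismatch count.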