RepairNatrix: a Snakemake workflow for processing DNA sequencing data for DNA storage.

Authors

Schwarz Peter Michael, Welzel Marius, Heider Dominik, Freisleben Bernd

Affiliation

Department of Mathematics and Computer Science, University of Marburg, Marburg 35032, Germany.

Publication information

Bioinform Adv. 2023 Aug 26;3(1):vbad117. doi: 10.1093/bioadv/vbad117. eCollection 2023.

DOI:10.1093/bioadv/vbad117

PMID:38496344

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10941317/

Abstract

MOTIVATION

There has been rapid progress in the development of error-correcting and constrained codes for DNA storage systems in recent years. However, improving the steps for processing raw sequencing data for DNA storage has a lot of untapped potential for further progress. In particular, constraints can be used as prior information to improve the processing of DNA sequencing data. Furthermore, a workflow tailored to DNA storage codes enables fair comparisons between different approaches while leading to reproducible results.

RESULTS

We present RepairNatrix, a read-processing workflow for DNA storage. RepairNatrix supports preprocessing of raw sequencing data for DNA storage applications and can be used to flag and heuristically repair constraint-violating sequences to further increase the recoverability of encoded data in the presence of errors. Compared to a preprocessing strategy without repair functionality, RepairNatrix reduced the number of raw reads required for the successful, error-free decoding of the input files by a factor of 25-35 across different datasets.

AVAILABILITY AND IMPLEMENTATION

RepairNatrix is available on GitHub: https://github.com/umr-ds/repairnatrix.
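
ILLUSTRATIVE SKETCH

The Results describe flagging and heuristically repairing constraint-violating sequences. The abstract does not specify which constraints or repair heuristics RepairNatrix applies, so the following Python sketch is only an illustration of the general idea and not the tool's implementation. It assumes two constraints common in DNA storage codes (a maximum homopolymer run length and a GC-content window), and the thresholds, function names, and the single-substitution search are all hypothetical choices made for this example.

```python
from itertools import groupby

BASES = "ACGT"


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in the sequence (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def violates(seq: str, max_run: int = 3, gc_min: float = 0.4, gc_max: float = 0.6) -> bool:
    """Flag a read that breaks either assumed constraint.

    The thresholds are illustrative defaults, not values taken from the paper.
    """
    too_long_run = max_homopolymer(seq) > max_run
    gc_out_of_range = not (gc_min <= gc_content(seq) <= gc_max)
    return too_long_run or gc_out_of_range


def repair(seq: str, **limits):
    """Hypothetical heuristic: try every single-base substitution in order and
    return the first candidate that satisfies the constraints.

    Returns the read unchanged if it is already valid, or None if no
    single substitution yields a valid sequence.
    """
    if not violates(seq, **limits):
        return seq
    for i, old in enumerate(seq):
        for new in BASES:
            if new == old:
                continue
            candidate = seq[:i] + new + seq[i + 1:]
            if not violates(candidate, **limits):
                return candidate
    return None


if __name__ == "__main__":
    read = "ACGTAAAACGTC"  # contains a homopolymer run of four A's
    print(violates(read))  # flagged as constraint-violating
    print(repair(read))    # a candidate with the run broken up, if one exists
```

This sketch considers substitutions only. Sequencing reads can also contain insertions and deletions, and a candidate that satisfies the constraints is not guaranteed to be the originally encoded sequence, so in practice the result would still be passed to the error-correcting decoder.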
