Park Jiyeon, Jeon Ha Hyeon, Lee Jeong Wook, Park Hosung
Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, South Korea.
Department of Chemical Engineering, POSTECH, Pohang 37673, South Korea.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf335.
Error detection/correction codes play an important role in reducing the writing and/or reading costs of DNA data storage. Sequence analysis algorithms also have a crucial effect on error correction, but they have been executed independently of the decoding of error correction codes. In conventional sequence analysis, low-quality reads are usually discarded. In DNA data storage, however, low-quality reads can be used constructively in sequence analysis with the assistance of error detection/correction codes.
We obtained low-quality reads that failed the chastity filter in Illumina next-generation sequencing (NGS). We confirmed that these extra low-quality reads are useful by reporting their error statistics and by performing decoding with them. To exploit the extra reads efficiently, we proposed a sequence clustering algorithm for reads of varying length and a consensus algorithm based on probabilistic majority and error detection. The proposed methods reduced the reading cost by 6.83% on average and by up to 19.67% while maintaining the writing cost.
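The idea of combining a probabilistic majority with error detection can be illustrated with a minimal sketch. This is not the authors' algorithm; their implementation is in the repository. The sketch assumes that the reads in a cluster are already aligned, have equal length, and contain only A, C, G and T. Each base votes with a log-odds weight derived from its Phred quality, and a candidate is accepted only if it passes an error-detection check. The names `position_scores`, `parity_ok` and `consensus_with_detection`, and the toy parity rule itself, are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10)

def position_scores(reads, i):
    """Sum quality-derived log-odds votes for each base at position i."""
    scores = dict.fromkeys(BASES, 0.0)
    for seq, quals in reads:
        # Clamp so that very low-quality bases contribute (almost) nothing
        p = min(max(phred_to_error(quals[i]), 1e-6), 0.75)
        scores[seq[i]] += math.log((1 - p) / (p / 3))
    return scores

def parity_ok(seq):
    """Toy detection rule (an assumption): last base = sum of the others mod 4."""
    total = sum(BASES.index(b) for b in seq[:-1])
    return seq[-1] == BASES[total % 4]

def consensus_with_detection(reads, check=parity_ok):
    """Weighted majority vote, retried at uncertain positions if the check fails."""
    length = len(reads[0][0])
    best, runner_up, margin = [], [], []
    for i in range(length):
        scores = position_scores(reads, i)
        order = sorted(BASES, key=lambda b: -scores[b])
        best.append(order[0])
        runner_up.append(order[1])
        margin.append(scores[order[0]] - scores[order[1]])
    candidate = "".join(best)
    if check(candidate):
        return candidate
    # Detection failed: swap in the runner-up base, least certain position first
    for i in sorted(range(length), key=lambda k: margin[k]):
        trial = best.copy()
        trial[i] = runner_up[i]
        if check("".join(trial)):
            return "".join(trial)
    return None  # no single substitution satisfies the check

# Two reads that disagree at the third base; the slightly better quality is wrong
reads = [("ACTTG", [30, 30, 12, 30, 30]), ("ACGTG", [30, 30, 10, 30, 30])]
print(consensus_with_detection(reads))
```

In this example the plain weighted vote yields `ACTTG`, which fails the toy parity rule, so the sketch substitutes the runner-up base at the least certain position and returns `ACGTG`. A real system would use the error detection/correction code of the storage scheme in place of `parity_ok`.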
https://github.com/PParkJy/SAD-DNAstorage (10.5281/zenodo.15571858).