Jeong Jaeho, Park Seong-Joon, Kim Jae-Won, No Jong-Seon, Jeon Ha Hyeon, Lee Jeong Wook, No Albert, Kim Sunghwan, Park Hosung
Department of Electrical and Computer Engineering, Seoul National University, Institute of New Media and Communications (INMC), Seoul 08826, South Korea.
Department of Electronic Engineering, Gyeongsang National University, Engineering Research Institute, Jinju 52828, South Korea.
Bioinformatics. 2021 Oct 11;37(19):3136-3143. doi: 10.1093/bioinformatics/btab246.
In DNA storage systems, there is a tradeoff between writing cost and reading cost. Increasing the code rate of the error-correcting code may reduce the writing cost, but more sequence reads are then needed for data retrieval. It may be possible to improve the sequencing and decoding processes so that the reading cost induced by this tradeoff is reduced without increasing the writing cost. Previous studies treated clustering, alignment and decoding as separate stages, but we believe that using the information from all of these processes together may improve decoding performance. Actual DNA synthesis and sequencing experiments should be performed, because simulations cannot be relied on to cover every error that can occur in practice.
For DNA storage systems using a fountain code and a Reed-Solomon (RS) code, we introduce several techniques to improve decoding performance. We designed the decoding process around the cooperation of its key components: Hamming-distance based clustering, discarding of abnormal sequence reads, RS error correction as well as error detection, and quality score-based ordering of sequences. We synthesized 513.6 KB of data into DNA oligo pools and successfully sequenced it with an Illumina MiSeq instrument. Compared to Erlich's research, the proposed decoding method additionally incorporates sequence reads with minor errors, which had previously been discarded. It was therefore able to use 10.6-11.9% more sequence reads from the same sequencing environment, which resulted in a 6.5-8.9% reduction in the reading cost. Channel characteristics, including sequence coverage and read-length distributions, are provided as well.
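To illustrate two of the components named above, Hamming-distance based clustering and discarding of abnormal sequence reads, here is a minimal Python sketch. It is not the authors' implementation. The function names, the greedy choice of the first read as each cluster's representative, the use of read length as the only "abnormal" criterion, and the values `expected_len=8` and `max_dist=1` are all assumptions made for the example.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads: List[str], expected_len: int, max_dist: int) -> List[List[str]]:
    """Greedy clustering (illustrative assumption, not the paper's algorithm).

    A read joins the first cluster whose representative (its first member)
    is within max_dist mismatches; otherwise it starts a new cluster.
    Reads whose length differs from expected_len are discarded.
    """
    clusters: List[List[str]] = []
    for read in reads:
        if len(read) != expected_len:
            continue  # discard abnormal-length read (assumed criterion)
        for cluster in clusters:
            if hamming(read, cluster[0]) <= max_dist:
                cluster.append(read)
                break
        else:
            # no existing cluster was close enough
            clusters.append([read])
    return clusters


# Toy reads: two near-identical pairs plus one truncated read.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTAC", "TTTTGGGC"]
clusters = cluster_reads(reads, expected_len=8, max_dist=1)
print(clusters)
```

In this toy input the truncated read `ACGTAC` is dropped, and the remaining four reads fall into two clusters. A read with a single substitution is kept in its cluster rather than thrown away, which is the kind of "minor error" read that the abstract says the proposed method makes use of.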
The raw data files and the source codes of our experiments are available at: https://github.com/jhjeong0702/dna-storage.