Low-complexity and highly robust barcodes for error-rich single molecular sequencing.

Author information

Chen Weigang, Wang Panpan, Wang Lixia, Zhang Dalu, Han Mingzhe, Han Mingyong, Song Lifu

Affiliations

School of Microelectronics, Tianjin University, Tianjin, 300072 People's Republic of China.

Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, 300072 People's Republic of China.

Publication information

3 Biotech. 2021 Feb;11(2):78. doi: 10.1007/s13205-020-02607-5. Epub 2021 Jan 16.

DOI:10.1007/s13205-020-02607-5

PMID:33505833

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7811498/

Abstract

DNA barcodes are frequently corrupted by insertion, deletion, and substitution errors during DNA synthesis, amplification, and sequencing, resulting in index hopping. In this paper, we propose a new DNA barcode construction scheme that combines a cyclic block code with a predetermined pseudo-random sequence bit by bit to form bit pairs, and then converts the bit pairs to bases to obtain the DNA barcodes. We then present a barcode identification scheme for noisy sequencing reads, which uses a combination of cyclic shifting and traditional dynamic programming to mark the insertion and deletion positions, and then performs erasure-and-error-correction decoding on the corrupted codewords. Furthermore, we verify the identification error rate of the barcodes under multiple errors and evaluate the reliability of the barcodes in DNA context. The method can be easily generalized to construct long barcodes, which may be used in scenarios with severe errors. Simulation results show that the bit error rate after identifying insertions/deletions is greatly reduced when cyclic shifting is combined with dynamic programming, compared with using dynamic programming alone. This indicates that the proposed method can effectively improve the accuracy of estimating insertion/deletion errors. The overall identification error rate of the proposed method is lower than when the probability of each base mutation is less than 0.1, which is the typical scenario in third-generation sequencing.
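The construction step described in the abstract (cyclic codeword plus pseudo-random sequence, paired bit by bit, then mapped to bases) can be illustrated with a minimal Python sketch. The abstract does not give the code parameters, the sequence generator, or the bit-pair-to-base mapping, so everything concrete here is an assumption chosen for illustration: a (7,4) cyclic code with generator polynomial g(x) = 1 + x + x^3, a seeded `random.Random` as the pseudo-random source, and the mapping 00→A, 01→C, 10→G, 11→T. The function names are hypothetical and are not taken from the paper.

```python
import random

# ASSUMPTION: generator polynomial g(x) = 1 + x + x^3 of a (7,4) cyclic code,
# coefficients listed from lowest to highest degree. The paper's code may differ.
G_POLY = [1, 1, 0, 1]

# ASSUMPTION: illustrative mapping of (code bit, pseudo-random bit) to a base.
PAIR_TO_BASE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}


def cyclic_encode(message):
    """Non-systematic cyclic encoding: c(x) = m(x) * g(x) over GF(2)."""
    codeword = [0] * (len(message) + len(G_POLY) - 1)
    for i, m_bit in enumerate(message):
        if m_bit:
            for j, g_bit in enumerate(G_POLY):
                codeword[i + j] ^= g_bit
    return codeword


def pseudo_random_bits(n, seed=2021):
    """Predetermined bit sequence; the seed is an arbitrary illustrative choice."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def make_barcode(message, seed=2021):
    """Pair each codeword bit with a pseudo-random bit and map pairs to bases."""
    codeword = cyclic_encode(message)
    prn = pseudo_random_bits(len(codeword), seed)
    return "".join(PAIR_TO_BASE[pair] for pair in zip(codeword, prn))


def split_barcode(barcode):
    """Recover the codeword bits and the pseudo-random bits from an error-free barcode."""
    pairs = [BASE_TO_PAIR[base] for base in barcode]
    return [c for c, _ in pairs], [p for _, p in pairs]


barcode = make_barcode([1, 0, 1, 1])
code_bits, prn_bits = split_barcode(barcode)
```

Because the pseudo-random sequence is fixed in advance, a receiver can regenerate it with `pseudo_random_bits` and compare it against the second bit of each base. How the paper uses that reference together with cyclic shifting and dynamic programming to locate insertions and deletions is not detailed in the abstract, so this sketch stops at construction.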
