Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712.
Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712.
Proc Natl Acad Sci U S A. 2018 Jul 3;115(27):E6217-E6226. doi: 10.1073/pnas.1802640115. Epub 2018 Jun 20.
Many large-scale, high-throughput experiments use DNA barcodes, short DNA sequences prepended to DNA libraries, for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely used error-correcting codes borrowed from computer science (e.g., Hamming, Levenshtein codes) do not properly account for insertions and deletions (indels) in DNA barcodes, even though deletions are the most common type of synthesis error. Here, we present and experimentally validate filled/truncated right end edit (FREE) barcodes, which correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced guanine-cytosine (GC) content, minimal homopolymer runs, and reduced internal hairpin propensity. We generate and include lists of barcodes with different lengths and error correction levels that may be useful in diverse high-throughput applications, including >10^6 single-error-correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space combinatorially, generating lists with >10^9 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
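The experimental design constraints named in the abstract (balanced GC content and minimal homopolymer runs) can be illustrated with a minimal sketch. This is not the authors' software: the function names, the 40-60% GC window, and the maximum run length of 2 are assumptions chosen for illustration, and the hairpin-propensity check and the FREE error-correction step are not covered.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (assumes a non-empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (assumes non-empty)."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))


def passes_filters(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                   max_run: int = 2) -> bool:
    """Hypothetical screen: thresholds are illustrative, not from the abstract."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


# Made-up 16-mer candidates, for demonstration only.
candidates = [
    "ACGTACGTACGTACGT",  # 50% GC, no repeated bases
    "AAAAACGTACGTACGT",  # homopolymer run of five A's
    "GCGCGCGCGCGCGCGC",  # 100% GC
]
kept = [s for s in candidates if passes_filters(s)]
print(kept)
```

A screen like this would only narrow the candidate pool; selecting a subset whose members remain distinguishable after substitutions and indels is a separate step.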