Increased yields of duplex sequencing data by a series of quality control tools.

Authors

Povysil Gundula, Heinzl Monika, Salazar Renato, Stoler Nicholas, Nekrutenko Anton, Tiemann-Boege Irene

Affiliations

Institute of Biophysics, Johannes Kepler University, 4020 Linz, Austria.

Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.

Publication information

NAR Genom Bioinform. 2021 Feb 9;3(1):lqab002. doi: 10.1093/nargab/lqab002. eCollection 2021 Mar.

DOI:10.1093/nargab/lqab002

PMID:33575654

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7872198/

Abstract

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants. It groups sequence reads derived from the same DNA molecule into families that carry information on both the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and to implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call using a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
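To illustrate how reads end up discarded when they lack a family, the following is a minimal sketch of tag-based family grouping and duplex consensus building. It is not the authors' pipeline: the function names `consensus` and `duplex_consensus`, the read representation as `(tag, strand, sequence)` tuples, the strand labels `"ab"` and `"ba"`, the `min_reads=3` threshold, and the use of simple majority voting on pre-aligned, equal-length reads are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_reads=3):
    """Majority-vote consensus of equal-length reads.

    Returns None when the family has fewer than min_reads reads
    (min_reads=3 is an illustrative assumption, not a value from the paper).
    """
    if len(seqs) < min_reads:
        return None
    # Take the most frequent base in each column of the read stack.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def duplex_consensus(reads, min_reads=3):
    """Group reads by tag and strand, then build duplex consensus sequences.

    reads: iterable of (tag, strand, sequence), strand being "ab" or "ba"
    (a hypothetical, simplified read model).
    Returns (dcs_by_tag, number_of_discarded_reads).
    """
    families = defaultdict(lambda: {"ab": [], "ba": []})
    for tag, strand, seq in reads:
        families[tag][strand].append(seq)

    dcs = {}
    discarded = 0
    for tag, strands in families.items():
        ab = consensus(strands["ab"], min_reads)
        ba = consensus(strands["ba"], min_reads)
        if ab is None or ba is None:
            # One or both strands lack a large enough family: reads are lost.
            discarded += len(strands["ab"]) + len(strands["ba"])
            continue
        # Keep a base only where both single-strand consensuses agree.
        dcs[tag] = "".join(a if a == b else "N" for a, b in zip(ab, ba))
    return dcs, discarded


reads = (
    [("AAGT", "ab", "ACGT")] * 3
    + [("AAGT", "ba", "ACGT"), ("AAGT", "ba", "ACGT"), ("AAGT", "ba", "ACTT")]
    + [("AAGA", "ab", "ACGT")]  # tag with a single-base error: forms its own family
)
dcs, discarded = duplex_consensus(reads)
print(dcs, discarded)
```

In this toy input, the read whose tag differs by one base from `AAGT` cannot join that family and is dropped, which mirrors the kind of data loss from tag errors that the abstract describes.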
