School of Computing Science, Faculty of Applied Sciences, Simon Fraser University, Burnaby BC, Canada.
Vancouver Prostate Centre, Vancouver BC, Canada.
Bioinformatics. 2019 Jun 1;35(11):1829-1836. doi: 10.1093/bioinformatics/bty888.
Next-Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, e.g. for detecting low allele frequency variations from circulating tumor DNA. Barcode tagging of DNA molecules with unique molecular identifiers (UMIs) attempts to mitigate sequencing errors: UMI-tagged molecules are amplified by polymerase chain reaction (PCR), and the PCR copies of each UMI-tagged molecule are sequenced independently. However, the PCR and sequencing steps can themselves introduce errors into the sequenced reads, in the barcode, in the DNA sequence, or in both. Analyzing UMI-tagged sequencing data therefore requires an initial clustering step that groups reads sequenced from PCR duplicates of the same UMI-tagged molecule into a single cluster, and the size of current datasets requires this clustering process to be resource-efficient.
We introduce Calib, a computational tool that clusters paired-end reads from UMI-tagged sequencing experiments generated by substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as the connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality-sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining a reasonable runtime and memory footprint. On a real dataset, Calib runs with far fewer resources than alignment-based methods, and its clusters reduce the number of tentative false positives in downstream variation calling.
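The clustering idea described above can be illustrated with a small sketch. This is not Calib's implementation: it handles single-end rather than paired-end reads, compares all pairs of reads instead of using locality-sensitive hashing to find candidates, and every name and threshold in it (`minhash_signature`, `cluster_reads`, `max_barcode_dist`, `min_shared`, the k-mer size) is a hypothetical choice made for illustration. It only shows the general shape of the approach: connect two reads when their barcodes are close and their MinHash signatures agree, then report connected components as clusters.

```python
import hashlib
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def minhash_signature(seq, k=4, num_hashes=8):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [
        min(
            hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
            for kmer in kmers
        )
        for seed in range(num_hashes)
    ]


def cluster_reads(reads, max_barcode_dist=1, min_shared=4):
    """Cluster (barcode, sequence) reads as connected components.

    Two reads share an edge when their barcodes differ in at most
    `max_barcode_dist` positions AND at least `min_shared` MinHash
    values agree. Both thresholds are illustrative assumptions.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    signatures = [minhash_signature(seq) for _, seq in reads]

    # All-pairs comparison: quadratic, acceptable only for a toy example.
    for i, j in combinations(range(len(reads)), 2):
        if hamming(reads[i][0], reads[j][0]) > max_barcode_dist:
            continue
        shared = sum(a == b for a, b in zip(signatures[i], signatures[j]))
        if shared >= min_shared:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    molecule = "ACGTTGCAAGCTTAGGCTAACGTAGC"
    reads = [
        ("ACGTAC", molecule),   # original barcode
        ("ACGTAA", molecule),   # same molecule, one barcode substitution
        ("TTTTTT", molecule),   # same sequence, unrelated barcode
        ("ACGTAC", "T" * 26),   # same barcode, unrelated sequence
    ]
    print(cluster_reads(reads))
```

In this toy input, the first two reads are grouped together because they pass both the barcode test and the sequence test, while the third and fourth reads each fail one of the two and remain singletons. Requiring both conditions is what separates this from clustering on barcodes alone.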
Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available at https://github.com/vpc-ccg/calib.
Supplementary data are available at Bioinformatics online.