Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA.
Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
Genome Biol. 2020 Mar 17;21(1):71. doi: 10.1186/s13059-020-01988-3.
Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods.
In terms of accuracy, we find that method performance varies substantially across different types of datasets, with no single method performing best on all types of examined data. Finally, we identify the techniques that offer a good balance between precision and sensitivity.
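As an illustration of the precision and sensitivity trade-off mentioned above, the following minimal sketch scores a corrected read base by base against a known error-free sequence. It is an assumption-laden example, not the evaluation procedure used in the paper: the function names (`classify_bases`, `precision`, `sensitivity`), the restriction to equal-length reads with substitution errors only, and the exact outcome categories are hypothetical choices made for clarity.

```python
def classify_bases(true_read, raw_read, corrected_read):
    """Count base-level outcomes for one read.

    Assumes all three strings have equal length and that errors are
    substitutions only (no insertions or deletions).
    Returns (tp, fp, fn, tn).
    """
    tp = fp = fn = tn = 0
    for t, r, c in zip(true_read, raw_read, corrected_read):
        if r != t:
            # The raw base was a sequencing error.
            if c == t:
                tp += 1  # error was fixed
            else:
                fn += 1  # error was left in place or changed to another wrong base
        else:
            # The raw base was already correct.
            if c == t:
                tn += 1  # correct base left untouched
            else:
                fp += 1  # correct base was wrongly altered
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of corrections-that-mattered which were right: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of true errors that were fixed: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy example: two sequencing errors in the raw read (positions 3 and 7).
# The hypothetical corrector fixes one, misses one, and damages one correct base.
true_read = "ACGTACGT"
raw_read = "ACGAACGA"
corrected_read = "ACGTTCGA"

tp, fp, fn, tn = classify_bases(true_read, raw_read, corrected_read)
print(tp, fp, fn, tn)
print(precision(tp, fp), sensitivity(tp, fn))
```

Under these assumptions, a corrector that changes many bases may fix more errors (higher sensitivity) while also altering more correct bases (lower precision), which is why a balance between the two measures is of interest.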