WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran.
Department of Electrical, Computer and Biomedical Engineering, Ryerson University, Toronto, Canada.
Int J Biostat. 2023 May 8;19(2):439-453. doi: 10.1515/ijb-2021-0091. eCollection 2023 Nov 1.
Third-generation sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate longer reads than next-generation sequencing, which makes the assembly process faster, simpler and more cost-effective. However, the error rates of these long reads are higher than those of short reads, so an error-correction step is needed before assembly, for example the use of Circular Consensus Sequencing (CCS) reads on PacBio sequencing machines. In this paper, we propose a probabilistic model for the occurrence of errors along CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads as a function of the pass number. This distribution is binomial and can be approximated by a normal distribution for long reads. Finally, we evaluate the proposed model by comparing it with three real PacBio datasets, namely, Lambda, and genomes, and an Alzheimer's disease targeted experiment.
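The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the model from the paper, whose equations are not reproduced here. It assumes a simple majority vote over independent passes, each with the same per-base error probability, and treats every wrong call as the same outcome. The function names (`phred`, `majority_vote_error`, `error_count_normal_approx`) and the example parameters are hypothetical. Only the Phred definition, Q = -10 log10(p), and the normal approximation to a binomial are standard.

```python
import math


def phred(p_error):
    """Standard Phred quality score for a base-calling error probability."""
    return -10.0 * math.log10(p_error)


def majority_vote_error(p_sub, n_passes):
    """Probability that more than half of n_passes sub-reads are wrong at a position.

    ASSUMPTION (not from the paper): passes err independently with the same
    probability p_sub, and the consensus base is a simple majority vote.
    With an even n_passes, a tie is counted as a correct call.
    """
    k_min = n_passes // 2 + 1
    return sum(
        math.comb(n_passes, k) * p_sub**k * (1.0 - p_sub) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )


def error_count_normal_approx(read_length, p_base):
    """Mean and standard deviation of the normal approximation to the
    Binomial(read_length, p_base) number of erroneous bases in a read."""
    mean = read_length * p_base
    std = math.sqrt(read_length * p_base * (1.0 - p_base))
    return mean, std


if __name__ == "__main__":
    p_sub = 0.1  # hypothetical per-pass error probability
    for n in (1, 3, 5, 7):
        p_ccs = majority_vote_error(p_sub, n)
        print(n, p_ccs, phred(p_ccs))
    print(error_count_normal_approx(10_000, 0.01))
```

Under these assumptions the consensus error probability falls, and the Phred score rises, as the number of passes grows. The per-read error count concentrates around `read_length * p_base`, which is why a normal approximation becomes reasonable for long reads.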