Zeng Jingwen, Cai Hongmin, Peng Hong, Wang Haiyan, Zhang Yue, Akutsu Tatsuya
School of Computer Science and Engineering, South China University of Technology, Guangzhou, China.
School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China.
Front Genet. 2020 Jan 20;10:1332. doi: 10.3389/fgene.2019.01332. eCollection 2019.
Nanopore sequencing is promising because of its long read length and high speed. During sequencing, a strand of DNA/RNA passes through a biological nanopore, which causes the current in the pore to fluctuate. During basecalling, context-dependent current measurements are translated into the base sequence of the DNA/RNA strand. Accurate and fast basecalling is vital for downstream analyses such as genome assembly and the detection of single-nucleotide polymorphisms and genomic structural variants. However, owing to the various changes in DNA/RNA molecules, noise during sequencing, and limitations of basecalling methods, accurate basecalling remains a challenge. In this paper, we propose Causalcall, which uses an end-to-end temporal convolution-based deep learning model for accurate and fast nanopore basecalling. Built on a temporal convolutional network (TCN) and a connectionist temporal classification decoder, Causalcall directly identifies base sequences of varying lengths from current measurements in long time series. Unlike basecalling models built on recurrent neural networks (RNNs), which process a signal one step at a time, the convolution-based model of Causalcall can speed up basecalling through matrix computation. Experiments on multiple species demonstrate the potential of the TCN-based model to improve basecalling accuracy and speed compared with an RNN-based model. In addition, experiments on genome assembly indicate the utility of Causalcall in reference-based genome assembly.
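The two building blocks named in the abstract, a causal (temporal) convolution and a connectionist temporal classification (CTC) decoder, can be illustrated with a minimal NumPy sketch. This is not the Causalcall implementation: the function names, the four-letter alphabet with a blank at index 4, and the use of greedy (best-path) decoding are all assumptions made for illustration, and the abstract does not state which decoding strategy or layer configuration the authors use.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed class order for indices 0..3
BLANK = 4          # assumed index of the CTC blank symbol


def causal_dilated_conv1d(x, w, dilation=1):
    """Causal dilated convolution: y[t] = sum_k w[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so the
    output at time t never depends on inputs later than t.
    """
    x = np.asarray(x, dtype=float)
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left padding only
    y = np.zeros(len(x))
    for k, wk in enumerate(w):
        start = pad - k * dilation
        # One vectorized update covers every time step at once.
        y += wk * xp[start:start + len(x)]
    return y


def ctc_greedy_decode(probs):
    """Greedy CTC decoding of a (time, classes) probability matrix.

    Take the most likely class per time step, merge consecutive repeats,
    then drop blanks.
    """
    best = np.argmax(probs, axis=1)
    bases = []
    prev = None
    for s in best:
        if s != prev and s != BLANK:
            bases.append(ALPHABET[s])
        prev = s
    return "".join(bases)


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    print(causal_dilated_conv1d(signal, [1.0, 1.0], dilation=2))

    # One-hot "probabilities" for the path A A _ A C C _ T (_ is blank).
    path = [0, 0, 4, 0, 1, 1, 4, 3]
    print(ctc_greedy_decode(np.eye(5)[path]))
```

In this sketch each filter tap updates all time steps in a single array operation, which is the kind of computation that lets a convolutional model avoid the step-by-step dependency of an RNN. The blank symbol is what allows the decoder to emit two identical bases in a row: the repeated `A` labels around the blank decode to `AA` rather than collapsing to a single `A`.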