Zhu Zhongxu
Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, 310022, China.
GigaByte. 2025 Feb 14;2025:gigabyte148. doi: 10.46471/gigabyte.148. eCollection 2025.
Nanopore sequencing, a third-generation sequencing technique, enables direct RNA sequencing, real-time analysis, and long read lengths. Nanopore sequencers measure changes in electrical current as nucleotides pass through nanopores, and a basecaller infers the base sequence from these raw current measurements. However, accurate basecalling remains challenging because of molecular variation and sequencing noise. Here, we introduce SqueezeCall, a novel Squeezeformer-based model for accurate nanopore basecalling. SqueezeCall uses convolution layers to down-sample raw signals and model local dependencies, a Squeezeformer network to capture global context, and a connectionist temporal classification (CTC) decoder with beam search to generate DNA sequences. Experimental results demonstrated that SqueezeCall is robust to noise, which improves basecalling accuracy. We trained SqueezeCall with a combination of three loss types and found that all three contribute to basecalling accuracy. Experiments across multiple species demonstrated the potential of a Squeezeformer-based model to improve basecalling accuracy and its superiority over recurrent neural network-based and Transformer-based models.
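The CTC decoding step mentioned above can be illustrated with a minimal sketch of generic CTC prefix beam search in plain Python. This is not SqueezeCall's implementation: the label order (blank at index 0, followed by A, C, G, T), the beam width, and the helper `peaked_frame` are assumptions made for illustration, and a real basecaller would feed in per-frame log-probabilities produced by the network rather than hand-built frames.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed label order; index 0 is reserved for the CTC blank
BLANK = 0


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_beam_search(log_probs, beam_width=5):
    """Decode a list of frames, each holding 5 log-probabilities
    (blank, A, C, G, T), into the most probable base sequence."""
    # Each prefix tracks (log p of paths ending in blank,
    #                     log p of paths ending in a non-blank label).
    beams = {(): (0.0, -math.inf)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for s, lp in enumerate(frame):
                if s == BLANK:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                new_prefix = prefix + (s,)
                b, nb = nxt[new_prefix]
                if prefix and prefix[-1] == s:
                    # A repeated label extends the prefix only after a blank...
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it collapses into the existing prefix.
                    sb, snb = nxt[prefix]
                    nxt[prefix] = (sb, logsumexp(snb, p_nb + lp))
                else:
                    nxt[new_prefix] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Prune to the beam_width most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(ALPHABET[i - 1] for i in best)


def peaked_frame(label, peak=0.96):
    """Hypothetical helper: a frame that puts most probability on one label."""
    rest = (1.0 - peak) / 4
    return [math.log(peak if i == label else rest) for i in range(5)]


if __name__ == "__main__":
    # Frames favouring: A, A, blank, A, C
    frames = [peaked_frame(i) for i in (1, 1, 0, 1, 2)]
    print(ctc_beam_search(frames))
```

In this toy input the two leading A frames collapse into a single base, while the blank between the second and third A frames keeps them distinct, which is the behaviour that lets a CTC decoder represent homopolymer runs.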