Li Qingwen, Sun Chen, Wang Daqian, Lou Jizhong
Key Laboratory of Epigenetic Regulation and Intervention, Center for Excellence in Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Comput Struct Biotechnol J. 2024 Sep 25;23:3430-3444. doi: 10.1016/j.csbj.2024.09.016. eCollection 2024 Dec.
Nanopore sequencing provides a rapid, convenient, and high-throughput solution for nucleic acid sequencing, and accurate basecalling is crucial for downstream analysis. Traditional approaches such as Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs) have improved basecalling accuracy, but there is a continuing need for higher accuracy and reliability. In this study, we introduce BaseNet (https://github.com/liqingwen98/BaseNet), an open-source toolkit that uses transformer models for signal decoding in nanopore sequencing. BaseNet incorporates both autoregressive and non-autoregressive transformer-based decoding mechanisms and makes these state-of-the-art algorithms freely accessible for future improvement. Our results indicate that cross-attention weights effectively map the relationship between current signals and base sequences, that joint-loss training with an added pair of forward and reverse decoders facilitates model convergence, and that large-scale pre-trained models achieve superior decoding accuracy. This study advances nanopore sequencing signal decoding and provides novel concepts and tools for researchers and practitioners.
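The abstract's observation that cross-attention weights map current signals to base sequences can be illustrated with a minimal sketch. It is not BaseNet's implementation: the function names (`cross_attention_weights`, `align_bases_to_frames`), the single attention head, and the toy vectors are all assumptions made for illustration. Each decoder query stands for one decoded base, each encoder key stands for one encoded signal frame, and the frame with the largest weight is taken as that base's position in the signal.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(queries, keys):
    """Scaled dot-product attention weights (hypothetical helper).

    queries: one vector per decoded base (decoder side).
    keys:    one vector per encoded signal frame (encoder side).
    Returns a matrix with one row per base and one column per frame;
    each row sums to 1.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


def align_bases_to_frames(weights):
    """For each base, pick the signal frame with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in weights]


# Toy data (assumed, not from the paper): 3 signal frames, 2 decoded bases.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]

W = cross_attention_weights(queries, keys)
alignment = align_bases_to_frames(W)
print(alignment)
```

In a trained transformer the weights would come from learned projections and usually several heads, so a real alignment would need to choose or average heads; the argmax rule here is only the simplest way to read a signal-to-base mapping off the weight matrix.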