Beslic Denis, Kucklick Martin, Engelmann Susanne, Fuchs Stephan, Renard Bernhard Y, Körber Nils
Centre for Artificial Intelligence in Public Health Research, Robert Koch Institute, Berlin 13353, Germany.
Institute for Microbiology, Technical University of Braunschweig, Braunschweig 38106, Germany.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae744.
Nanopore sequencing represents a significant advancement in genomics, enabling direct long-read DNA sequencing at the single-molecule level. Accurate simulation of nanopore sequencing signals from nucleotide sequences is crucial for method development and for complementing experimental data. Most existing approaches rely on predefined statistical models, which may not adequately capture the properties of experimental signal data. Furthermore, these simulators were developed for earlier versions of nanopore chemistry, which limits their applicability and adaptability to the latest flow cell data.
To enhance the quality of artificial signals, we introduce seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. Unlike existing simulators that rely on static k-mer models, our approach learns sequential contextual information from segmented signal data. We benchmark seq2squiggle against state-of-the-art simulators on real experimental R9.4.1 and R10.4.1 data, evaluating signal similarity, basecalling accuracy, and variant detection rates. seq2squiggle consistently outperforms existing tools across multiple datasets, producing signals that are more similar to real data and offering a robust solution for simulating nanopore sequencing signals with the latest flow cell generation.
seq2squiggle is freely available on GitHub at: github.com/ZKI-PH-ImageAnalysis/seq2squiggle.
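As an illustration of the "static k-mer models" that the abstract contrasts with a learned approach, the following is a minimal sketch of a lookup-table simulator. It is not seq2squiggle's code and not the implementation of any existing tool: the function names, the 3-mer size, the fixed dwell time, and the randomly generated current levels are all assumptions made for this example.

```python
import random
from itertools import product


def build_toy_kmer_table(k=3, seed=0):
    """Build a hypothetical pore model: k-mer -> (mean, stdev) current level.

    The values are random placeholders, not measurements from a real pore.
    """
    rng = random.Random(seed)
    return {
        "".join(kmer): (rng.uniform(60.0, 120.0), rng.uniform(1.0, 3.0))
        for kmer in product("ACGT", repeat=k)
    }


def simulate_signal(seq, table, k=3, dwell=10, seed=0):
    """Emit `dwell` Gaussian samples per k-mer while sliding over `seq`.

    Each k-mer always maps to the same distribution, regardless of its
    neighbours; that is the "static" assumption.
    """
    rng = random.Random(seed)
    signal = []
    for i in range(len(seq) - k + 1):
        mean, stdev = table[seq[i:i + k]]
        signal.extend(rng.gauss(mean, stdev) for _ in range(dwell))
    return signal


table = build_toy_kmer_table()
signal = simulate_signal("ACGTAC", table)
print(len(signal))  # 4 k-mers x 10 samples each
```

In this sketch both the per-k-mer noise distribution and the number of samples per k-mer are fixed in advance. A model that learns from segmented signal data, as the abstract describes, can instead let these quantities depend on the surrounding sequence context.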