Health Intelligence Center.
Human Genome Center.
Bioinformatics. 2021 Jun 9;37(9):1211-1217. doi: 10.1093/bioinformatics/btaa953.
In recent years, nanopore sequencing technology has enabled inexpensive long-read sequencing, which promises reads longer than a few thousand bases. Such long reads contribute to the precise detection of structural variations and to accurate haplotype phasing. However, although various basecallers have been introduced to date, deciphering precise DNA sequences from noisy and complicated raw nanopore signals remains a crucial requirement for downstream analyses that depend on higher-quality nanopore sequencing.
To address this need, we developed a novel basecaller, Halcyon, that incorporates neural-network techniques frequently used in the field of machine translation. Our model employs monotonic-attention mechanisms to learn semantic correspondences between nucleotides and signal levels without any pre-segmentation of the input signals. We evaluated its performance on a human whole-genome sequencing dataset and demonstrated that Halcyon outperformed existing third-party basecallers and achieved performance competitive with the latest Oxford Nanopore Technologies basecallers.
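To make the idea of monotonic attention concrete, the following is a minimal sketch of one common formulation: hard monotonic attention at inference time, where each output step scans the encoded signal frames forward from the previously attended frame and stops at the first frame whose selection probability reaches a threshold. It is an illustration under stated assumptions, not Halcyon's implementation: the function names, the dot-product energy, the 0.5 threshold, and the toy vectors are all hypothetical, and the actual model is defined in the project's source code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hard_monotonic_attend(queries, memory, threshold=0.5):
    """Return, for each decoder query, the index of the attended encoder frame.

    Assumptions (hypothetical, for illustration only):
      - the attention energy is a plain dot product between query and frame;
      - a frame is selected when sigmoid(energy) >= threshold;
      - if no frame is selected, the step attends to nothing (None) and the
        scan position is left unchanged.
    """
    attended = []
    start = 0  # scanning never resumes before this frame
    for query in queries:
        chosen = None
        for j in range(start, len(memory)):
            if sigmoid(dot(query, memory[j])) >= threshold:
                chosen = j
                break
        attended.append(chosen)
        if chosen is not None:
            start = chosen  # the same frame may be reused, but never an earlier one
    return attended


# Toy "signal frames" and "nucleotide queries" (made-up numbers).
memory = [[1, -1], [-1, 1], [1, -1], [-1, 1]]
queries = [[2, 0], [0, 2], [2, 0], [0, 2], [2, 0]]
alignment = hard_monotonic_attend(queries, memory)
print(alignment)  # [0, 1, 2, 3, None]
```

Because the scan position only moves forward, the attended indices are non-decreasing, which mirrors the left-to-right correspondence between a raw signal and the bases it encodes and removes the need to segment the signal beforehand.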
The source code (halcyon) can be found at https://github.com/relastle/halcyon.