Basecalling Using Joint Raw and Event Nanopore Data Sequence-to-Sequence Processing.

Affiliation

Institute of Computer Science, Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland.

Publication information

Sensors (Basel). 2022 Mar 15;22(6):2275. doi: 10.3390/s22062275.

Abstract

Third-generation DNA sequencers provided by Oxford Nanopore Technologies (ONT) produce a series of samples of the electrical current in a nanopore. This time series is used to determine the sequence of nucleotides. The task of translating current values into nucleotide symbols is called basecalling. Various basecalling solutions have already been proposed. The earlier ones were based on Hidden Markov Models, but the best ones use neural networks or other machine learning models. Unfortunately, the achieved accuracy is still lower than that of competing sequencing techniques, such as Illumina's. Basecallers differ in the type of input data: currently, most of them work on raw data straight from the sequencer (a time series of current values). Still, the approach of using event data is also explored. Event data are obtained by preprocessing raw data and dividing them into segments, each described by several features computed from the raw data values within that segment. We propose a novel basecaller that jointly processes raw and event data. We define basecalling as a sequence-to-sequence translation, and we use a machine learning model based on an encoder-decoder architecture of recurrent neural networks. Our model incorporates twin encoders and an attention mechanism. We tested our solution on simulated and real datasets. We compare the accuracy of the full model with that of its components, which process only raw or only event data. We also compare our solution with Guppy, the existing ONT basecaller. Results of numerical experiments show that joint raw and event data processing provides better basecalling accuracy than processing each data type separately. We implement an application called Ravvent, freely available under the MIT licence.
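
To make the notion of event data more concrete, the following is a minimal sketch of how a raw current trace might be divided into segments, each described by a few features. It is not the segmentation used by Ravvent or by ONT software: the `detect_events` function, the `Event` fields, the `jump` threshold, and the sample values are all assumptions chosen for illustration.

```python
from statistics import fmean, pstdev
from typing import List, NamedTuple


class Event(NamedTuple):
    start: int    # index of the first raw sample in the segment
    length: int   # number of raw samples in the segment
    mean: float   # mean current over the segment
    stdv: float   # population standard deviation over the segment


def detect_events(raw: List[float], jump: float = 5.0, min_len: int = 2) -> List[Event]:
    """Close the current segment when a sample departs from its mean by more than `jump`.

    Hypothetical, simplified segmentation rule (an assumption, not the paper's method).
    The segment mean is recomputed at every step, which is fine for a short example.
    """
    events: List[Event] = []
    if not raw:
        return events
    start = 0
    for i in range(1, len(raw)):
        seg = raw[start:i]
        # Only close a segment once it has at least `min_len` samples.
        if len(seg) >= min_len and abs(raw[i] - fmean(seg)) > jump:
            events.append(Event(start, len(seg), fmean(seg), pstdev(seg)))
            start = i
    # The remaining samples form the last segment.
    seg = raw[start:]
    events.append(Event(start, len(seg), fmean(seg), pstdev(seg)))
    return events


# Made-up current values with three visibly different levels.
raw_signal = [80.0, 81.0, 79.0, 80.0, 95.0, 96.0, 94.0, 70.0, 71.0, 69.0, 70.0]
events = detect_events(raw_signal)
for ev in events:
    print(ev)
```

In a joint model of the kind the abstract describes, the raw samples and a feature sequence like this one would presumably be fed to separate encoders; the sketch covers only the preprocessing step.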

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/82fb/8954548/4d7dcec1e924/sensors-22-02275-g004.jpg
