Department of Electrical Engineering & Computer Sciences, University of California, Berkeley, CA 94720, USA.
Department of Chemical Physics, Tel Aviv University, Tel Aviv-Yafo, Israel.
Open Biol. 2024 Jun;14(6):230449. doi: 10.1098/rsob.230449. Epub 2024 Jun 12.
Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is extended to an increasing number of epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs), which cannot make successful calls for k-mer contexts not seen during training because each context has its own independent emission distribution. Deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers and often outperform their HMM counterparts. It stands to reason that a DNN approach should generalize better to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.
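The contrast between independent emission distributions and shared parameters can be illustrated with a toy sketch. It is not Nanopolish, DeepSignal, or amortized-HMM: it assumes, purely for illustration, that the mean current level of a 3-mer is a baseline plus one additive contribution per (position, base), and every number and name in it is invented. A per-k-mer lookup table has nothing to say about held-out contexts, whereas a model whose weights are reused across k-mers can still predict them.

```python
from itertools import product

BASES = "ACGT"
K = 3  # toy k-mer length, chosen only to keep the example small

# Hypothetical ground truth: the mean current level of a k-mer is a baseline
# plus one contribution per (position, base). All numbers are invented.
BASELINE = 90.0
EFFECT = [
    {"A": 1.0, "C": -0.5, "G": 0.3, "T": -1.2},
    {"A": 4.0, "C": -3.0, "G": 2.0, "T": -6.0},
    {"A": 0.8, "C": 0.1, "G": -0.9, "T": 0.4},
]

def true_level(kmer):
    return BASELINE + sum(EFFECT[i][b] for i, b in enumerate(kmer))

all_kmers = ["".join(p) for p in product(BASES, repeat=K)]
held_out = ["ACG", "TTA", "GCT", "CAG"]  # contexts absent from training
train = [k for k in all_kmers if k not in held_out]

# (1) Independent emissions: one free parameter per k-mer seen in training.
table = {k: true_level(k) for k in train}

def table_predict(kmer):
    return table.get(kmer)  # None: nothing was ever learned for this context

# (2) Shared parameters: one weight per (position, base) plus a bias,
# reused by every k-mer that has that base at that position.
def features(kmer):
    x = [1.0]  # bias term
    for b in kmer:
        x.extend(1.0 if b == c else 0.0 for c in BASES)
    return x

def fit_shared(kmers, steps=1000, lr=0.3):
    # Plain full-batch gradient descent on mean squared error.
    xs = [features(k) for k in kmers]
    ys = [true_level(k) for k in kmers]
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w

weights = fit_shared(train)

def shared_predict(kmer):
    return sum(wi * xi for wi, xi in zip(weights, features(kmer)))

# Columns: k-mer, table prediction, shared-model prediction, true level.
for k in held_out:
    print(k, table_predict(k), round(shared_predict(k), 3), round(true_level(k), 3))
```

The shared model only extrapolates here because the toy data really are additive; real nanopore signal is not, which is why the abstract frames generalization to unseen contexts as an empirical question.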