Schreiber Jacob, Karplus Kevin
Nanopore Group, Department of Biomolecular Engineering, University of California Santa Cruz, CA 95064, USA.
Bioinformatics. 2015 Jun 15;31(12):1897-903. doi: 10.1093/bioinformatics/btv046. Epub 2015 Feb 3.
Nanopore-based sequencing techniques can reconstruct properties of biosequences by analyzing the sequence-dependent ionic current steps produced as biomolecules pass through a pore. Typically, this involves aligning new data to a reference, and until now both reference construction and alignment have been performed by hand.
We propose an automated method for aligning nanopore data to a reference through the use of hidden Markov models. Several features that arise from prior processing steps and from the class of enzyme used can be incorporated into the model in a simple way. Previously, the M2MspA nanopore was shown to be sensitive enough to distinguish between cytosine, methylcytosine and hydroxymethylcytosine. We validated our automated methodology on a subset of those data by automatically calculating an error rate for the distinction between the three cytosine variants, and we show that the automated methodology produces a 2-3% error rate, lower than the 10% error rate from previous manual segmentation and alignment.
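To make the alignment idea concrete, the following is a minimal sketch of Viterbi alignment of segmented current levels to a reference using a hidden Markov model. It is not the authors' model: the function names `log_gauss` and `viterbi_align`, the Gaussian emissions with a shared standard deviation, the uniform start distribution, and the transition probabilities are all illustrative assumptions. The choice of stay, skip and backstep transitions as stand-ins for segmentation artifacts and enzyme behavior is likewise an assumption about the kind of features the abstract mentions.

```python
import math


def log_gauss(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_align(segment_means, ref_means, std=1.0, log_trans=None):
    """Return the most likely reference index for each observed segment.

    segment_means: mean current of each detected segment, in read order.
    ref_means:     expected mean current at each reference position.
    log_trans:     log probability of moving by a given offset along the
                   reference (hypothetical values, not fitted to any data):
                   -1 backstep, 0 stay (oversegmentation), +1 normal step,
                   +2 skip (a missed segment).
    Transitions are not renormalized at the ends of the reference.
    """
    if log_trans is None:
        log_trans = {
            -1: math.log(0.05),
            0: math.log(0.15),
            1: math.log(0.70),
            2: math.log(0.10),
        }
    n = len(ref_means)
    if n == 0 or not segment_means:
        return []

    # Initialization: assume the read may start at any reference position.
    score = [-math.log(n) + log_gauss(segment_means[0], m, std) for m in ref_means]
    backpointers = []

    # Recursion: best predecessor for each reference position at each step.
    for x in segment_means[1:]:
        new_score = [float("-inf")] * n
        ptr = [0] * n
        for j in range(n):
            for offset, lt in log_trans.items():
                i = j - offset  # predecessor position for this move
                if 0 <= i < n and score[i] + lt > new_score[j]:
                    new_score[j] = score[i] + lt
                    ptr[j] = i
            new_score[j] += log_gauss(x, ref_means[j], std)
        backpointers.append(ptr)
        score = new_score

    # Traceback from the best final state.
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    reference = [10.0, 20.0, 30.0, 40.0, 50.0]
    # The second level is split into two segments and the fourth is missed.
    observed = [10.2, 19.7, 20.3, 30.1, 49.6]
    print(viterbi_align(observed, reference))  # [0, 1, 1, 2, 4]
```

In this toy input the path repeats index 1 for the split segment and jumps from index 2 to index 4 for the missed one, which is the kind of bookkeeping that manual alignment would otherwise have to do by eye.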
The data, output, scripts and tutorials replicating the analysis are available at https://github.com/UCSCNanopore/Data/tree/master/Automation.