Rask Thomas S, Petersen Bent, Chen Donald S, Day Karen P, Pedersen Anders Gorm
Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Building 208, Kongens Lyngby, DK-2800, Denmark.
Division of Medical Parasitology, Department of Microbiology, New York University Langone Medical Center, 341 East 25th Street, New York, NY, 10010, USA.
BMC Bioinformatics. 2016 Apr 22;17:176. doi: 10.1186/s12859-016-1032-7.
Amplicon pyrosequencing targets a known genetic region and therefore produces reads that are expected to have certain features, such as a conserved nucleotide sequence and, in the case of protein-coding DNA, an open reading frame. Pyrosequencing errors, which consist mainly of nucleotide insertions and deletions, are by contrast likely to disrupt open reading frames. This inverse relationship between errors and expectations based on prior knowledge can be exploited to guide basecalling, i.e. the inference of nucleotide sequence from raw sequencing data.
The new basecalling method described here, named Multipass, implements a probabilistic framework for working with the raw flowgrams obtained by pyrosequencing. For each sequence variant, Multipass computes the several most likely nucleotide sequences given the flowgram data, together with their likelihoods. This probabilistic approach allows basecalling to be integrated into a larger model that incorporates other parameters, such as the likelihood of observing a full-length open reading frame at the targeted region. We apply the method to 454 amplicon pyrosequencing data obtained from a malaria virulence gene family, where Multipass generates 20% more error-free sequences than current state-of-the-art methods and provides sequence characteristics that allow generation of a set of high-confidence error-free sequences.
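The idea of ranking candidate sequences by flowgram likelihood and then weighting them by an open reading frame expectation can be illustrated with a small toy example. The sketch below is not the Multipass implementation: the Gaussian noise model, its parameters (`sigma0`, `sigma_slope`), the prior weight `p_orf`, and all function names are assumptions chosen for illustration. Only the cyclic TACG flow order and the standard stop codons are taken as given.

```python
import heapq
import itertools
import math

FLOW_ORDER = "TACG"                  # cyclic nucleotide flow order used by 454
STOP_CODONS = {"TAA", "TAG", "TGA"}


def length_log_likelihoods(signal, max_len=4, sigma0=0.15, sigma_slope=0.1):
    """Score homopolymer lengths 0..max_len for one flow signal, best first.

    Assumed noise model: signal ~ Normal(n, sigma0 + sigma_slope * n).
    """
    scores = []
    for n in range(max_len + 1):
        sigma = sigma0 + sigma_slope * n
        scores.append((-0.5 * ((signal - n) / sigma) ** 2 - math.log(sigma), n))
    return sorted(scores, reverse=True)


def candidate_sequences(flowgram, k=5, alternatives=2):
    """Return the k most likely (log_likelihood, sequence) pairs."""
    per_flow = [length_log_likelihoods(s)[:alternatives] for s in flowgram]
    best = {}
    # Brute-force enumeration: fine for a toy flowgram, not for real reads.
    for combo in itertools.product(*per_flow):
        loglik = sum(lp for lp, _ in combo)
        seq = "".join(FLOW_ORDER[i % 4] * n for i, (_, n) in enumerate(combo))
        if seq not in best or loglik > best[seq]:
            best[seq] = loglik
    return heapq.nlargest(k, ((lp, s) for s, lp in best.items()))


def has_open_frame(seq):
    """True if seq is a whole number of codons with no in-frame stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return len(seq) % 3 == 0 and all(c not in STOP_CODONS for c in codons)


def rank_with_orf_prior(flowgram, p_orf=0.99, k=5):
    """Re-rank the top-k candidates by log-likelihood plus a log ORF prior."""
    scored = []
    for loglik, seq in candidate_sequences(flowgram, k=k):
        prior = p_orf if has_open_frame(seq) else 1.0 - p_orf
        scored.append((loglik + math.log(prior), seq))
    return sorted(scored, reverse=True)


# Toy flowgram: the second-to-last flow (a C flow) reads 1.35, between C and CC.
flowgram = [0.05, 1.02, 0.03, 0.08, 0.97, 0.04, 0.06, 2.05, 0.02, 0.07, 1.35, 0.05]
print(candidate_sequences(flowgram)[0][1])   # most likely from signal alone
print(rank_with_orf_prior(flowgram)[0][1])   # most likely once the ORF prior is added
```

With these assumed parameters, the signal alone favours `ATGGC`, which has lost one C and is no longer a whole number of codons, whereas the ORF-weighted ranking prefers `ATGGCC`. The example only re-ranks a short candidate list; how Multipass itself combines the flowgram likelihood with the reading frame model is described in the full paper.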
This novel method can be used to increase the accuracy of existing and future amplicon sequencing data, particularly where extensive prior knowledge is available about the obtained sequences, for example in analysis of the immunoglobulin VDJ region, where Multipass can be combined with a model for the known recombining germline genes. Multipass is available for Roche 454 data at http://www.cbs.dtu.dk/services/MultiPass-1.0, and the concept can potentially be implemented for other sequencing technologies as well.