Institute of Marine Research, P.O. Box 1870, N-5817 Bergen, Norway.
Bioinformatics. 2011 Jul 1;27(13):i304-9. doi: 10.1093/bioinformatics/btr251.
454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. Although there are several tools available for noise removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types.
By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim.
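The homopolymer length inaccuracies mentioned above can be illustrated with a minimal sketch. It is not the flowsim implementation or its error model. It assumes a cyclic flow order of TACG, Gaussian noise whose standard deviation grows linearly with homopolymer length, and base calling by rounding each flow value. The function names and the parameters `base_sd` and `sd_per_base` are hypothetical and chosen only for illustration.

```python
import random

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order


def ideal_flowgram(seq):
    """Return noise-free flow values: the homopolymer length read at each flow."""
    if any(base not in FLOW_ORDER for base in seq):
        raise ValueError("sequence may only contain the bases T, A, C and G")
    flows = []
    pos = 0
    while pos < len(seq):
        # One full cycle of flows; each flow consumes a run of its own base.
        for base in FLOW_ORDER:
            run = 0
            while pos < len(seq) and seq[pos] == base:
                run += 1
                pos += 1
            flows.append(run)
    return flows


def noisy_flowgram(flows, rng, base_sd=0.05, sd_per_base=0.05):
    """Add Gaussian noise whose spread grows with homopolymer length (assumed model)."""
    noisy = []
    for n in flows:
        value = rng.gauss(n, base_sd + sd_per_base * n)
        noisy.append(max(0.0, value))  # signal intensities cannot be negative
    return noisy


def call_bases(flows):
    """Call bases by rounding each flow value to the nearest integer run length."""
    called = []
    for i, value in enumerate(flows):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        called.append(base * int(round(value)))
    return "".join(called)


if __name__ == "__main__":
    rng = random.Random(42)
    template = "TTTTTTAACGGGG"
    simulated = noisy_flowgram(ideal_flowgram(template), rng)
    print(template)
    print(call_bases(simulated))
```

Under this assumed model, longer homopolymer runs receive wider noise and are therefore more likely to be called one base too long or too short, while short runs are usually recovered correctly.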
The flowsim pipeline is freely available under the General Public License from http://biohaskell.org/Applications/FlowSim.