Recount: expectation maximization based error correction tool for next generation sequencing data.
Author information
Wijaya Edward, Frith Martin C, Suzuki Yutaka, Horton Paul
Affiliation
AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo 135-0064, Japan.
Publication information
Genome Inform. 2009 Oct;23(1):189-201.
Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate, which biases their output by introducing false reads and reducing the number of real reads. Although methods developed for SAGE data can reduce these false counts to a considerable degree, until now they have not been implemented in a scalable way. Recently, a program named FREC has been developed to address this problem for next generation sequencing data. In this paper, we introduce RECOUNT, our implementation of an Expectation Maximization algorithm for tag count correction, and compare it to FREC. Using both the reference genome and simulated data, we find that RECOUNT performs as well as or better than FREC, while using much less memory (e.g. 5 GB vs. 75 GB). Furthermore, we report the first analysis of tag count correction with real data in the context of gene expression analysis. Our results show that tag count correction not only increases the number of mappable tags, but can make a real difference in the biological interpretation of next generation sequencing data. RECOUNT is an open-source C++ program available at http://seq.cbrc.jp/recount.
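To make the idea of Expectation Maximization for tag count correction concrete, the following is a minimal Python sketch. It is not RECOUNT's implementation (which is written in C++) and does not reproduce its error model. It assumes a uniform per-base substitution error rate, considers only observed tags at Hamming distance 1 as possible sources of an erroneous tag, and uses hypothetical names (`neighbors`, `em_correct`, `error_rate`) chosen for illustration.

```python
from collections import Counter


def neighbors(tag, alphabet="ACGT"):
    """Yield every sequence at Hamming distance 1 from `tag`."""
    for pos, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                yield tag[:pos] + alt + tag[pos + 1:]


def em_correct(observed, error_rate=0.01, iterations=50):
    """Estimate expected true counts for each observed tag (illustrative only).

    Assumptions: all tags have the same length, errors are substitutions
    with a uniform per-base rate, and a read differs from its true tag
    at no more than one position.
    """
    length = len(next(iter(observed)))
    # Probability of reading a tag with no error, or with one specific error.
    p_same = (1 - error_rate) ** length
    p_one = (error_rate / 3) * (1 - error_rate) ** (length - 1)
    total = sum(observed.values())

    # Candidate true tags for each observed tag: itself plus observed neighbors.
    sources = {
        tag: [tag] + [n for n in neighbors(tag) if n in observed]
        for tag in observed
    }

    # Start from the observed proportions.
    pi = {tag: count / total for tag, count in observed.items()}
    expected = {tag: float(count) for tag, count in observed.items()}

    for _ in range(iterations):
        # E-step: split each observed count among its candidate true tags.
        expected = dict.fromkeys(observed, 0.0)
        for tag, count in observed.items():
            weights = [
                pi[src] * (p_same if src == tag else p_one)
                for src in sources[tag]
            ]
            norm = sum(weights)
            for src, weight in zip(sources[tag], weights):
                expected[src] += count * weight / norm
        # M-step: re-estimate true tag proportions from the expected counts.
        pi = {tag: count / total for tag, count in expected.items()}

    return expected


# Toy data: one abundant tag, one likely error of it, one unrelated tag.
observed = Counter({"ACGTACGTAC": 1000, "ACGTACGTAA": 2, "TTTTTTTTTT": 50})
corrected = em_correct(observed)
```

In this toy example the two reads of the rare tag are mostly reassigned to its abundant neighbor, the unrelated tag keeps its count, and the total number of reads is preserved. A production tool would need a more realistic error model and data structures that scale to millions of distinct tags.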