Department of Computer Science and Engineering and Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.
Bioinformatics. 2014 Jan 1;30(1):31-7. doi: 10.1093/bioinformatics/btt310. Epub 2013 Jun 3.
Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.
We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.
Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.
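The sampling idea behind the approximate abundance histograms can be illustrated with a minimal sketch. It is not KmerGenie's implementation: the function names, the choice of hash, and the `rate` parameter are assumptions made for illustration. A k-mer is kept only if its hash is divisible by `rate`, so every occurrence of a kept k-mer is counted and its abundance stays exact. The number of distinct k-mers at each abundance is then scaled up by `rate` to approximate the full histogram.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def is_sampled(kmer, rate):
    """Keep a k-mer iff its 64-bit hash is divisible by rate.

    The decision depends only on the k-mer itself, so either all of its
    occurrences are counted or none are. (Hypothetical helper; the hash
    function is an assumption, not taken from the source.)
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % rate == 0


def sampled_histogram(reads, k, rate):
    """Approximate the k-mer abundance histogram from a ~1/rate sample.

    Returns {abundance: estimated number of distinct k-mers}.
    With rate == 1 the histogram is exact.
    """
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if is_sampled(kmer, rate):
                counts[kmer] += 1
    # Number of sampled distinct k-mers seen at each abundance.
    hist = Counter(counts.values())
    # Scale by rate to estimate the unsampled histogram.
    return {abundance: n * rate for abundance, n in sorted(hist.items())}
```

For example, `sampled_histogram(reads, k=31, rate=1000)` would count roughly one in a thousand distinct 31-mers. This sketch does not merge a k-mer with its reverse complement, which a tool working on real sequencing reads would normally do.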