Lonardi Stefano, Mirebrahim Hamid, Wanamaker Steve, Alpert Matthew, Ciardo Gianfranco, Duma Denisa, Close Timothy J
Department of Computer Science and Engineering, Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, Department of Computer Science, Iowa State University, Ames, IA 50011 and Baylor College of Medicine, Houston, TX 77030, USA.
Bioinformatics. 2015 Sep 15;31(18):2972-80. doi: 10.1093/bioinformatics/btv311. Epub 2015 May 20.
Since the invention of DNA sequencing in the 1970s, computational biologists have had to deal with the problem of de novo genome assembly from limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing.
We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones (in the context of the combinatorial pooling design we have recently proposed), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on 'divide and conquer': we 'slice' a large dataset into smaller samples of optimal size, decode each slice independently, and then merge the results. Experimental results on over 15 000 barley BACs and over 4000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data.
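The 'divide and conquer' strategy for the decoding problem can be sketched in a few lines. The following is a minimal illustration, not the paper's actual pipeline: `decode_slice` is a hypothetical stand-in for the real BAC-decoding step (which relies on the combinatorial pooling design), and conflicts between slices are resolved here by a simple majority vote per read.

```python
# Sketch of "slice, decode each slice independently, merge the results".
# decode_slice is a placeholder for the real pooling-design decoder.
from collections import Counter


def make_slices(reads, slice_size):
    """Partition a large read set into slices of a chosen (optimal) size."""
    return [reads[i:i + slice_size] for i in range(0, len(reads), slice_size)]


def decode_slice(reads):
    """Placeholder decoder: map each read id to a candidate BAC label.
    Here a 'read' is a (read_id, bac_label) pair so the flow can be shown;
    the real decoder derives the label from the pooling design."""
    return {read_id: bac for read_id, bac in reads}


def merge_decodings(decodings):
    """Resolve conflicts across slices by majority vote per read."""
    votes = {}
    for decoding in decodings:
        for read_id, bac in decoding.items():
            votes.setdefault(read_id, Counter())[bac] += 1
    return {read_id: c.most_common(1)[0][0] for read_id, c in votes.items()}


# Toy data: read ids with (possibly conflicting) candidate BAC labels.
reads = [(1, "BAC_A"), (2, "BAC_B"), (1, "BAC_A"), (2, "BAC_C"), (2, "BAC_B")]
slices = make_slices(reads, slice_size=2)
merged = merge_decodings([decode_slice(s) for s in slices])
```

The point of slicing is that each slice stays below the depth threshold at which sequencing errors start to dominate, so each per-slice decoding is more reliable than a single decoding of the full ultra-deep dataset; the merge step then reconciles any disagreements.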
Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software Hashfilter can be downloaded from http://goo.gl/MIyZHs
stelo@cs.ucr.edu or timothy.close@ucr.edu
Supplementary data are available at Bioinformatics online.