School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.
Bioinformatics. 2015 Mar 15;31(6):825-33. doi: 10.1093/bioinformatics/btu762. Epub 2014 Nov 17.
In genome assembly, the primary issue is how to determine the upstream and downstream sequence regions of sequence seeds in order to construct long contigs or scaffolds. When a sequence seed is extended, repetitive regions in the genome produce multiple feasible extension candidates, which makes assembly more difficult. The widely accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this solution has difficulty with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some sequence regions with too few or too many reads. All of these problems prevent existing assemblers from producing satisfactory assembly results.
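To make the branching problem concrete, the following is a minimal sketch, not taken from the paper, of how a repeat shared by two reads yields more than one extension candidate in a De Bruijn graph. The function names `build_de_bruijn` and `extend_seed`, the toy reads and the choice k = 4 are all illustrative assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_seed(graph, seed, k, max_steps=1000):
    """Extend `seed` to the right while exactly one successor exists.

    Returns (contig, candidates). `candidates` holds the successors found
    where extension stopped: empty at a dead end, two or more at a branch.
    """
    contig = seed
    for _ in range(max_steps):  # bound the walk so cycles cannot loop forever
        node = contig[-(k - 1):]
        successors = sorted(graph.get(node, ()))
        if len(successors) != 1:
            return contig, successors
        contig += successors[0][-1]
    return contig, []


# Two hypothetical reads that share the repeat "GGCTT" but continue differently.
reads = ["CAGGCTTAG", "TTGGCTTCA"]
graph = build_de_bruijn(reads, k=4)
contig, candidates = extend_seed(graph, "CAG", k=4)
print(contig, candidates)
```

Extension from the seed proceeds unambiguously through the shared repeat and then stops with two candidates. Choosing between them is the decision that read overlaps and paired-end information are meant to resolve.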
In this article, we develop an algorithm, called extract paths for genome assembly (EPGA), which extracts paths from the De Bruijn graph for genome assembly. EPGA uses a new score function to evaluate extension candidates based on the distributions of reads and insert size. The distribution of reads addresses problems caused by sequencing errors and short repetitive regions. By assessing the variation in the distribution of insert size, EPGA addresses problems introduced by some complex repetitive regions. To handle uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. We compare the performance of EPGA with that of other popular assemblers on real datasets. The experimental results demonstrate that EPGA can effectively obtain longer and more accurate contigs and scaffolds.
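The abstract does not give the form of the score function, so the sketch below is only a toy stand-in for the general idea of ranking extension candidates by how well the insert sizes they imply agree with the library's insert-size distribution. The names `insert_size_score` and `pick_candidate`, the z-score formula and the example numbers are all assumptions, not EPGA's actual method.

```python
from statistics import mean


def insert_size_score(implied_sizes, lib_mean, lib_sd):
    """Score in [0, 1]: higher when implied insert sizes match the library.

    `implied_sizes` are the insert sizes that paired reads would have if the
    candidate were the true extension. Returns 0.0 with no supporting pairs.
    """
    if not implied_sizes:
        return 0.0
    z_scores = [abs(size - lib_mean) / lib_sd for size in implied_sizes]
    return 1.0 / (1.0 + mean(z_scores))


def pick_candidate(candidates, lib_mean, lib_sd):
    """Return the candidate whose implied insert sizes score highest.

    `candidates` maps a candidate label to its list of implied insert sizes.
    """
    return max(
        candidates,
        key=lambda name: insert_size_score(candidates[name], lib_mean, lib_sd),
    )


# Hypothetical library with mean insert size 500 and standard deviation 20.
candidates = {
    "TTA": [495, 510, 502],  # consistent with the library
    "TTC": [810, 790, 805],  # pairs would be stretched far beyond 500
}
best = pick_candidate(candidates, lib_mean=500, lib_sd=20)
print(best)
```

A candidate that forces paired reads to span an implausible distance scores low and is rejected. This toy score ignores read distribution and sequencing depth, which the abstract says EPGA also takes into account.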