EPGA: de novo assembly using the distributions of reads and insert size.

Author information

School of Information Science and Engineering, Central South University, ChangSha 410083, China, College of Computer Science and Technology, Henan Polytechnic University, JiaoZuo, 454000, China, Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan S7N 5A9, Canada and Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.

Publication information

Bioinformatics. 2015 Mar 15;31(6):825-33. doi: 10.1093/bioinformatics/btu762. Epub 2014 Nov 17.

DOI:10.1093/bioinformatics/btu762

PMID:25406329

Abstract

MOTIVATION

In genome assembly, the central problem is determining the sequence regions upstream and downstream of a sequence seed in order to construct long contigs or scaffolds. When a seed is extended, repetitive regions in the genome give rise to multiple feasible extension candidates, which makes assembly harder. The widely accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this approach struggles with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some sequence regions with too few or too many reads. Together, these problems prevent existing assemblers from producing satisfactory assemblies.
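To make the branching problem concrete, the following minimal sketch builds a toy de Bruijn graph and lists the bases that could extend a seed. It is an illustration only, not EPGA's implementation: the function names `build_de_bruijn` and `extension_candidates`, the toy reads, and the choice of k = 4 are all assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extension_candidates(graph, seed, k):
    """Return the bases that could extend `seed` by one position (hypothetical helper)."""
    suffix = seed[-(k - 1):]
    return sorted(node[-1] for node in graph.get(suffix, ()))


# Toy reads (assumed data): the repeat "ACG" is followed by "T" in one
# read and by "A" in the other, so a seed ending in "ACG" has two extensions.
reads = ["TTACGT", "GGACGA"]
graph = build_de_bruijn(reads, k=4)
candidates = extension_candidates(graph, "TTACG", k=4)
print(candidates)
```

A seed that ends inside the shared repeat has more than one outgoing edge, so read overlap alone cannot decide which extension is correct. This is the situation in which paired-end information is normally brought in.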

RESULTS

In this article, we develop an algorithm called extract paths for genome assembly (EPGA), which extracts paths from the de Bruijn graph to assemble a genome. EPGA uses a new score function that evaluates extension candidates based on the distributions of reads and insert size. The distribution of reads addresses problems caused by sequencing errors and short repetitive regions. By assessing variation in the insert size distribution, EPGA addresses problems introduced by some complex repetitive regions. To handle uneven sequencing depth, EPGA uses relative mapping to evaluate extension candidates. We compare EPGA with other popular assemblers on real datasets. The experimental results show that EPGA obtains longer and more accurate contigs and scaffolds.
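The abstract does not give the score function itself, so the sketch below is only a hypothetical stand-in that shows the general idea of ranking extension candidates by insert size agreement and depth-normalized support. The z-score penalty, the product used to combine the two terms, the function names, and the candidate data are all assumptions for illustration and should not be read as EPGA's actual formula.

```python
from statistics import mean


def insert_size_score(observed_sizes, lib_mean, lib_sd):
    """Score in (0, 1]; equals 1 when the observed mean matches the library mean.

    Hypothetical rule: penalize the z-score of the sample mean.
    """
    if not observed_sizes:
        return 0.0
    n = len(observed_sizes)
    z = abs(mean(observed_sizes) - lib_mean) / (lib_sd / n ** 0.5)
    return 1.0 / (1.0 + z)


def relative_support(pair_count, local_depth):
    """Supporting pairs divided by local depth, so deep regions are not favored."""
    return pair_count / local_depth if local_depth > 0 else 0.0


def pick_candidate(candidates, lib_mean, lib_sd):
    """Return the name of the candidate with the highest combined score."""
    def combined(c):
        size_term = insert_size_score(c["sizes"], lib_mean, lib_sd)
        return size_term * relative_support(len(c["sizes"]), c["depth"])
    return max(candidates, key=combined)["name"]


# Assumed toy data: insert sizes implied by read pairs spanning each candidate.
candidates = [
    {"name": "A", "sizes": [498, 503, 501, 499], "depth": 8.0},
    {"name": "B", "sizes": [620, 640, 610, 650], "depth": 8.0},
]
best = pick_candidate(candidates, lib_mean=500.0, lib_sd=20.0)
print(best)
```

Candidate B has the same number of supporting pairs as A, but its implied insert sizes sit far from the library mean, so it ranks lower. Dividing by local depth is one simple way to express support relative to coverage rather than as a raw count.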

Similar articles

EPGA-SC: A Framework for de novo Assembly of Single-Cell Sequencing Reads.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jul-Aug;18(4):1492-1503. doi: 10.1109/TCBB.2019.2945761. Epub 2021 Aug 6.

ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution.

IEEE/ACM Trans Comput Biol Bioinform. 2017 Jul-Aug;14(4):916-925. doi: 10.1109/TCBB.2016.2550433. Epub 2016 Apr 5.

De novo assembly of bacterial genomes with repetitive DNA regions by dnaasm application.

BMC Bioinformatics. 2018 Jul 18;19(1):273. doi: 10.1186/s12859-018-2281-4.

Illumina error correction near highly repetitive DNA regions improves de novo genome assembly.

BMC Bioinformatics. 2019 Jun 3;20(1):298. doi: 10.1186/s12859-019-2906-2.

SOPRA: Scaffolding algorithm for paired reads via statistical optimization.

BMC Bioinformatics. 2010 Jun 24;11:345. doi: 10.1186/1471-2105-11-345.

Improving de novo Assembly Based on Read Classification.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):177-188. doi: 10.1109/TCBB.2018.2861380. Epub 2018 Jul 30.

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Bioinformatics. 2012 Jun 1;28(11):1420-8. doi: 10.1093/bioinformatics/bts174. Epub 2012 Apr 11.

BOSS: a novel scaffolding algorithm based on an optimized scaffold graph.

Bioinformatics. 2017 Jan 15;33(2):169-176. doi: 10.1093/bioinformatics/btw597. Epub 2016 Sep 14.

SLR: a scaffolding algorithm based on long reads and contig classification.

BMC Bioinformatics. 2019 Oct 30;20(1):539. doi: 10.1186/s12859-019-3114-9.

Cited by

SIns: A Novel Insertion Detection Approach Based on Soft-Clipped Reads.

Front Genet. 2021 Apr 30;12:665812. doi: 10.3389/fgene.2021.665812. eCollection 2021.

RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads.

BMC Bioinformatics. 2020 Oct 19;21(1):463. doi: 10.1186/s12859-020-03779-w.

LROD: An Overlap Detection Algorithm for Long Reads Based on k-mer Distribution.

Front Genet. 2020 Jul 29;11:632. doi: 10.3389/fgene.2020.00632. eCollection 2020.

SLR: a scaffolding algorithm based on long reads and contig classification.

BMC Bioinformatics. 2019 Oct 30;20(1):539. doi: 10.1186/s12859-019-3114-9.

Facilitated sequence assembly using densely labeled optical DNA barcodes: A combinatorial auction approach.

PLoS One. 2018 Mar 9;13(3):e0193900. doi: 10.1371/journal.pone.0193900. eCollection 2018.

Re-alignment of the unmapped reads with base quality score.

BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S8. doi: 10.1186/1471-2105-16-S5-S8. Epub 2015 Mar 18.
