SOPRA: Scaffolding algorithm for paired reads via statistical optimization.

Author information

Department of Physics and Astronomy, Rutgers University, Piscataway, New Jersey, USA.

Publication information

BMC Bioinformatics. 2010 Jun 24;11:345. doi: 10.1186/1471-2105-11-345.

DOI:10.1186/1471-2105-11-345

PMID:20576136

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC2909219/

Abstract

BACKGROUND

High-throughput sequencing (HTS) platforms produce gigabases of short-read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate-size genomes from such reads remains a significant challenge. These limitations can be partially overcome with mate pair technology, which provides pairs of short reads separated by a known distance along the genome.

RESULTS

We have developed SOPRA, a tool designed to exploit mate pair/paired-end information for the assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with the vertices and edges of the contig connectivity graph. The vertices of this graph are individual contigs, and edges are drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for the previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations arising from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor-quality results in the current context. SOPRA circumvents this problem by treating all constraints on an equal footing when solving the optimization problem; the solution itself indicates the problematic constraints (chimeric or repetitive contigs, etc.) to be removed. Solving and removing constraints is iterated until a core set of consistent constraints remains. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. To assess the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors.
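The solve-and-remove loop described above can be illustrated with a toy example. This is a minimal sketch and not SOPRA's implementation: it assumes each contig carries an orientation of +1 or -1, each mate pair link states whether two contigs should share an orientation (+1) or be opposite (-1), and the optimization is done by brute force, which is only feasible for a handful of contigs. The function names (`violated`, `best_orientation`, `consistent_core`) and the example links are hypothetical.

```python
import itertools


def violated(edges, spin):
    """Return the links whose required relative orientation is not met."""
    return [(i, j, sign) for (i, j, sign) in edges if spin[i] * spin[j] != sign]


def best_orientation(contigs, edges):
    """Try every +1/-1 assignment and keep the one violating the fewest links.

    Exhaustive search is an assumption made for clarity; it scales as 2**n.
    """
    best_cost, best_spin = None, None
    for values in itertools.product((1, -1), repeat=len(contigs)):
        spin = dict(zip(contigs, values))
        cost = len(violated(edges, spin))
        if best_cost is None or cost < best_cost:
            best_cost, best_spin = cost, spin
    return best_spin


def consistent_core(contigs, edges):
    """Alternate between solving and dropping violated links until none remain."""
    edges = list(edges)
    removed = []
    while True:
        spin = best_orientation(contigs, edges)
        bad = violated(edges, spin)
        if not bad:
            return spin, edges, removed
        removed.extend(bad)
        edges = [e for e in edges if e not in bad]


# Hypothetical links: (contig, contig, +1 same orientation / -1 opposite).
contigs = ["A", "B", "C", "D"]
links = [
    ("A", "B", +1),
    ("B", "C", -1),
    ("A", "C", -1),
    ("C", "D", +1),
    ("B", "D", -1),
    ("A", "D", +1),  # conflicts with the other links, e.g. a chimeric join
]

spin, kept, removed = consistent_core(contigs, links)
# removed == [("A", "D", 1)]; the five remaining links are mutually consistent.
```

In this toy input the link between A and D is the only one shared by both inconsistent cycles, so the optimal assignment singles it out without any link being favored in advance. A real scaffolder would also need link weights, distance constraints, and a scalable optimizer, none of which are shown here.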

CONCLUSIONS

Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies from any type of mate pair sequencing data.
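For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch computes it under the usual definition: the length L such that scaffolds of length at least L together cover at least half of the total assembled length. The function name `n50` and the example lengths are illustrative and do not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")


# Hypothetical scaffold lengths in kb; total is 290, so half is 145.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)  # 80 + 70 = 150 >= 145, so N50 is 70
```

Because N50 rewards long scaffolds regardless of correctness, it is best read together with error rates such as those the abstract mentions.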

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0768/2909219/6278a231bcb0/1471-2105-11-345-1.jpg

Similar articles

Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies.

BMC Bioinformatics. 2011 Apr 13;12:95. doi: 10.1186/1471-2105-12-95.

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Bioinformatics. 2012 Jun 1;28(11):1429-37. doi: 10.1093/bioinformatics/bts175. Epub 2012 Apr 6.

Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.

BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.

EPGA: de novo assembly using the distributions of reads and insert size.

Bioinformatics. 2015 Mar 15;31(6):825-33. doi: 10.1093/bioinformatics/btu762. Epub 2014 Nov 17.

GapFiller: a de novo assembly approach to fill the gap within paired reads.

BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S8. doi: 10.1186/1471-2105-13-S14-S8. Epub 2012 Sep 7.

Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers.

J Comput Biol. 2011 Nov;18(11):1625-34. doi: 10.1089/cmb.2011.0151. Epub 2011 Oct 14.

SCOP: a novel scaffolding algorithm based on contig classification and optimization.

Bioinformatics. 2019 Apr 1;35(7):1142-1150. doi: 10.1093/bioinformatics/bty773.

Scaffolding pre-assembled contigs using SSPACE.

Bioinformatics. 2011 Feb 15;27(4):578-9. doi: 10.1093/bioinformatics/btq683. Epub 2010 Dec 12.

SLIQ: simple linear inequalities for efficient contig scaffolding.

J Comput Biol. 2012 Oct;19(10):1162-75. doi: 10.1089/cmb.2011.0263.

Cited by

Maptcha: an efficient parallel workflow for hybrid genome scaffolding.

BMC Bioinformatics. 2024 Aug 8;25(1):263. doi: 10.1186/s12859-024-05878-4.

Graph-based self-supervised learning for repeat detection in metagenomic assembly.

Genome Res. 2024 Oct 11;34(9):1468-1476. doi: 10.1101/gr.279136.124.

Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing.

Cell Rep Methods. 2024 May 20;4(5):100754. doi: 10.1016/j.crmeth.2024.100754. Epub 2024 Apr 12.

RegScaf: a regression approach to scaffolding.

Bioinformatics. 2022 May 13;38(10):2675-2682. doi: 10.1093/bioinformatics/btac174.

Characterization of Isolate Reveals New Prospects in Waste Stream Valorization for Bacterial Cellulose Production.

Microorganisms. 2021 Oct 26;9(11):2230. doi: 10.3390/microorganisms9112230.

SWALO: scaffolding with assembly likelihood optimization.

Nucleic Acids Res. 2021 Nov 18;49(20):e117. doi: 10.1093/nar/gkab717.

Empirical evaluation of methods for genome assembly.

PeerJ Comput Sci. 2021 Jul 9;7:e636. doi: 10.7717/peerj-cs.636. eCollection 2021.

Sequencing and assembly of the Egyptian buffalo genome.

PLoS One. 2020 Aug 19;15(8):e0237087. doi: 10.1371/journal.pone.0237087. eCollection 2020.

Differential Contribution of the Parental Genomes to a × Hybrid, Inferred by Phenomic, Genomic, and Transcriptomic Analyses, at Different Industrial Stress Conditions.

Front Bioeng Biotechnol. 2020 Mar 3;8:129. doi: 10.3389/fbioe.2020.00129. eCollection 2020.

Genome structure reveals the diversity of mating mechanisms in x hybrids, and the genomic instability that promotes phenotypic diversity.

Microb Genom. 2020 Mar;6(3). doi: 10.1099/mgen.0.000333.
