Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction.

Affiliation

Siemens Corporate Research, 755 College Road East, Princeton, NJ, USA.

Publication information

BMC Bioinformatics. 2010 Jan 15;11:33. doi: 10.1186/1471-2105-11-33.

Abstract

BACKGROUND

With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps.
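The repeat problem described above can be illustrated with a minimal sketch. It assumes exact suffix-prefix matching on a toy genome; the function name `suffix_prefix_overlap`, the `min_len` parameter, and all sequences are hypothetical and are not taken from Minimus or from the paper.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 4) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact match of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


# Toy genome (hypothetical): an 8-base repeat occurs twice, flanked by unique sequence.
REPEAT = "ACGTACGT"
genome = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG"

read_a = genome[0:13]   # ends with the first copy of the repeat
read_b = genome[18:31]  # starts with the second copy of the repeat
read_c = genome[8:20]   # genuinely overlaps read_a on the genome

# read_a and read_b align over the repeat although they come from
# disjoint genome intervals (0-12 and 18-30): a false overlap.
false_overlap = suffix_prefix_overlap(read_a, read_b)

# read_a and read_c share genome positions 8-12: a true overlap.
true_overlap = suffix_prefix_overlap(read_a, read_c)

print(false_overlap, true_overlap)
```

In this toy case the false overlap is longer than the true one, so overlap length alone cannot separate the two classes.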

RESULTS

We present an approach that extends the Minimus assembler with a data-driven step that classifies overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps, as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies.
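As a rough illustration of two of the non-comparative statistics named above, the following sketch computes percent mismatch and a mean k-mer frequency for an aligned overlap region. The exact feature definitions used in the paper are not given in the abstract, so every function name and definition here is an assumption. The comparative genomics score and the Weka classifiers are not modelled.

```python
from collections import Counter


def percent_mismatch(x: str, y: str) -> float:
    """Percentage of positions at which two equal-length aligned regions differ."""
    if len(x) != len(y) or not x:
        raise ValueError("aligned regions must be non-empty and equal in length")
    mismatches = sum(1 for p, q in zip(x, y) if p != q)
    return 100.0 * mismatches / len(x)


def kmer_counts(reads, k: int) -> Counter:
    """Count every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mean_kmer_frequency(region: str, counts: Counter, k: int) -> float:
    """Average read-set count of the k-mers in `region` (0.0 if region is shorter than k)."""
    kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
    if not kmers:
        return 0.0
    return sum(counts[kmer] for kmer in kmers) / len(kmers)


def overlap_features(a_region: str, b_region: str, counts: Counter, k: int) -> dict:
    """Assemble a hypothetical feature vector for one candidate overlap."""
    return {
        "percent_mismatch": percent_mismatch(a_region, b_region),
        "mean_kmer_frequency": mean_kmer_frequency(a_region, counts, k),
    }


# Toy data (hypothetical): one read, and a 5-base overlap with one mismatch.
counts = kmer_counts(["ACGTACGT"], k=4)
features = overlap_features("ACGTA", "ACGTT", counts, k=4)
print(features)
```

A high mean k-mer frequency suggests that the overlap lies in repeated sequence, which is one reason such a statistic could help a classifier flag false overlaps. Feature dictionaries like this one would then be passed to a trained model.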

CONCLUSIONS

Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
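The results summarize assembly quality with N50. For reference, N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases, which makes it a length-weighted median rather than a plain median. The sketch below computes it; the function name `n50` and the contig lengths are hypothetical and are not data from the paper.

```python
def n50(contig_lengths) -> int:
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-negative lengths")


# Hypothetical example: the same 20 bases split into six or three contigs.
fragmented = [2, 2, 3, 3, 4, 6]
merged = [4, 6, 10]
print(n50(fragmented), n50(merged))
```

Because the total assembled length is identical in both toy assemblies, the change in N50 reflects contiguity only, not coverage.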
