Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction.

Affiliation

Siemens Corporate Research, 755 College Road East, Princeton, NJ, USA.

Publication information

BMC Bioinformatics. 2010 Jan 15;11:33. doi: 10.1186/1471-2105-11-33.

DOI:10.1186/1471-2105-11-33
PMID:20078885
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2824677/
Abstract

BACKGROUND

With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps.

RESULTS

We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies.
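
To make the feature set concrete, here is a minimal sketch of how two of the non-comparative statistics named above (percent mismatch and k-mer frequency within an overlap) might be computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give exact feature definitions, the function names are hypothetical, gaps are counted as mismatches by assumption, and Python is used here although the paper trained its models in Weka. The comparative genomics score is omitted because it depends on mapping reads to reference genomes.

```python
from collections import Counter


def percent_mismatch(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where the two reads disagree.

    Assumption: inputs are the two rows of a pairwise alignment, padded
    with '-' for gaps, and a gap column counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    mismatches = sum(1 for x, y in zip(aligned_a, aligned_b) if x != y)
    return 100.0 * mismatches / len(aligned_a)


def kmer_counts(reads, k: int) -> Counter:
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mean_kmer_frequency(overlap_seq: str, counts: Counter, k: int) -> float:
    """Average read-set frequency of the k-mers inside an overlap.

    High values hint that the overlap lies in repeated sequence.
    """
    kmers = [overlap_seq[i:i + k] for i in range(len(overlap_seq) - k + 1)]
    if not kmers:
        return 0.0
    return sum(counts[kmer] for kmer in kmers) / len(kmers)


def overlap_features(aligned_a: str, aligned_b: str, counts: Counter, k: int) -> dict:
    """Hypothetical feature vector for one candidate overlap."""
    ungapped = aligned_a.replace("-", "")
    return {
        "percent_mismatch": percent_mismatch(aligned_a, aligned_b),
        "mean_kmer_frequency": mean_kmer_frequency(ungapped, counts, k),
    }
```

A vector like this, labeled true or false using reads from an already finished genome, is the kind of training example a classifier could consume before contig construction.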

CONCLUSIONS

Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
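
The N50 statistic reported in the results is glossed in the abstract as a median contig length. Under the usual definition it is a length-weighted median: the largest length L such that contigs of length L or more contain at least half of the assembled bases. A short sketch of that standard definition follows (the function name is illustrative and is not taken from the paper or from Minimus).

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

For example, contigs of lengths 2, 2, 2, 3, 3, 4, 8 and 8 total 32 bases, and the two longest contigs already cover 16 of them, so the N50 is 8.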

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a06/2824677/28e8f1c44ead/1471-2105-11-33-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a06/2824677/895ad0b69a96/1471-2105-11-33-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a06/2824677/18e05fa89332/1471-2105-11-33-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a06/2824677/e0e8c0e2679c/1471-2105-11-33-4.jpg
