ntEdit: scalable genome sequence polishing.

Affiliations

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

BC Ministry of Forests, Lands, and Natural Resource Operations, Victoria, Canada.

Publication information

Bioinformatics. 2019 Nov 1;35(21):4430-4432. doi: 10.1093/bioinformatics/btz400.

Abstract

MOTIVATION

In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes.
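The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind Bloom filter-based k-mer checking: read k-mers are stored in a compact bit vector, and draft k-mers that are absent from it are flagged as candidate errors. It is not ntEdit's implementation. The class and function names, the hash scheme, the filter size and the toy sequences are all assumptions made for illustration, and the sketch ignores reverse complements and the actual correction step.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over a bit vector (illustrative, not ntEdit's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_filter(reads, k: int, num_bits: int = 1 << 16, num_hashes: int = 3):
    """Insert all read k-mers into a new Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def unsupported_positions(draft: str, bf: BloomFilter, k: int):
    """Start indices of draft k-mers that are absent from the read filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bf]


# Hypothetical toy data: error-free reads tiling a 28 bp "true" sequence,
# and a draft carrying one substitution (C -> A) at index 10.
truth = "ACGTTGCATGCCGATAGCTAAGCTTACG"
reads = [truth[i:i + 12] for i in range(0, len(truth) - 11, 4)]
draft = truth[:10] + "A" + truth[11:]

read_filter = build_filter(reads, k=5)
flagged = unsupported_positions(draft, read_filter, k=5)
print(flagged)  # the k-mers overlapping index 10 start at indices 6 to 10
```

Because a Bloom filter stores only bits rather than the k-mers themselves, memory use depends on the filter size instead of on read depth, which is one plausible reason such a design can scale to very large genomes. The price is a small false-positive rate, so an erroneous k-mer can occasionally appear supported.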

RESULTS

We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 min, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and the whole genome. ntEdit scaled linearly, executing in 30-40 min on those sequences. We show how ntEdit ran in <2 h 20 min to improve upon long- and linked-read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo-haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024.
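As a rough consistency check on the spruce figures, the short calculation below multiplies the stated genome size by the stated edit rate. It assumes the rate of 0.0024 means edits per assembled base, which the abstract does not state explicitly, and it treats 20 Gb as exactly 20 × 10^9 bases. The function name is hypothetical.

```python
def expected_edits(genome_size_bp: float, edits_per_base: float) -> float:
    """Edits implied by a per-base edit rate (assumed interpretation)."""
    return genome_size_bp * edits_per_base


# Figures taken from the abstract: ~20 Gb genome, combined rate 0.0024.
spruce_edits = expected_edits(20e9, 0.0024)
print(f"{spruce_edits / 1e6:.0f} million edits")  # about 48 million
```

Under that assumption the product is about 48 million, in line with the "roughly 50M edits" reported.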

AVAILABILITY AND IMPLEMENTATION

https://github.com/bcgsc/ntedit.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


