GapFiller: a de novo assembly approach to fill the gap within paired reads.

Affiliation

Department of Mathematics and Computer Science, University of Udine, Udine 33100, Italy.

Publication information

BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S8. doi: 10.1186/1471-2105-13-S14-S8. Epub 2012 Sep 7.

DOI:10.1186/1471-2105-13-S14-S8

PMID:23095524

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3439727/

Abstract

BACKGROUND

Next Generation Sequencing technologies provide high genome coverage at a relatively low cost. However, because read length is limited (from 30 bp up to 200 bp), specific bioinformatics problems have become even more difficult to solve. De novo assembly with short reads, for example, is more complicated for at least two reasons: first, the overall amount of "noisy" data to cope with has increased and, second, as read length decreases, the number of unsolvable repeats grows. Our aim is to go to the root of the problem by providing a pre-processing tool capable of producing (in silico) longer and highly accurate sequences from a collection of Next Generation Sequencing reads.

RESULTS

In this paper a seed-and-extend local assembler is presented. The kernel algorithm is a loop that, starting from a read used as a seed, keeps extending it using heuristics whose main goal is to produce a collection of longer, error-free sequences. In particular, GapFiller carefully detects reliable overlaps and clusters similar reads in order to reconstruct the missing part between the two ends of the same insert. The tool's output has been validated in 24 experiments using both simulated and real paired-read datasets. An output sequence is declared correct when the seed's mate is found. In the experiments performed, GapFiller was able to extend a high percentage of the processed seeds and find their mates, with a false-positive rate that turned out to be nearly negligible.
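The seed-and-extend loop and the mate-found check described above can be illustrated with a minimal Python sketch. This is not GapFiller's implementation: the function names, the parameters (`min_overlap`, `min_support`, `min_agreement`, `max_length`), and the use of a simple majority vote in place of the paper's read clustering are all assumptions made for illustration. The sketch also assumes that the mate is sequenced from the opposite strand, so its reverse complement is expected downstream of the seed.

```python
from collections import Counter

# Hypothetical helper: complement table for unambiguous DNA bases only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extend_seed(seed, mate, reads, min_overlap=12, min_support=2,
                min_agreement=0.8, max_length=5000):
    """Greedily extend `seed` to the right, one base at a time.

    Returns (contig, mate_found). The extension stops when the reverse
    complement of `mate` appears in the contig (success), or when no
    sufficiently supported next base exists (failure).
    All thresholds are illustrative assumptions, not GapFiller defaults.
    """
    contig = seed
    target = reverse_complement(mate)  # assumed paired-read orientation
    while len(contig) < max_length:
        if target in contig:
            return contig, True  # mate-found check passed
        votes = Counter()
        for read in reads:
            # Try the longest prefix/suffix overlap first; each read
            # votes at most once for the base that follows the overlap.
            for k in range(len(read) - 1, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    votes[read[k]] += 1
                    break
        if not votes:
            return contig, False  # no read overlaps the contig end
        base, count = votes.most_common(1)[0]
        total = sum(votes.values())
        # "Safe" extension stand-in: enough reads, and they mostly agree.
        if count < min_support or count / total < min_agreement:
            return contig, False
        contig += base
    return contig, False
```

A call such as `extend_seed(seed, mate, reads)` returns the extended sequence together with a flag stating whether the mate was reached. In this sketch only sequences returned with the flag set to `True` would be kept, mirroring the criterion that an output sequence is declared correct when the seed's mate is found.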

CONCLUSIONS

GapFiller, starting from sufficiently high short-read coverage, is able to produce high coverage of accurate longer sequences (from 300 bp up to 3500 bp). The procedure for performing safe extensions, together with the mate-found check, turned out to be a powerful criterion for guaranteeing contig correctness. GapFiller has further potential, as it could be applied in a number of different scenarios, including post-processing validation in insertion/deletion detection pipelines, pre-processing of datasets for de novo assembly pipelines, or any hierarchical approach designed to assemble, analyse or validate pools of sequences.
