Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs.

Affiliations

The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY 10065, USA.

Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University Freiburg, Freiburg 79110, Germany.

Publication information

Bioinformatics. 2022 Sep 2;38(17):4214-4216. doi: 10.1093/bioinformatics/btac460.

Abstract

MOTIVATION

With the current pace at which reference genomes are being produced, the availability of tools that can reliably and efficiently generate genome assembly summary statistics has become critical. Additionally, with the emergence of new algorithms and data types, tools that can improve the quality of existing assemblies through automated and manual curation are required.

RESULTS

We sought to address both these needs by developing gfastats, as part of the Vertebrate Genomes Project (VGP) effort to generate high-quality reference genomes at scale. Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in FASTA, FASTQ or GFA [.gz] format. Gfastats stores assembly sequences internally in a GFA-like format. This feature allows gfastats to seamlessly convert FAST* to and from GFA [.gz] files. Gfastats can also build an assembly graph that can in turn be used to manipulate the underlying sequences following instructions provided by the user, while simultaneously generating key metrics for the new sequences.
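As a rough illustration of the two tasks described above, the following Python sketch computes a few common summary statistics from FASTA input and writes each sequence as a GFA 1 segment line. It is not the gfastats implementation (which is written in C++) and does not reproduce its command-line interface. The function names `read_fasta`, `n50`, `summarize` and `to_gfa` are hypothetical, and the sketch assumes that every FASTA record maps to exactly one segment with no links or gaps.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(records):
    """Compute sequence count, total length, N50 and GC fraction."""
    lengths = [len(seq) for _, seq in records]
    total = sum(lengths)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for _, seq in records)
    return {
        "sequences": len(lengths),
        "total_length": total,
        "n50": n50(lengths),
        "gc_fraction": gc / total if total else 0.0,
    }


def to_gfa(records):
    """Emit a GFA 1 header plus one segment (S) line per sequence.

    Assumption: no links, paths or gaps are inferred from the FASTA input.
    """
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in records]
    return "\n".join(lines) + "\n"


# Small made-up input, for demonstration only.
example = ">seq1 demo\nACGTACGTAC\n>seq2\nGGCC\n>seq3\nAT\nAT\n"
records = list(read_fasta(StringIO(example)))
stats = summarize(records)
gfa = to_gfa(records)
```

Holding the records in one in-memory structure and deriving both the statistics and the GFA text from it mirrors, in a much simplified way, the idea of a single internal representation serving conversion and evaluation.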

AVAILABILITY AND IMPLEMENTATION

Gfastats is implemented in C++. Precompiled releases (Linux, macOS, Windows) and commented source code for gfastats are available under the MIT licence at https://github.com/vgl-hub/gfastats. Examples of how to run gfastats are provided in the GitHub repository. Gfastats is also available in Bioconda, in Galaxy (https://assembly.usegalaxy.eu) and as a MultiQC module (https://github.com/ewels/MultiQC). An automated test workflow is available to ensure consistency across software updates.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a6b/9438950/761b6fc4706a/btac460f1.jpg
