
A scalable and accurate targeted gene assembly tool (SAT-Assembler) for next-generation sequencing data.

Author information

Zhang Yuan, Sun Yanni, Cole James R

Affiliations

Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America.

Center for Microbial Ecology, Michigan State University, East Lansing, Michigan, United States of America.

Publication information

PLoS Comput Biol. 2014 Aug 14;10(8):e1003737. doi: 10.1371/journal.pcbi.1003737. eCollection 2014 Aug.

Abstract

Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Because quality reference genomes are often lacking, de novo assembly is commonly used for RNA-Seq data of non-model organisms and for metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented or chimeric contigs, or have a high memory footprint. In this work, we introduce a targeted gene assembly program, SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that, compared with other assembly tools, SAT-Assembler uses less memory, achieves comparable or better gene coverage, and has a lower chimera rate when assembling a set of genes from one or multiple pathways. Moreover, the family-specific design and rapid homology search make SAT-Assembler naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in the supplementary material.
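
The idea of homology-guided overlap graph construction mentioned in the abstract can be illustrated with a toy sketch. This is not SAT-Assembler's code or its actual algorithm; it only assumes that each read has already been aligned to a gene family model and carries a start position on that model. The names `AlignedRead`, `model_start`, `min_overlap`, and `tolerance` are hypothetical. An edge between two reads is kept only when their sequences overlap and their alignment positions agree with that overlap, which is one plausible way to avoid joining reads from unrelated regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedRead:
    name: str
    seq: str
    # Hypothetical field: where this read's alignment starts on the family model.
    model_start: int


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Longest exact suffix of `a` that is a prefix of `b`; 0 if below min_len."""
    min_len = max(min_len, 1)
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3, tolerance=1):
    """Return {read name: [(successor name, overlap length), ...]}.

    An edge u -> v requires a sequence overlap AND consistent model positions:
    v should start about len(u) - overlap positions after u on the model.
    """
    edges = {r.name: [] for r in reads}
    for u in reads:
        for v in reads:
            if u is v:
                continue
            k = suffix_prefix_overlap(u.seq, v.seq, min_overlap)
            if k == 0:
                continue
            expected_offset = len(u.seq) - k
            observed_offset = v.model_start - u.model_start
            if abs(expected_offset - observed_offset) <= tolerance:
                edges[u.name].append((v.name, k))
    return edges


# Toy data: r1 and r2 overlap and align to adjacent model positions.
# r3 shares the same 3-base overlap with r1 but aligns far away on the model.
reads = [
    AlignedRead("r1", "ACGTAC", 0),
    AlignedRead("r2", "TACGGA", 3),
    AlignedRead("r3", "TACTTT", 40),
]
edges = build_overlap_graph(reads)
```

In this toy data, `r1` overlaps both `r2` and `r3` by three bases, but only the `r1` to `r2` edge survives because `r3` aligns to a distant part of the model. A real tool would use tolerant (inexact) overlaps and an indexed search rather than the all-pairs loop shown here.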
