Enhancing genome assemblies by integrating non-sequence based data.

Author information

Heider Thomas N, Lindsay James, Wang Chenwei, O'Neill Rachel J, Pask Andrew J

Affiliation

Department of Molecular and Cellular Biology, University of Connecticut, Storrs, CT 06269, USA.

Publication information

BMC Proc. 2011 May 28;5 Suppl 2(Suppl 2):S7. doi: 10.1186/1753-6561-5-S2-S7.

Abstract

INTRODUCTION

Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies.

METHODS

The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low-coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers, constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. These data were used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold the contigs with these libraries. To determine how map data compare with sequence-based approaches for enhancing assemblies, we repeated the experiment using 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined the sequence-based and non-sequence-based data to determine how a combined approach could further enhance the quality of the low-coverage de novo reconstruction of the tammar wallaby genome.
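
The construction of pseudo paired-end links from map data might be sketched as follows. This is an illustration in Python, not the authors' Bioperl pipeline. The `Marker` record, the `pseudo_pairs` function, and the rule of linking any two markers on the same chromosome whose map distance falls within the 5-10 MB interval are all assumptions made for the example. Writing the links out in Bambus input format is omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Marker:
    """Hypothetical record: one map marker and the contig it aligned to."""
    name: str
    chrom: str     # chromosome assigned by the comparative map
    map_pos: int   # position on the map, in bp (assumed unit)
    contig: str    # contig hit by the marker's best alignment (e.g. from BLAT)


def pseudo_pairs(markers, min_dist=5_000_000, max_dist=10_000_000):
    """Link markers on the same chromosome that lie min_dist..max_dist apart.

    Each link acts like a mate pair: it states that two different contigs
    sit roughly `dist` bp apart, which a scaffolder can use as evidence.
    """
    by_chrom = {}
    for m in markers:
        by_chrom.setdefault(m.chrom, []).append(m)

    pairs = []
    for chrom_markers in by_chrom.values():
        chrom_markers.sort(key=lambda m: m.map_pos)
        # All-against-all within a chromosome: simple, but quadratic.
        for a, b in combinations(chrom_markers, 2):
            dist = b.map_pos - a.map_pos
            # Links inside a single contig carry no scaffolding information.
            if a.contig != b.contig and min_dist <= dist <= max_dist:
                pairs.append((a.name, b.name, a.contig, b.contig, dist))
    return pairs


if __name__ == "__main__":
    demo = [
        Marker("m1", "chr1", 1_000_000, "c1"),
        Marker("m2", "chr1", 7_000_000, "c2"),
        Marker("m3", "chr1", 8_000_000, "c2"),
        Marker("m4", "chr2", 2_000_000, "c3"),
    ]
    for link in pseudo_pairs(demo):
        print(link)
```

The all-against-all pairing is quadratic in the number of markers per chromosome. That is acceptable for a sketch, but for map sets on the order of 14,000 markers a sliding window over the sorted positions would be the more practical choice.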

RESULTS

Using the map data alone, we were able to order 2.2% of the initial contigs into scaffolds and increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence-based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets resulted in a further 2% increase in the number of initial contigs integrated into a scaffold (55% total) but a 35% increase in N50 scaffold size over the use of sequence-based data alone.
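
For readers unfamiliar with the metric, N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. The sketch below shows one common way to compute it, plus the relative-change arithmetic behind figures such as the rise from 36 KB to 39 KB (about 8.3%). The function names are illustrative and are not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the running sum reaches half the total.
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


def percent_increase(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


if __name__ == "__main__":
    print(n50([40, 30, 20, 10]))
    print(round(percent_increase(36, 39), 1))
```

Because N50 depends on where the running sum crosses the halfway point, joining a few large scaffolds can raise it substantially even when the share of contigs placed in scaffolds barely changes.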

CONCLUSIONS

We provide a relatively simple pipeline, available at http://www.mcb.uconn.edu/fac.php?name=paska, that uses existing bioinformatics tools to integrate map data into a genome assembly. While the map data contributed only minimally to assigning the initial contigs to scaffolds in the new assembly, they greatly increased the N50 size. This process added structure to our low-coverage assembly, greatly increasing its utility in further analyses.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a49/3090765/f1147bc92558/1753-6561-5-S2-S7-1.jpg
