ReRep: computational detection of repetitive sequences in genome survey sequences (GSS).

Author information

Otto Thomas D, Gomes Leonardo H F, Alves-Ferreira Marcelo, de Miranda Antonio B, Degrave Wim M

Affiliation

Laboratory for Functional Genomics and Bioinformatics, IOC, Fiocruz, Rio de Janeiro, Brazil.

Publication information

BMC Bioinformatics. 2008 Sep 9;9:366. doi: 10.1186/1471-2105-9-366.

Abstract

BACKGROUND

Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimation of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Also, the elimination of repetitive sequences from the initial assembly process is important to avoid errors and unnecessary complexity. Repetitive sequences are also of interest in a variety of other studies, for instance as molecular markers.

RESULTS

We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, the causative agent of leishmaniosis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. We evaluated the behaviour of several parameters of the algorithm and make suggestions for tuning the analysis.
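As a rough illustration of what detecting repeats in unassembled reads can involve, the sketch below flags reads that share an exact k-mer with several other reads. At low genome coverage, reads rarely overlap by chance, so such sharing hints at repetitive sequence. This is not the ReRep pipeline, which combines existing bioinformatics tools; the function name `flag_repetitive_reads` and the parameters `k` and `min_reads` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict


def flag_repetitive_reads(reads, k=12, min_reads=3):
    """Return sorted indices of reads that contain a k-mer found in
    at least `min_reads` distinct reads (illustrative heuristic only)."""
    kmer_to_reads = defaultdict(set)

    # Record, for every k-mer, the set of reads in which it occurs.
    for idx, seq in enumerate(reads):
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer_to_reads[seq[i:i + k]].add(idx)

    # A read is flagged if any of its k-mers is shared widely enough.
    flagged = set()
    for read_ids in kmer_to_reads.values():
        if len(read_ids) >= min_reads:
            flagged.update(read_ids)
    return sorted(flagged)


if __name__ == "__main__":
    repeat = "ATTGCCGGATCA"  # made-up repeat unit for the demo
    demo_reads = [
        "AAAAAAAAAA" + repeat,
        repeat + "CCCCCCCCCC",
        "GGGGG" + repeat + "TTTTT",
        "ACACACACACACACACACAC",
        "TGTGTGTGTGTGTGTGTGTG",
    ]
    print(flag_repetitive_reads(demo_reads, k=8, min_reads=3))
```

In this toy input only the first three reads carry the shared unit, so only they are reported. Raising `min_reads` makes the heuristic stricter, which is the kind of threshold one would expect to adjust as coverage grows.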

CONCLUSION

The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in an L. braziliensis GSS dataset generated in our laboratory, and further validated by the analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of sequencing reads and the genome coverage. ReRep is freely available for academic use at http://bioinfo.pdtis.fiocruz.br/ReRep/.
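Because the parameters depend on read length and genome coverage, it can help to estimate coverage before tuning. The sketch below uses the standard Lander-Waterman (Poisson) approximation, which assumes reads are sampled uniformly at random; it is not taken from ReRep itself, and the function names are hypothetical.

```python
import math


def expected_coverage(n_reads, mean_read_length, genome_size):
    """Average sequencing depth c = N * L / G."""
    return n_reads * mean_read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of the genome hit by no read: exp(-c)."""
    return math.exp(-coverage)


def prob_depth_at_least(coverage, depth):
    """Poisson probability that a given base is covered by >= `depth` reads."""
    below = sum(
        math.exp(-coverage) * coverage ** i / math.factorial(i)
        for i in range(depth)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Assumed values for illustration only: 970 reads (from the abstract),
    # a 500 bp mean read length and a 32 Mb genome (both assumptions).
    c = expected_coverage(970, 500, 32_000_000)
    print(f"coverage: {c:.4f}")
    print(f"P(depth >= 3) at a unique base: {prob_depth_at_least(c, 3):.2e}")
```

With these assumed inputs the coverage is about 0.015, so a unique region sampled three or more times would be very unlikely under random sampling, while a sequence present in many genomic copies could plausibly reach that depth.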
