ReRep: computational detection of repetitive sequences in genome survey sequences (GSS).

Authors

Otto Thomas D, Gomes Leonardo H F, Alves-Ferreira Marcelo, de Miranda Antonio B, Degrave Wim M

Affiliation

Laboratory for Functional Genomics and Bioinformatics, IOC, Fiocruz, Rio de Janeiro, Brazil.

Publication information

BMC Bioinformatics. 2008 Sep 9;9:366. doi: 10.1186/1471-2105-9-366.

DOI: 10.1186/1471-2105-9-366

PMID: 18782453

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2559850/

Abstract

BACKGROUND

Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimation of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Also, the elimination of repetitive sequences from the initial assembly process is important to avoid errors and unnecessary complexity. Repetitive sequences are also of interest in a variety of other studies, for instance as molecular markers.

RESULTS

We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, the causative agent of leishmaniosis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. The behaviour of several parameters in the algorithm is evaluated and suggestions are made for tuning the analysis.

CONCLUSION

The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in an L. braziliensis GSS dataset generated in our laboratory, and were further validated by analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of sequencing reads and the genome coverage. ReRep is freely available for academic use at http://bioinfo.pdtis.fiocruz.br/ReRep/.
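The abstract notes that the parameters depend on read length and genome coverage. As a rough illustration of why coverage matters, the sketch below computes the expected depth c = N * L / G and, assuming read starts are uniformly distributed so that depth is approximately Poisson, the chance that a unique position is covered by at least k reads. This is not the ReRep algorithm; the function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected sequencing depth c = N * L / G."""
    return num_reads * read_length / genome_size


def prob_depth_at_least(c: float, k: int) -> float:
    """P(depth >= k) when depth is Poisson-distributed with mean c."""
    below = sum(math.exp(-c) * c**i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical values for illustration only (not taken from the paper).
    c = coverage(num_reads=1000, read_length=500, genome_size=30_000_000)
    print(f"expected coverage: {c:.4f}x")
    print(f"P(depth >= 2) at a unique position: {prob_depth_at_least(c, 2):.6f}")
```

At the low coverage typical of a survey, overlaps between reads from unique regions should be rare under this model, so reads that overlap many others are plausible repeat candidates. As coverage grows, chance overlaps become common and any overlap-based threshold would likely need to be raised accordingly.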
