REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads.

Authors

Chu Chong, Nielsen Rasmus, Wu Yufeng

Affiliations

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, United States of America.

Department of Integrative Biology, University of California, Berkeley, CA 94720, United States of America.

Publication information

PLoS One. 2016 Mar 15;11(3):e0150719. doi: 10.1371/journal.pone.0150719. eCollection 2016.

DOI:10.1371/journal.pone.0150719

PMID:26977803

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4792456/

Abstract

Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem, we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods in terms of both the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host-derived repeat sequences. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo.
