Schaeffer Carly E, Figueroa Nathaniel D, Liu Xiaolin, Karro John E
Department of Computer Science and Software Engineering.
Department of Cell, Molecular, and Structural Biology.
Bioinformatics. 2016 Jun 15;32(12):i209-i215. doi: 10.1093/bioinformatics/btw258.
Transposable elements (TEs) and repetitive DNA make up a sizable fraction of eukaryotic genomes, and their annotation is crucial to the study of the structure, organization, and evolution of any newly sequenced genome. Although RepeatMasker and nHMMER are useful for identifying these repeats, they require a pre-compiled repeat library, which is not always available. De novo identification tools such as Recon, RepeatScout or RepeatGluer serve to identify TEs purely from sequence content, but either are limited by runtimes that prohibit whole-genome use or degrade in quality in the presence of substitutions that disrupt the sequence patterns.
phRAIDER is a de novo TE identification tool that addresses the issue of excessive runtime without sacrificing sensitivity as compared to competing tools. The underlying model is a new definition of elementary repeats that incorporates the PatternHunter spaced seed model, allowing for greater sensitivity in the presence of genomic substitutions. Compared with the premier tool in the literature, RepeatScout, phRAIDER shows an average 10× speedup on any single human chromosome and can process the whole human genome in just over three hours. Here we discuss the tool, the theoretical model underlying the tool, and the results demonstrating its effectiveness.
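To illustrate why a spaced seed tolerates substitutions, the following is a minimal sketch of the general spaced seed idea only; it is not phRAIDER's algorithm or code. The seed pattern `1101011`, the toy sequence, and the function names `spaced_key` and `seed_hits` are assumptions made for this example. A `1` in the seed marks a position that must match and a `0` marks a position that is ignored.

```python
from collections import defaultdict


def spaced_key(window, seed):
    """Keep only the bases at the seed's '1' (must-match) positions."""
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


def seed_hits(sequence, seed, min_count=2):
    """Group window start positions that agree at every '1' position of the seed.

    Returns only the groups that occur at least min_count times.
    """
    groups = defaultdict(list)
    span = len(seed)
    for start in range(len(sequence) - span + 1):
        key = spaced_key(sequence[start:start + span], seed)
        groups[key].append(start)
    return {key: starts for key, starts in groups.items() if len(starts) >= min_count}


# Toy input (hypothetical): two copies of a 7-base element that differ by
# substitutions at offsets 2 and 4, separated by a short spacer.
seq = "ACGTACG" + "TTTTT" + "ACTTCCG"

spaced = seed_hits(seq, "1101011")      # offsets 2 and 4 are ignored
contiguous = seed_hits(seq, "1111111")  # every offset must match

print(spaced)
print(contiguous)
```

In this toy case the spaced seed groups the two diverged copies together because the substitutions fall on ignored positions, while the contiguous seed of the same span finds no repeated window.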
phRAIDER is an open source tool available from https://github.com/karroje/phRAIDER. Contact: karroje@miamiOH.edu or
Supplementary data are available at Bioinformatics online.