Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.
Nucleic Acids Res. 2009 Nov;37(21):7002-13. doi: 10.1093/nar/gkp759.
Long terminal repeat (LTR) retrotransposons and endogenous retroviruses (ERVs) are transposable elements in eukaryotic genomes well suited for computational identification. De novo identification tools determine the position of potential LTR retrotransposon or ERV insertions in genomic sequences. For further analysis, it is desirable to obtain an annotation of the internal structure of such candidates. This article presents LTRdigest, a novel software tool for automated annotation of internal features of putative LTR retrotransposons. It uses local alignment and hidden Markov model-based algorithms to detect retrotransposon-associated protein domains as well as primer binding sites and polypurine tracts. As an example, we used LTRdigest results to identify 88 (near) full-length ERVs in the chromosome 4 sequence of Mus musculus, separating them from truncated insertions and other repeats. Furthermore, we propose a workflow for the use of LTRdigest in de novo LTR retrotransposon classification and perform an exemplary de novo analysis on the Drosophila melanogaster genome as a proof of concept. Using a new method solely based on the annotations generated by LTRdigest, 518 potential LTR retrotransposons were automatically assigned to 62 candidate groups. Representative sequences from 41 of these 62 groups were matched to reference sequences with >80% global sequence similarity.
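To make the notion of a polypurine tract concrete, the following is a minimal sketch of how a purine-rich run just upstream of a 3' LTR could be located. It is not LTRdigest's algorithm: the abstract states that the tool relies on local alignment and hidden Markov model-based methods, whereas this sketch swaps in a plain scan for the longest uninterrupted run of A/G. The function name `find_ppt` and the parameters `window` and `min_len`, including their default values, are assumptions chosen for illustration only.

```python
def find_ppt(seq, ltr3_start, window=30, min_len=8):
    """Return (start, end) of the longest purine run upstream of the 3' LTR.

    Illustrative sketch only, not the LTRdigest method.
    seq        -- nucleotide sequence of the candidate element
    ltr3_start -- 0-based index where the 3' LTR begins (hypothetical input)
    window     -- number of bases upstream of the 3' LTR to search (assumed)
    min_len    -- minimum run length to report (assumed)
    Coordinates are 0-based, end-exclusive; returns None if no run qualifies.
    """
    lo = max(0, ltr3_start - window)
    region = seq[lo:ltr3_start].upper()

    best = None        # (start, end) within region
    run_start = None   # start of the purine run currently being extended

    # A trailing non-purine sentinel closes a run that reaches the region end.
    for i, base in enumerate(region + "N"):
        if base in "AG":
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_len and (best is None or length > best[1] - best[0]):
                best = (run_start, i)
            run_start = None

    if best is None:
        return None
    return (lo + best[0], lo + best[1])


# Toy candidate: pyrimidine filler, an 11-base purine run, then a made-up 3' LTR.
candidate = "CTCTCTCT" + "AAAAGGGGAGA" + "TC" + "TGTAGTCC"
hit = find_ppt(candidate, ltr3_start=21)
```

In this toy input, `hit` covers the 11-base purine run that ends two bases before the assumed LTR boundary. A profile- or HMM-based detector, as used by the tool itself, would additionally tolerate interruptions in the run and score its distance from the LTR, which a fixed exact-run scan like this cannot do.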