Transposable element annotation of the rice genome.

Author information

Juretic Nikoleta, Bureau Thomas E, Bruskiewich Richard M

Affiliation

Department of Biology, McGill University, Montreal, Quebec, H3A 1B1 Canada.

Publication information

Bioinformatics. 2004 Jan 22;20(2):155-60. doi: 10.1093/bioinformatics/bth019.

PMID:14734305

Abstract

MOTIVATION

The high content of repetitive sequences in the genomes of many higher eukaryotes makes annotating them computationally intensive. Presently, the only widely accepted method of searching for and annotating transposable elements (TEs) in large genomic sequences is the RepeatMasker program, which identifies new copies of TEs by pairwise sequence comparisons with a library of known TEs. Profile hidden Markov models (HMMs) have been used successfully to discover distant homologs of known proteins in large protein databases, but this approach has only rarely been applied to known model TE families in genomic DNA.

RESULTS

We used a combination of computational approaches to annotate the TEs in the finished genome of Oryza sativa ssp. japonica, and we discuss the strengths and weaknesses of each method. The approaches were: RepeatMasker in its default configuration, which uses cross_match, an implementation of the Smith-Waterman-Gotoh algorithm; RepeatMasker using WU-BLAST for similarity searching; and the HMMER package, used to search for TEs with profile HMMs. All results were converted into GFF format and post-processed with a set of Perl scripts. RepeatMasker was used for most TE families. The WU-BLAST implementation of RepeatMasker was found to be many times faster than cross_match with only a slight loss in sensitivity, and was therefore used to obtain the final data set. HMMER was used to annotate the Mutator-like element (MULE) superfamily and the miniature inverted-repeat transposable element (MITE) polyphyletic group of families, for which large libraries of elements were available and which could be divided into well-defined families. The HMMER search algorithm was extremely slow for models over 1000 bp in length, so MULE families with members over 1000 bp long were processed with RepeatMasker instead. The main disadvantage of HMMER in this application is that, because it was developed with protein sequences in mind, it does not search the negative DNA strand. Except for TE families with essentially palindromic sequences, reverse-complement models had to be created and run to compensate for this shortcoming. We conclude that modifying RepeatMasker to incorporate libraries of profile HMMs in its searches could improve the detection of degenerated copies of TEs.
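
The reverse-complement step can be illustrated with a short Python sketch. It is not the authors' code (their scripts were written in Perl and are available only on request). It assumes that a reverse-strand model would be built from a reverse-complemented copy of each family's multiple alignment, held here as a plain dictionary of gapped sequences. The function names and the palindrome check are hypothetical helpers introduced for illustration.

```python
# Complement table for IUPAC nucleotide codes (upper and lower case).
# Gap characters such as '-' and '.' are not in the table and pass through unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly gapped) DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_alignment(rows: dict) -> dict:
    """Reverse-complement every row of an alignment.

    Reversing each row keeps the columns aligned, so the result can be
    used as the input alignment for a reverse-strand model.
    """
    return {name: reverse_complement(seq) for name, seq in rows.items()}


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its own reverse complement.

    A strict test; the abstract only says 'essentially palindromic',
    so a real pipeline would probably tolerate some mismatches.
    """
    return seq.upper() == reverse_complement(seq).upper()


if __name__ == "__main__":
    # Toy alignment with made-up sequences, for illustration only.
    alignment = {"copy1": "AAGC-T", "copy2": "AAGCGT"}
    print(reverse_complement_alignment(alignment))
    print(is_palindromic("GAATTC"))
```

Families whose consensus passes a check like `is_palindromic` would not need a second model, which matches the exception noted in the abstract.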

AVAILABILITY

The Perl scripts and TE sequences used in construction of the RepeatMasker library and the profile HMMs are available upon request.
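
As an illustration of the GFF conversion step described in the Results, the following Python sketch turns one line of a RepeatMasker `.out` table into a nine-column GFF-style record. It is an assumption-laden example, not the authors' Perl code. The feature label `transposable_element`, the attribute layout, and the sample input line are all invented for illustration, and the abstract does not say which GFF version was used.

```python
def out_line_to_gff(line: str, source: str = "RepeatMasker"):
    """Convert one RepeatMasker .out line to a tab-separated GFF-style line.

    Returns None for header and blank lines. Assumes the usual .out column
    order: score, %div, %del, %ins, query name, query begin, query end,
    (left), strand, repeat name, repeat class/family, ...
    """
    fields = line.split()
    # Data lines start with an integer Smith-Waterman score; headers do not.
    if len(fields) < 11 or not fields[0].isdigit():
        return None
    score, seqname, start, end = fields[0], fields[4], fields[5], fields[6]
    # RepeatMasker marks complement-strand matches with 'C'.
    strand = "-" if fields[8] == "C" else "+"
    repeat_name, repeat_class = fields[9], fields[10]
    # Hypothetical attribute layout chosen for this sketch.
    attributes = f'Target "{repeat_name}"; Class "{repeat_class}"'
    return "\t".join(
        [seqname, source, "transposable_element", start, end,
         score, strand, ".", attributes]
    )


if __name__ == "__main__":
    # Made-up example line, not real annotation data.
    sample = "  1320 15.6  6.2  0.0  chr01  1000 1500 (5000) C  MITE_example  DNA/MITE  (0) 500 1 1"
    print(out_line_to_gff(sample))
```

Putting every search method's output into one tabular format like this lets a single set of post-processing scripts handle RepeatMasker and HMMER hits alike.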
