Quesneville Hadi, Bergman Casey M, Andrieu Olivier, Autard Delphine, Nouaud Danielle, Ashburner Michael, Anxolabehere Dominique
Laboratoire Dynamique du Génome et Evolution, Institut Jacques Monod, Paris, France.
PLoS Comput Biol. 2005 Jul;1(2):166-75. doi: 10.1371/journal.pcbi.0010022. Epub 2005 Jul 29.
Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced by combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated "TE models" in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments a few hundred nucleotides long and from highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.
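The combined-evidence idea described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the `merge_evidence` function, the coordinates, and the `min_sources` voting threshold are all hypothetical, chosen only to show how hits from several programs might be merged into candidate TE regions before manual curation. The last line checks the nested-TE percentage using the counts quoted in the abstract.

```python
from collections import namedtuple

# Hypothetical record for one piece of computational evidence:
# inclusive genomic coordinates plus the program that reported it.
Hit = namedtuple("Hit", ["start", "end", "source"])


def merge_evidence(hits, min_sources=2):
    """Merge overlapping hits into candidate regions.

    A region is kept only if at least `min_sources` distinct programs
    support it. This threshold is an illustrative assumption, not a
    rule taken from the paper.
    """
    regions = []
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if regions and hit.start <= regions[-1]["end"]:
            # Overlaps the current region: extend it and record the source.
            regions[-1]["end"] = max(regions[-1]["end"], hit.end)
            regions[-1]["sources"].add(hit.source)
        else:
            # No overlap: start a new candidate region.
            regions.append(
                {"start": hit.start, "end": hit.end, "sources": {hit.source}}
            )
    return [
        (r["start"], r["end"], sorted(r["sources"]))
        for r in regions
        if len(r["sources"]) >= min_sources
    ]


# Made-up coordinates; the program names are those listed in the abstract.
hits = [
    Hit(100, 400, "RepeatMasker"),
    Hit(350, 900, "BLASTER"),
    Hit(2000, 2300, "RECON"),
    Hit(5000, 5600, "TBLASTX"),
    Hit(5100, 5500, "RepeatMasker"),
]
candidates = merge_evidence(hits)

# Arithmetic check of the nested-TE figure: 518 of 6,013 copies.
nested_percent = round(100 * 518 / 6013, 1)
```

In this toy input the RECON-only hit is dropped because a single program supports it. A real workflow might instead keep such regions and flag them for manual review in a curation tool such as Apollo.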