Smith Christopher D, Edgar Robert C, Yandell Mark D, Smith Douglas R, Celniker Susan E, Myers Eugene W, Karpen Gary H
Department of Biology, San Francisco State University, San Francisco, CA, United States.
Gene. 2007 Mar 1;389(1):1-9. doi: 10.1016/j.gene.2006.09.011. Epub 2006 Oct 12.
Repetitive sequences are a major constituent of many eukaryotic genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth manual curation and on computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequences has made identification of repeats by DNA identity difficult even in closely related species. As a result, unidentified repeats in genome sequences affect the quality of gene annotations and of annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and were also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and unannotated genomes.
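The idea of integrating nucleotide-level and protein-level evidence into one repeat mask can be illustrated with a minimal sketch. This is not RepeatRunner's actual implementation, which the abstract does not describe in detail; it only assumes that each search yields hit coordinates on the genomic sequence, and that the union of those coordinates is what gets masked. The function names (`merge_intervals`, `soft_mask`) and the hit coordinates are hypothetical and invented for illustration.

```python
from typing import Iterable, List, Tuple

# Half-open, 0-based coordinates [start, end) on the genomic sequence.
Interval = Tuple[int, int]


def merge_intervals(*sources: Iterable[Interval]) -> List[Interval]:
    """Return the union of intervals pooled from any number of evidence sources."""
    pooled = sorted(iv for src in sources for iv in src)
    merged: List[Interval] = []
    for start, end in pooled:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def soft_mask(sequence: str, intervals: Iterable[Interval]) -> str:
    """Lower-case the masked regions and leave the rest upper-case."""
    chars = list(sequence.upper())
    for start, end in intervals:
        chars[start:end] = [c.lower() for c in chars[start:end]]
    return "".join(chars)


# Hypothetical hit coordinates, not taken from the paper.
nucleotide_hits = [(2, 6), (15, 18)]  # stand-in for DNA-level search hits
protein_hits = [(4, 10)]              # stand-in for translated protein search hits

combined = merge_intervals(nucleotide_hits, protein_hits)
masked = soft_mask("ACGTACGTACGTACGTACGT", combined)
print(combined)
print(masked)
```

In this toy case the protein hit extends the first nucleotide hit, so a region that a DNA-identity search alone would have left partly unmasked is covered by the combined mask. That is the kind of gain the abstract attributes to combining the two search types.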