Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada.
Mob DNA. 2013 Dec 20;4(1):28. doi: 10.1186/1759-8753-4-28.
Accurate and complete identification of mobile elements is a challenging task in the current era of sequencing, given their large numbers and frequent truncations. Group II intron retroelements, which consist of a ribozyme and an intron-encoded protein (IEP), are usually identified in bacterial genomes through their IEP; however, the RNA component that defines the intron boundaries is often difficult to identify because of a lack of strong sequence conservation corresponding to the RNA structure. Compounding the problem of boundary definition is the fact that a majority of group II intron copies in bacteria are truncated.
Here we present a pipeline of 11 programs that collect and analyze group II intron sequences from GenBank. The pipeline begins with a BLAST search of GenBank using a set of representative group II IEPs as queries. Subsequent steps download the corresponding genomic sequences and flanks, filter out non-group II introns, assign introns to phylogenetic subclasses, filter out incomplete and/or non-functional introns, and assign IEP sequences and RNA boundaries to the full-length introns. In the final step, the redundancy in the data set is reduced by grouping introns into sets of ≥95% identity, with one example sequence chosen to be the representative.
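The final redundancy-reduction step can be illustrated with a minimal sketch. It is not the pipeline's actual code: the function names `aligned_identity` and `cluster_by_identity` are hypothetical, the greedy single-pass strategy is an assumption, and so is the rule that the first sequence placed in a set becomes its representative (the abstract does not say how the representative is chosen). The identity function here assumes the sequences have already been aligned to equal length; a real run would substitute an alignment-based identity measure.

```python
from typing import Callable, Dict, List


def aligned_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned sequences.

    Assumption: both sequences come from the same alignment and therefore
    have equal, non-zero length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be pre-aligned to equal, non-zero length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def cluster_by_identity(
    seqs: Dict[str, str],
    identity_fn: Callable[[str, str], float] = aligned_identity,
    threshold: float = 0.95,
) -> Dict[str, List[str]]:
    """Greedily group sequences whose identity to a representative is >= threshold.

    Returns a mapping from representative name to the names of all members
    in its set (the representative is listed first).
    """
    clusters: Dict[str, List[str]] = {}
    # Process names in sorted order so the result is deterministic.
    for name in sorted(seqs):
        for rep in clusters:
            if identity_fn(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new set.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    example = {
        "intron_a": "ACGT" * 10,
        "intron_b": "ACGT" * 5 + "TCGT" + "ACGT" * 4,  # one mismatch vs intron_a
        "intron_c": "TTTT" * 10,
    }
    for rep, members in cluster_by_identity(example).items():
        print(rep, members)
```

Because each sequence is compared only against existing representatives, the outcome depends on processing order; that is a trade-off of the greedy approach chosen for this sketch, not a property stated for the published pipeline.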
These programs should be useful for comprehensive identification of group II introns in sequence databases as data continue to rapidly accumulate.