Kurtz Stefan, Narechania Apurva, Stein Joshua C, Ware Doreen
Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.
BMC Genomics. 2008 Oct 31;9:517. doi: 10.1186/1471-2164-9-517.
The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation based on counting occurrences of k-mers has previously been used to distinguish TEs from low-copy genic regions, but currently available software solutions are impractical, either because of high memory requirements or because they are specialized for specific user tasks.
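To make the idea of k-mer-based repeat annotation concrete, here is a minimal sketch in Python. It is not Tallymer's implementation: it uses an in-memory hash table, which is exactly the kind of approach that becomes memory-hungry on large genomes. The function names `count_kmers` and `mask_repeats`, the toy reads, and the threshold `min_count` are hypothetical, and reverse-complement strands are ignored for simplicity.

```python
from collections import Counter

DNA = set("ACGT")

def count_kmers(seqs, k):
    """Count every k-mer over A/C/G/T in a collection of sequences (hypothetical helper)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= DNA:  # skip k-mers containing N or other symbols
                counts[kmer] += 1
    return counts

def mask_repeats(seq, counts, k, min_count):
    """Lower-case every base covered by a k-mer seen at least min_count times."""
    seq = seq.upper()
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if counts.get(seq[i:i + k], 0) >= min_count:
            for j in range(i, i + k):
                masked[j] = seq[j].lower()
    return "".join(masked)

# Toy data standing in for shotgun reads; real inputs would be far larger.
reads = ["ACGTACGTAC", "TTGCA"]
index = count_kmers(reads, k=4)
print(mask_repeats("GGACGTGG", index, k=4, min_count=2))
```

In this toy case only the 4-mer ACGT occurs twice in the reads, so only those four bases of the query are soft-masked. Low-copy (upper-case) stretches are the candidates for genic regions.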
Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays, which gives much greater flexibility in the choice of k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (approximately 0.45x), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.
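The flexibility in k that a suffix-array index provides can be illustrated with a small sketch: once the suffixes of a sequence are sorted and the longest common prefix (LCP) between neighbouring suffixes is known, k-mer counts for any k fall out of a single scan, without rebuilding the index. The code below is an illustration under simplifying assumptions, not Tallymer's algorithm: construction is naive (fine for toy input only), a single sequence is indexed, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_lcp(text, sa):
    """lcp[r] = length of the common prefix of the suffixes at ranks r-1 and r."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b, n = sa[r - 1], sa[r], 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[r] = n
    return lcp

def kmer_counts(text, sa, lcp, k):
    """Count k-mers for any k by scanning the same index.

    Suffixes sharing their first k characters are adjacent in the suffix
    array, so a run of ranks with lcp >= k is one k-mer and its occurrences.
    """
    counts = {}
    current = None
    for r, start in enumerate(sa):
        if start + k > len(text):
            continue  # suffix too short to contain a k-mer
        if r > 0 and lcp[r] >= k:
            counts[current] += 1  # same k-mer as the previous rank
        else:
            current = text[start:start + k]
            counts[current] = 1
    return counts

text = "ACGTACGTAC"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
for k in (2, 4):  # one index, several values of k
    print(k, kmer_counts(text, sa, lcp, k))
```

The index is built once and queried for both k = 2 and k = 4, which is the property a hash table keyed on fixed-length k-mers lacks. A production tool would use linear-time construction and compact encodings to reach billions of bases.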
The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.