A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.

Authors

Kurtz Stefan, Narechania Apurva, Stein Joshua C, Ware Doreen

Affiliation

Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.

Publication

BMC Genomics. 2008 Oct 31;9:517. doi: 10.1186/1471-2164-9-517.

Abstract

BACKGROUND

The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation based on counting occurrences of k-mers has previously been used to distinguish TEs from low-copy genic regions, but currently available software solutions are impractical because of their high memory requirements or their specialization for specific user tasks.

RESULTS

Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives much greater flexibility in the choice of the k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (approximately 0.45x), highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class and DNA transposons. Among expressed sequence tags (ESTs), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.
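The idea of counting k-mers from a suffix array, and why one index can serve many values of k, can be illustrated with a toy sketch. This is not Tallymer's implementation: it uses a naive sorted-suffix list instead of an enhanced suffix array, and the function names (`suffix_array`, `kmer_counts`, `repeat_summary`) and the example sequence are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def suffix_array(seq: str) -> list[int]:
    """Naive suffix array: start positions ordered by their suffix.

    Assumption: toy-sized input only; this is far slower and larger
    than the enhanced suffix arrays the abstract refers to.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def kmer_counts(seq: str, sa: list[int], k: int) -> dict[str, int]:
    """Count all k-mers by grouping adjacent suffixes sharing a k-prefix.

    In sorted suffix order, suffixes that start with the same k
    characters are contiguous, so a single scan yields every count.
    """
    # Suffixes shorter than k contain no complete k-mer and are skipped.
    prefixes = (seq[i:i + k] for i in sa if i + k <= len(seq))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(prefixes)}


def repeat_summary(counts: dict[str, int], min_count: int) -> tuple[float, float]:
    """Return (share of k-mer occurrences, share of distinct k-mers)
    belonging to k-mers seen at least min_count times."""
    high = [c for c in counts.values() if c >= min_count]
    return sum(high) / sum(counts.values()), len(high) / len(counts)


if __name__ == "__main__":
    seq = "ACGTACGTAC"          # hypothetical toy sequence
    sa = suffix_array(seq)      # built once, reused for every k
    for k in (2, 4):
        counts = kmer_counts(seq, sa, k)
        print(k, counts, repeat_summary(counts, min_count=2))
```

The index is built once and then queried for each k, which is the flexibility the abstract attributes to the suffix-array approach. The second function mirrors, on toy data, the kind of comparison reported above between the share of the sequence covered by repetitive k-mers and the share of distinct k-mers they represent.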

CONCLUSION

The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.
