A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.

Authors

Kurtz Stefan, Narechania Apurva, Stein Joshua C, Ware Doreen

Affiliation

Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.

Publication information

BMC Genomics. 2008 Oct 31;9:517. doi: 10.1186/1471-2164-9-517.

DOI:10.1186/1471-2164-9-517

PMID:18976482

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2613927/

Abstract

BACKGROUND

The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks.
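The idea of repeat annotation by k-mer counting can be illustrated with a minimal sketch. It is not taken from the paper or from any existing tool: the function names `count_kmers` and `repeat_mask`, the toy sequence, and the threshold are all hypothetical, and reverse complements are ignored for brevity. It uses a plain hash table, which is the kind of approach whose memory use grows quickly with the number of distinct k-mers.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq using a hash table."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeat_mask(seq, counts, k, min_count):
    """Flag each position covered by a k-mer that occurs at least min_count times."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                mask[j] = True
    return mask

# Toy example: a tandem ACGT repeat followed by a unique tail.
seq = "ACGTACGTACGTTTGCA"
counts = count_kmers(seq, 4)
mask = repeat_mask(seq, counts, 4, 2)
print("".join("R" if m else "." for m in mask))
```

Positions marked `R` are covered by 4-mers seen at least twice, so the repeated prefix is flagged while the low-copy tail is left unmasked.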

RESULTS

Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (approximately 0.45x) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low-complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (EST), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity.
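Why a suffix-array index is flexible in the choice of k can be sketched as follows. This is an illustration only, not Tallymer's implementation: the construction below is deliberately naive (sorting whole suffixes, which would not scale to billions of bases), the function names are hypothetical, and only the suffix array plus a longest-common-prefix (LCP) table are used, which are the core tables of an enhanced suffix array. The point is that one index, built once, answers k-mer counts for any k: all occurrences of a k-mer are adjacent in suffix order, and adjacent suffixes share a k-mer exactly when their LCP value is at least k.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_lcp(text, sa):
    """lcp[r] = length of the common prefix of suffixes sa[r-1] and sa[r]."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b, n = sa[r - 1], sa[r], 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[r] = n
    return lcp

def kmer_counts(text, sa, lcp, k):
    """Count k-mers by scanning runs of adjacent suffixes with LCP >= k."""
    counts = {}
    r = 0
    while r < len(sa):
        start = sa[r]
        if start + k > len(text):      # suffix too short to contain a k-mer
            r += 1
            continue
        end = r + 1
        while end < len(sa) and lcp[end] >= k:
            end += 1                   # same k-mer prefix as the previous suffix
        counts[text[start:start + k]] = end - r
        r = end
    return counts

text = "ACGTACGTACGTTTGCA"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
for k in (2, 4, 6):                    # same index, different k
    print(k, kmer_counts(text, sa, lcp, k))
```

Changing k only changes the threshold in the scan; the index itself is not rebuilt, in contrast to a hash table keyed on k-mers of one fixed length.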

CONCLUSION

The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see http://www.zbh.uni-hamburg.de/Tallymer.
