Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy.

Affiliations

Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.

Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.

Publication information

PLoS One. 2021 Oct 14;16(10):e0258693. doi: 10.1371/journal.pone.0258693. eCollection 2021.

DOI:10.1371/journal.pone.0258693

PMID:34648558

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8516232/

Abstract

Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
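
The two quantities named in the abstract, k-mer sequence space coverage and pairwise Jaccard similarity, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes exact k-mer sets built from the forward strand only, skips k-mers containing ambiguous bases, and takes "coverage" to mean the fraction of the 4^k possible k-mers observed in a genome. The function names are hypothetical, and the toy sequences and k = 3 are for illustration only (the paper works with k = 11 to 41).

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq (forward strand only)."""
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= valid:
            kmers.add(kmer)
    return kmers


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A intersect B| / |A union B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def sequence_space_coverage(seq: str, k: int) -> float:
    """Fraction of the 4**k possible k-mers that occur in seq."""
    return len(kmer_set(seq, k)) / 4 ** k


if __name__ == "__main__":
    genome_a = "ACGTACGTGA"  # toy sequences, not real genomes
    genome_b = "ACGTTCGTGA"
    k = 3
    print(jaccard(kmer_set(genome_a, k), kmer_set(genome_b, k)))
    print(sequence_space_coverage(genome_a, k))
```

A pairwise matrix of 1 minus the Jaccard similarity over many genomes is the kind of distance input that a hierarchical clustering routine would then take. At genome scale, exact sets of 21- or 31-mers become large, so sketching methods such as MinHash are commonly used to estimate the same similarity.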
