K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features.

Authors

Sievers Aaron, Bosiek Katharina, Bisch Marc, Dreessen Chris, Riedel Jascha, Froß Patrick, Hausmann Michael, Hildenbrand Georg

Affiliations

Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.

Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany.

Publication information

Genes (Basel). 2017 Apr 19;8(4):122. doi: 10.3390/genes8040122.

DOI:10.3390/genes8040122

PMID:28422050

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5406869/

Abstract

In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae, formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.
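The idea of local k-mer spectra compared by vector correlation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed sliding window, the step size, and the choice of the Pearson coefficient as the "basic vector correlation method" are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_spectrum(seq, k):
    """Relative frequency of each of the 4**k possible k-mers in seq.

    Words are counted with overlap; words containing letters other than
    A, C, G, T (for example N) are ignored.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def local_spectra(seq, k, window, step):
    """(start, spectrum) pairs for windows sliding along seq.

    The window and step sizes are hypothetical parameters, not values
    taken from the paper.
    """
    return [(start, kmer_spectrum(seq[start:start + window], k))
            for start in range(0, len(seq) - window + 1, step)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy sequence: a mixed region followed by a poly-T region.
toy = "ACGTACGTACGT" + "T" * 12
spectra = local_spectra(toy, k=2, window=12, step=12)
for start, spectrum in spectra:
    print(start, round(pearson(spectra[0][1], spectrum), 3))
```

Plotting the correlation of each window against a reference spectrum (or against every other window) along the start coordinate gives a simple positional map of the kind the abstract calls a genomic map; for real genomes the window length would have to be chosen much larger than in this toy example.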

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4cea/5406869/853bfad9c2f6/genes-08-00122-g001.jpg

Similar articles

K-mer-Based Motif Analysis in Insect Species across Anopheles, Drosophila, and Glossina Genera and Its Application to Species Classification.

Comput Math Methods Med. 2019 Nov 15;2019:4259479. doi: 10.1155/2019/4259479. eCollection 2019.

Genome classification improvements based on k-mer intervals in sequences.

Genomics. 2019 Dec;111(6):1574-1582. doi: 10.1016/j.ygeno.2018.11.001. Epub 2018 Nov 13.

Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis.

BMC Bioinformatics. 2016 Jan 16;17:38. doi: 10.1186/s12859-015-0875-7.

KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis.

Front Bioeng Biotechnol. 2020 Sep 23;8:556413. doi: 10.3389/fbioe.2020.556413. eCollection 2020.

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.

BMC Genomics. 2008 Oct 31;9:517. doi: 10.1186/1471-2164-9-517.

KmerAperture: Retaining k-mer synteny for alignment-free extraction of core and accessory differences between bacterial genomes.

PLoS Genet. 2024 Apr 29;20(4):e1011184. doi: 10.1371/journal.pgen.1011184. eCollection 2024 Apr.

kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding.

J Comput Biol. 2022 Sep;29(9):1001-1021. doi: 10.1089/cmb.2021.0536. Epub 2022 May 20.

KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences.

Mol Phylogenet Evol. 2023 Feb;179:107662. doi: 10.1016/j.ympev.2022.107662. Epub 2022 Nov 11.

A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes.

BMC Genomics. 2023 May 18;24(1):266. doi: 10.1186/s12864-023-09373-7.

Cited by

Positional frequency chaos game representation for machine learning-based classification of crop lncRNAs.

bioRxiv. 2025 Jun 7:2025.06.03.657533. doi: 10.1101/2025.06.03.657533.

Peculiar k-mer Spectra Are Correlated with 3D Contact Frequencies and Breakpoint Regions in the Human Genome.

Genes (Basel). 2024 Sep 25;15(10):1247. doi: 10.3390/genes15101247.

Specific Patterns in Correlations of Super-Short Tandem Repeats (SSTRs) with G+C Content, Genic and Intergenic Regions, and Retrotransposons on All Human Chromosomes.

Genes (Basel). 2023 Dec 25;15(1):33. doi: 10.3390/genes15010033.

Spatial-Temporal Genome Regulation in Stress-Response and Cell-Fate Change.

Int J Mol Sci. 2023 Jan 31;24(3):2658. doi: 10.3390/ijms24032658.

Defining the characteristics of interferon-alpha-stimulated human genes: insight from expression data and machine learning.

Gigascience. 2022 Nov 18;11. doi: 10.1093/gigascience/giac103.

Discovery of archaeal fusexins homologous to eukaryotic HAP2/GCS1 gamete fusion proteins.

Nat Commun. 2022 Jul 6;13(1):3880. doi: 10.1038/s41467-022-31564-1.

WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs.

PLoS One. 2022 Apr 15;17(4):e0267106. doi: 10.1371/journal.pone.0267106. eCollection 2022.

Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.

IEEE Open J Eng Med Biol. 2020 Jul 14;1:214-219. doi: 10.1109/OJEMB.2020.3009055. eCollection 2020.

PlasmidHostFinder: Prediction of Plasmid Hosts Using Random Forest.

mSystems. 2022 Apr 26;7(2):e0118021. doi: 10.1128/msystems.01180-21. Epub 2022 Apr 6.

Eukaryotic Genomes Show Strong Evolutionary Conservation of k-mer Composition and Correlation Contributions between Introns and Intergenic Regions.

Genes (Basel). 2021 Oct 1;12(10):1571. doi: 10.3390/genes12101571.
