kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding.

Affiliations

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois, USA.

Publication information

J Comput Biol. 2022 Sep;29(9):1001-1021. doi: 10.1089/cmb.2021.0536. Epub 2022 May 20.

DOI:10.1089/cmb.2021.0536

PMID:35593919

Abstract

The comparison of DNA sequences is of great significance in genomic analysis. Although traditional multiple sequence alignment (MSA) is widely used for evolutionary analysis, optimal alignment becomes computationally intractable as the number of sequences increases, owing to the intrinsic computational complexity of MSA. Numerous k-mer alignment-free methods have been proposed, but existing ones may not truly capture the contextual structure of the sequences. In this study, we present a novel contextual k-mer alignment-free method, called kmer2vec, in which the k-mers of a sequence are semantically embedded as word2vec vectors, word2vec being an essential technique in natural language processing. The method thereby converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares sequences in that space. Because the word2vec vectors are trained on the contextual relationships of k-mers in the genomes, the method may extract valuable structural information from the sequences and properly reflect the relationships among them. The proposed method is optimized over the word2vec training parameters and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method for phylogenetic tree construction and species clustering. The method runs much faster than MSA, and the phylogenetic relationships it constructs are more accurate than those of the conventional k-mer alignment-free method. This approach can therefore provide new perspectives on phylogeny and evolution and makes the analysis of large genomes feasible. In addition, we discuss special parameterization in the construction of the k-mer word2vec embedding. Combining kmer2vec with clustering methods also yields an effective tool for rapid SARS-CoV-2 typing.
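
The pipeline described in the abstract (split sequences into k-mers, learn k-mer vectors from context, map each sequence to a point, compare points) can be illustrated with a minimal sketch. It is not the authors' implementation. The small skip-gram trainer with uniformly drawn negative samples only stands in for word2vec, and averaging k-mer vectors into a sequence vector and using cosine distance are assumptions, since the abstract does not state the aggregation rule or the distance. All function names, parameter values, and toy sequences are hypothetical.

```python
import math
import random

def kmers(seq, k):
    """Split a sequence into overlapping k-mers, which play the role of words."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train_skipgram(sentences, dim=8, window=2, negatives=3, epochs=20, lr=0.05, seed=0):
    """Tiny skip-gram with negative sampling; a stand-in for word2vec (assumption)."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    # Word vectors start small and random; context vectors start at zero.
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]
    for _ in range(epochs):
        for sent in sentences:
            ids = [index[w] for w in sent]
            for pos, center in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    # One observed context k-mer plus uniformly sampled negatives.
                    targets = [(ids[ctx_pos], 1.0)]
                    for _ in range(negatives):
                        neg = rng.randrange(len(vocab))
                        if neg != ids[ctx_pos]:
                            targets.append((neg, 0.0))
                    v = w_in[center]
                    grad_center = [0.0] * dim
                    for target, label in targets:
                        u = w_out[target]
                        score = sum(a * b for a, b in zip(v, u))
                        score = max(-30.0, min(30.0, score))  # keep exp() in range
                        pred = 1.0 / (1.0 + math.exp(-score))
                        g = lr * (label - pred)
                        for d in range(dim):
                            grad_center[d] += g * u[d]
                            u[d] += g * v[d]
                    for d in range(dim):
                        v[d] += grad_center[d]
    return {w: w_in[i] for w, i in index.items()}

def sequence_vector(seq, k, embedding):
    """Average the vectors of a sequence's k-mers (assumed aggregation rule)."""
    words = [w for w in kmers(seq, k) if w in embedding]
    dim = len(next(iter(embedding.values())))
    if not words:
        return [0.0] * dim
    return [sum(embedding[w][d] for w in words) / len(words) for d in range(dim)]

def cosine_distance(a, b):
    """Cosine distance between two sequence vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

# Toy sequences invented for illustration; real inputs would be whole genomes.
genomes = {
    "seq_a": "ATGCGTACGTTAGCATGCGTACGTTAGC",
    "seq_b": "ATGCGTACGTTAGCATGCGAACGTTAGC",
    "seq_c": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
k = 3
corpus = [kmers(s, k) for s in genomes.values()]
embedding = train_skipgram(corpus)
vectors = {name: sequence_vector(s, k, embedding) for name, s in genomes.items()}
names = sorted(vectors)
distances = {(a, b): cosine_distance(vectors[a], vectors[b]) for a in names for b in names}
```

The resulting pairwise distance table is the kind of input a tree-building or clustering step would take. In practice an established word2vec implementation would replace `train_skipgram`, and the choices of k, window size, and vector dimension would need tuning, which the abstract says the authors address.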
