16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

Affiliations

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America.

Publication information

PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.

DOI:10.1371/journal.pcbi.1006721

PMID:30807567

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC6407789/

Abstract

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
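The pipeline the abstract describes (split reads into k-mers, learn Skip-Gram vectors for them, then combine k-mer vectors into one vector per sequence) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy read, and the values of `k` and `window` are hypothetical. The abstract does not name the sentence embedding technique, so a plain unweighted mean of k-mer vectors is swapped in here as a stand-in, and the Skip-Gram training itself is omitted; only the (center, context) pairs it would consume are generated.

```python
from typing import Dict, List, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token and each neighbor
    at most `window` positions away. These are the training examples a
    Skip-Gram model would learn from."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def mean_embedding(tokens: List[str],
                   vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens that have one.
    Assumption: a plain mean stands in for the unnamed sentence
    embedding technique used in the paper."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Toy read and hand-written 2-dimensional vectors, for illustration only.
read = "ACGTAC"
tokens = kmers(read, k=3)
pairs = skipgram_pairs(tokens, window=1)
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
read_vector = mean_embedding(tokens, toy_vectors)
```

In practice the k-mer vectors would come from a trained word2vec model rather than a hand-written dictionary, and the same averaging step could be applied to all reads from a sample or body site to obtain a sample-level vector.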

Similar articles

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.

Bioinformatics. 2018 Jul 1;34(13):i32-i42. doi: 10.1093/bioinformatics/bty296.

Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.

Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad617.

Species-level bacterial community profiling of the healthy sinonasal microbiome using Pacific Biosciences sequencing of full-length 16S rRNA genes.

Microbiome. 2018 Oct 23;6(1):190. doi: 10.1186/s40168-018-0569-2.

Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering.

Microbiome. 2015 Oct 5;3:43. doi: 10.1186/s40168-015-0105-6.

Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies.

Microbiome. 2016 Nov 25;4(1):62. doi: 10.1186/s40168-016-0208-8.

HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies.

Genome Biol. 2018 Jun 27;19(1):82. doi: 10.1186/s13059-018-1450-0.

Learned protein embeddings for machine learning.

Bioinformatics. 2018 Aug 1;34(15):2642-2648. doi: 10.1093/bioinformatics/bty178.

An extended single-index multiplexed 16S rRNA sequencing for microbial community analysis on MiSeq illumina platforms.

J Basic Microbiol. 2016 Mar;56(3):321-6. doi: 10.1002/jobm.201500420. Epub 2015 Oct 1.

A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis.

Microbiome. 2017 Aug 14;5(1):100. doi: 10.1186/s40168-017-0314-2.

Cited by

Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data.

PLoS Comput Biol. 2025 May 7;21(5):e1011353. doi: 10.1371/journal.pcbi.1011353. eCollection 2025 May.

Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.

mSystems. 2025 Mar 18;10(3):e0155024. doi: 10.1128/msystems.01550-24. Epub 2025 Feb 20.

RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models.

Heliyon. 2025 Jan 6;11(2):e41488. doi: 10.1016/j.heliyon.2024.e41488. eCollection 2025 Jan 30.

scEGG: an exogenous gene-guided clustering method for single-cell transcriptomic data.

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae483.

The Role and Applications of Artificial Intelligence in the Treatment of Chronic Pain.

Curr Pain Headache Rep. 2024 Aug;28(8):769-784. doi: 10.1007/s11916-024-01264-0. Epub 2024 Jun 1.

Deep learning methods in metagenomics: a review.

Microb Genom. 2024 Apr;10(4). doi: 10.1099/mgen.0.001231.

Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE.

BMC Biol. 2023 Jan 24;21(1):12. doi: 10.1186/s12915-023-01510-8.

A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures.

PLoS Comput Biol. 2023 Jan 6;19(1):e1010821. doi: 10.1371/journal.pcbi.1010821. eCollection 2023 Jan.

Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity.

Biology (Basel). 2022 Dec 8;11(12):1786. doi: 10.3390/biology11121786.

Revealing General Patterns of Microbiomes That Transcend Systems: Potential and Challenges of Deep Transfer Learning.

mSystems. 2022 Feb 22;7(1):e0105821. doi: 10.1128/msystems.01058-21. Epub 2022 Jan 18.
