16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

Affiliations

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America.

Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America.

Publication information

PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.

DOI: 10.1371/journal.pcbi.1006721
PMID: 30807567
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6407789/
Abstract

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
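
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the k-mer length, window size, toy vectors, and weighting constant `a` are assumptions chosen for illustration. Because the abstract does not name the sentence embedding technique, the sketch assumes a frequency-weighted average of k-mer vectors (in the style of smooth inverse frequency weighting, without any principal-component removal). In practice the k-mer vectors would come from a trained Skip-Gram model rather than being written by hand.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens, window):
    """Build (center, context) pairs, the training examples Skip-Gram uses."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def embed_sequence(tokens, vectors, counts, a=1e-3):
    """Frequency-weighted mean of k-mer vectors (assumed weighting scheme).

    Each k-mer gets weight a / (a + p), where p is its relative frequency,
    so rare k-mers contribute more than common ones.
    """
    total = sum(counts.values()) or 1
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    used = 0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip k-mers without a learned vector
        weight = a / (a + counts.get(tok, 0) / total)
        for d in range(dim):
            acc[d] += weight * vectors[tok][d]
        used += 1
    return [x / used for x in acc] if used else acc


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy demonstration with hypothetical 2-dimensional k-mer vectors.
tokens = kmers("ACGTAC", 3)
pairs = skipgram_pairs(tokens, window=1)
toy_vectors = {
    "ACG": [1.0, 0.0],
    "CGT": [0.0, 1.0],
    "GTA": [1.0, 1.0],
    "TAC": [0.5, 0.5],
}
toy_counts = Counter(tokens)
sequence_vector = embed_sequence(tokens, toy_vectors, toy_counts)
print(tokens)
print(pairs[:3])
print(sequence_vector)
```

Applying `embed_sequence` to all k-mers from a sample or body site, instead of a single read, would give a sample-level vector in the same space. Cosine similarity is only one possible way to compare such vectors; the abstract does not specify the distance used.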

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/cfb3a2ca38e4/pcbi.1006721.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/d2e449f4d0a2/pcbi.1006721.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/d60abe7dc89c/pcbi.1006721.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/9c1aa7893063/pcbi.1006721.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/49bbbdb36188/pcbi.1006721.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/fb2c268cbb46/pcbi.1006721.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/0a0e4c16c4e2/pcbi.1006721.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/1214956adf81/pcbi.1006721.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/36d3/6407789/e5041abe41d9/pcbi.1006721.g009.jpg

Similar articles

1. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.
   PLoS Comput Biol. 2019 Feb 26;15(2):e1006721. doi: 10.1371/journal.pcbi.1006721. eCollection 2019 Feb.
2. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.
   Bioinformatics. 2018 Jul 1;34(13):i32-i42. doi: 10.1093/bioinformatics/bty296.
3. Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.
   Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad617.
4. Species-level bacterial community profiling of the healthy sinonasal microbiome using Pacific Biosciences sequencing of full-length 16S rRNA genes.
   Microbiome. 2018 Oct 23;6(1):190. doi: 10.1186/s40168-018-0569-2.
5. Improved OTU-picking using long-read 16S rRNA gene amplicon sequencing and generic hierarchical clustering.
   Microbiome. 2015 Oct 5;3:43. doi: 10.1186/s40168-015-0105-6.
6. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies.
   Microbiome. 2016 Nov 25;4(1):62. doi: 10.1186/s40168-016-0208-8.
7. HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies.
   Genome Biol. 2018 Jun 27;19(1):82. doi: 10.1186/s13059-018-1450-0.
8. Learned protein embeddings for machine learning.
   Bioinformatics. 2018 Aug 1;34(15):2642-2648. doi: 10.1093/bioinformatics/bty178.
9. An extended single-index multiplexed 16S rRNA sequencing for microbial community analysis on MiSeq illumina platforms.
   J Basic Microbiol. 2016 Mar;56(3):321-6. doi: 10.1002/jobm.201500420. Epub 2015 Oct 1.
10. A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis.
    Microbiome. 2017 Aug 14;5(1):100. doi: 10.1186/s40168-017-0314-2.

Cited by

1. Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data.
   PLoS Comput Biol. 2025 May 7;21(5):e1011353. doi: 10.1371/journal.pcbi.1011353. eCollection 2025 May.
2. Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.
   mSystems. 2025 Mar 18;10(3):e0155024. doi: 10.1128/msystems.01550-24. Epub 2025 Feb 20.
3. RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models.
   Heliyon. 2025 Jan 6;11(2):e41488. doi: 10.1016/j.heliyon.2024.e41488. eCollection 2025 Jan 30.
4. scEGG: an exogenous gene-guided clustering method for single-cell transcriptomic data.
   Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae483.
5. The Role and Applications of Artificial Intelligence in the Treatment of Chronic Pain.
   Curr Pain Headache Rep. 2024 Aug;28(8):769-784. doi: 10.1007/s11916-024-01264-0. Epub 2024 Jun 1.
6. Deep learning methods in metagenomics: a review.
   Microb Genom. 2024 Apr;10(4). doi: 10.1099/mgen.0.001231.
7. Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE.
   BMC Biol. 2023 Jan 24;21(1):12. doi: 10.1186/s12915-023-01510-8.
8. A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures.
   PLoS Comput Biol. 2023 Jan 6;19(1):e1010821. doi: 10.1371/journal.pcbi.1010821. eCollection 2023 Jan.
9. Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity.
   Biology (Basel). 2022 Dec 8;11(12):1786. doi: 10.3390/biology11121786.
10. Revealing General Patterns of Microbiomes That Transcend Systems: Potential and Challenges of Deep Transfer Learning.
    mSystems. 2022 Feb 22;7(1):e0105821. doi: 10.1128/msystems.01058-21. Epub 2022 Jan 18.

References

1. Analysis and correction of compositional bias in sparse sequencing count data.
   BMC Genomics. 2018 Nov 6;19(1):799. doi: 10.1186/s12864-018-5160-5.
2. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.
   Bioinformatics. 2018 Jul 1;34(13):i32-i42. doi: 10.1093/bioinformatics/bty296.
3. American Gut: an Open Platform for Citizen Science Microbiome Research.
   mSystems. 2018 May 15;3(3). doi: 10.1128/mSystems.00031-18. eCollection 2018 May-Jun.
4. Opportunities and obstacles for deep learning in biology and medicine.
   J R Soc Interface. 2018 Apr;15(141). doi: 10.1098/rsif.2017.0387.
5. Updating the 97% identity threshold for 16S ribosomal RNA OTUs.
   Bioinformatics. 2018 Jul 15;34(14):2371-2375. doi: 10.1093/bioinformatics/bty113.
6. The human skin microbiome.
   Nat Rev Microbiol. 2018 Mar;16(3):143-155. doi: 10.1038/nrmicro.2017.157. Epub 2018 Jan 15.
7. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding.
   Bioinformatics. 2017 Jul 15;33(14):i92-i101. doi: 10.1093/bioinformatics/btx234.
8. Gut microbiota and IBD: causation or correlation?
   Nat Rev Gastroenterol Hepatol. 2017 Oct;14(10):573-584. doi: 10.1038/nrgastro.2017.88. Epub 2017 Jul 19.
9. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.
   ISME J. 2017 Dec;11(12):2639-2643. doi: 10.1038/ismej.2017.119. Epub 2017 Jul 21.
10. A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity.
    NPJ Biofilms Microbiomes. 2016 Apr 20;2:16004. doi: 10.1038/npjbiofilms.2016.4. eCollection 2016.