
SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing.

Author information

Ludwig David W, Guptil Christopher, Alexander Nicholas R, Zhalnina Kateryna, Wipf Edi M-L, Khasanova Albina, Barber Nicholas A, Swingley Wesley, Walker Donald M, Phillips Joshua L

Affiliations

Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN 37132, United States.

Department of Mathematics and Computer Science, Miami University, Oxford, OH 45056, United States.

Publication information

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf370.

DOI:10.1093/bioinformatics/btaf370
PMID:40563247
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12245400/
Abstract

MOTIVATION

High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss key interactions between microorganisms, significantly hindering our understanding of how these interactions influence microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which can inadvertently introduce protocol-specific bias.

RESULTS

Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models that process HTS data to produce contextualized embeddings and can be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification, with a genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological relevance of taxa identified by the model.
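
To make the idea of a contextualized embedding concrete, the following is a minimal sketch, not SetBERT's actual architecture or code. It assumes a toy k-mer frequency vector as the per-sequence embedding and a single self-attention step with identity projections and no positional encoding. The names `kmer_embed`, `set_self_attention`, and `contextualize` are hypothetical and chosen only for illustration. The sketch shows two properties that follow from treating a sample as an unordered set: reordering the sequences only reorders the outputs, and the same sequence receives a different embedding when it appears alongside different neighbors.

```python
import numpy as np


def kmer_embed(seq, k=3):
    """Toy per-sequence embedding: normalized k-mer counts.

    Hypothetical stand-in for a learned DNA sequence encoder.
    """
    index = {base: i for i, base in enumerate("ACGT")}
    vec = np.zeros(4 ** k)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in index for c in kmer):
            pos = 0
            for c in kmer:
                pos = pos * 4 + index[c]
            vec[pos] += 1.0
    total = vec.sum()
    return vec / total if total > 0 else vec


def set_self_attention(X):
    """One self-attention step with identity projections.

    No positional encoding is added, so the operation is
    permutation-equivariant over the rows (sequences) of X.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each output row mixes information from every sequence in the sample.
    return weights @ X


def contextualize(sample):
    """Embed each sequence, then contextualize it against the whole sample."""
    X = np.stack([kmer_embed(s) for s in sample])
    return set_self_attention(X)


if __name__ == "__main__":
    sample_a = ["ACGTACGTAC", "GGGTTTAAAC", "ACGTTTGACA"]
    sample_b = ["ACGTACGTAC", "CCCCCCGGGG", "TTTTTAAAAA"]
    out_a = contextualize(sample_a)
    out_b = contextualize(sample_b)
    # The first sequence is identical in both samples, but its
    # contextualized embedding depends on the other sequences present.
    print(np.allclose(out_a[0], out_b[0]))
```

A model that embeds each read independently would assign `ACGTACGTAC` the same vector in both samples; the set-level step is what lets the surrounding sequences influence the result.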

AVAILABILITY AND IMPLEMENTATION

All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin whose source code is available at https://github.com/DLii-Research/q2-deepdna.