
SetBERT: a deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing.

Author Information

Ludwig David W, Guptil Christopher, Alexander Nicholas R, Zhalnina Kateryna, Wipf Edi M-L, Khasanova Albina, Barber Nicholas A, Swingley Wesley, Walker Donald M, Phillips Joshua L

Author Affiliations

Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN 37132, United States.

Department of Mathematics and Computer Science, Miami University, Oxford, OH 45056, United States.

Publication Information

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf370.

Abstract

MOTIVATION

High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss key interactions between microorganisms, significantly hindering our understanding of how these interactions influence microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which can inadvertently introduce protocol-specific bias.

RESULTS

To address these concerns, we present SetBERT, a robust pre-training methodology for building generalized deep learning models that process HTS data to produce contextualized embeddings and can be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification, achieving a genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT can autonomously and accurately explain its predictions, as confirmed by the biological relevance of the taxa the model identifies.
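The core idea, per-read embeddings that are contextualized by every other read in the same sample, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the k-mer tokenization, mean pooling, and all hyperparameters are assumptions chosen only to show how self-attention across a set of reads yields sample-aware embeddings.

    # Minimal illustrative sketch (PyTorch), NOT the SetBERT implementation.
    # Assumed design: each read is embedded independently, then a transformer
    # encoder attends across the whole set so every read's embedding reflects
    # the other reads in the sample.
    from itertools import product
    import torch
    import torch.nn as nn

    KMER = 3
    VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=KMER))}

    def tokenize(read):
        # Overlapping k-mers mapped to integer ids (an assumed tokenization).
        return torch.tensor([VOCAB[read[i:i + KMER]]
                             for i in range(len(read) - KMER + 1)])

    class SetContextualizer(nn.Module):
        def __init__(self, dim=64, heads=4, layers=2):
            super().__init__()
            self.embed = nn.Embedding(len(VOCAB), dim)
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
            self.set_encoder = nn.TransformerEncoder(layer, layers)

        def forward(self, reads):
            # Per-read vectors: mean-pooled k-mer embeddings stand in for a
            # per-sequence DNA encoder.
            per_read = torch.stack([self.embed(tokenize(r)).mean(dim=0) for r in reads])
            # Self-attention across the set contextualizes each read's embedding.
            return self.set_encoder(per_read.unsqueeze(0)).squeeze(0)

    sample = ["ACGTACGTAC", "TTGACCTGAA", "GGGTACCATG"]  # toy reads from one sample
    print(SetContextualizer()(sample).shape)             # -> torch.Size([3, 64])

In this toy setup, each output row depends on the other reads in the sample through the attention layers, which is the property that distinguishes set-level modeling from processing DNA sequences individually.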

AVAILABILITY AND IMPLEMENTATION

All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin whose source code is available at https://github.com/DLii-Research/q2-deepdna.
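For users working in QIIME 2, the plugin can also be driven from the Python (Artifact) API once q2-deepdna is installed. The sketch below is only a guess at that interface: the plugin module name ("deepdna") and the action and output names are assumptions inferred from the package name, so consult the q2-deepdna documentation for the actual commands.

    # Hedged sketch of calling the plugin from QIIME 2's Artifact API.
    # The module name "deepdna" and the action/output names are ASSUMPTIONS,
    # not confirmed by the source; check the q2-deepdna docs for the real ones.
    import qiime2
    from qiime2.plugins import deepdna  # assumed plugin module name

    reads = qiime2.Artifact.load("demultiplexed-reads.qza")       # existing sample reads
    results = deepdna.actions.classify_taxonomy(sequences=reads)  # hypothetical action
    results.classification.save("setbert-taxonomy.qza")           # hypothetical output name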

