Highly accurate classification and discovery of microbial protein-coding gene functions using FunGeneTyper: an extensible deep learning framework.

Affiliations

College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.

Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China.

Publication information

Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae319.

Abstract

High-throughput DNA sequencing technologies decode tremendous amounts of microbial protein-coding gene sequences. However, accurately assigning protein functions to novel gene sequences remains a challenge. To this end, we developed FunGeneTyper, an extensible framework with two new deep learning models (i.e., FunTrans and FunRep), structured databases, and supporting resources for achieving highly accurate (Accuracy > 0.99, F1-score > 0.97) and fine-grained classification of antibiotic resistance genes (ARGs) and virulence factor genes. Using an experimentally confirmed dataset of ARGs comprising remote homologous sequences as the test set, our framework achieves by far the best performance in the discovery of new ARGs from human gut (F1-score: 0.6948), wastewater (0.6072), and soil (0.5445) microbiomes, outperforming state-of-the-art bioinformatics tools as well as sequence alignment-based (F1-score: 0.0556-0.5065) and domain-based (F1-score: 0.2630-0.5224) annotation approaches. Furthermore, our framework is implemented as a lightweight, privacy-preserving, and plug-and-play neural network module, which makes it versatile and accessible to developers and users worldwide. We anticipate widespread use of FunGeneTyper (https://github.com/emblab-westlake/FunGeneTyper) for precise classification of protein-coding gene functions and the discovery of numerous valuable enzymes. This advancement will have a significant impact on various fields, including microbiome research, biotechnology, metagenomics, and bioinformatics.
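For readers less familiar with the metrics quoted above, the following is a minimal sketch of how accuracy and an F1-score can be computed for a multi-class gene classifier. It is not taken from FunGeneTyper: the function names, the class labels, and the toy predictions are hypothetical, and macro-averaging across classes is an assumption, since the abstract does not state which averaging scheme was used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 = 2*TP / (2*TP + FP + FN) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy data: reference and predicted classes for six sequences.
y_true = ["beta-lactam", "tetracycline", "beta-lactam",
          "non-ARG", "tetracycline", "non-ARG"]
y_pred = ["beta-lactam", "tetracycline", "non-ARG",
          "non-ARG", "tetracycline", "beta-lactam"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Accuracy can stay high when most sequences belong to one dominant class, whereas F1 penalizes both missed genes and false calls within each class, which is likely why the abstract reports the two side by side.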

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/edc3/11247404/906a6a04e94b/bbae319ga1.jpg
