T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model.

Affiliations

State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences & Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China.

State Key Laboratory of Pathogens and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing 100071, China.

Publication information

Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab420.

Abstract

Bacterial type IV secretion systems (T4SSs) are versatile membrane-spanning apparatuses that mediate both genetic exchange and the delivery of effector proteins to target eukaryotic cells. The secreted effectors (T4SEs) can affect gene expression and signal transduction in host cells. As such, they often function as virulence factors and play an important role in bacterial pathogenesis. Existing T4SE prediction tools use various machine learning algorithms, but their accuracy and speed leave room for improvement. In this study, we apply a sequence embedding strategy from a pre-trained language model of protein sequences (TAPE) to the T4SE classification task. The training dataset is mainly derived from our updated type IV secretion system database, SecReT4, which includes newly experimentally verified T4SEs. After a comprehensive performance comparison of several candidate models, we developed an online web server, T4SEfinder, that uses TAPE and a multi-layer perceptron (MLP) for T4SE prediction and achieves slightly higher accuracy than existing prediction tools. T4SEfinder takes only about 3 minutes to classify 5000 protein sequences, so it is fast enough for whole-genome-scale T4SE detection in pathogenic bacteria. T4SEfinder may help meet the increasing demand for re-annotating secretion systems and effector proteins in sequenced bacterial genomes. T4SEfinder is freely accessible at https://tool2-mml.sjtu.edu.cn/T4SEfinder_TAPE/.
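
The embed-then-classify idea in the abstract can be illustrated with a minimal sketch. It is not the T4SEfinder implementation. The embedding here is a 20-dimensional amino-acid composition vector standing in for a TAPE embedding, the layer sizes are arbitrary, the weights are random rather than trained, and all function names are hypothetical. It only shows the shape of the pipeline: a protein sequence becomes a fixed-length vector, which an MLP maps to a score between 0 and 1.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(sequence):
    """Placeholder embedding: amino-acid composition (an assumption, NOT TAPE)."""
    seq = sequence.upper()
    # Count only the 20 standard residues; avoid division by zero for empty input.
    total = sum(seq.count(a) for a in AMINO_ACIDS) or 1
    return [seq.count(a) / total for a in AMINO_ACIDS]

def init_layer(n_in, n_out, rng):
    """Random weights for illustration; a real model would load trained ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_t4se_probability(sequence, hidden, output):
    """Embed the sequence, apply one hidden layer, and squash to (0, 1)."""
    h = relu(dense(embed(sequence), hidden))
    return sigmoid(dense(h, output)[0])

rng = random.Random(0)
hidden_layer = init_layer(len(AMINO_ACIDS), 8, rng)  # 8 hidden units: arbitrary
output_layer = init_layer(8, 1, rng)

# Made-up example sequence; the score is meaningless because nothing is trained.
score = predict_t4se_probability("MKKLLPTAAAGLLLLAAQPAMA", hidden_layer, output_layer)
print(f"untrained score: {score:.3f}")
```

In a real setting, the placeholder `embed` would be replaced by embeddings from the pre-trained language model, and the MLP weights would be fitted on labeled T4SE and non-T4SE sequences.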
