SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings

Affiliations

Delft Bioinformatics Lab, Delft University of Technology Van Mourik, Delft XE 2628, The Netherlands.

Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Publication information

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae328.

DOI:10.1093/bioinformatics/btae328

PMID:38775729

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11147799/

Abstract

MOTIVATION

Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins that replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications to bacterial genomes.

RESULTS

To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, some species of which are major clinical threats, we identified 11 previously unrecognized putative toxins, with potential significance to human and animal health.
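The general idea of combining protein embeddings with genomic context can be illustrated with a minimal sketch. This is not SAFPred's implementation (see the repository listed under Availability and Implementation for that); it assumes toy three-dimensional vectors in place of real language-model embeddings, a simple average over neighboring genes as the context representation, and nearest-neighbor label transfer by cosine similarity. All function names, labels, and values below are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def context_vector(embeddings, index, window=1):
    """Average the embeddings of a gene and its neighbors on the contig.

    `window` is the number of genes taken on each side (an assumption made
    for illustration; real tools may define the genomic context differently).
    """
    start = max(0, index - window)
    stop = min(len(embeddings), index + window + 1)
    return np.mean(embeddings[start:stop], axis=0)


def transfer_label(query_vec, reference_vecs, reference_labels):
    """Return the label of the most similar reference vector and its score."""
    scores = [cosine_similarity(query_vec, ref) for ref in reference_vecs]
    best = int(np.argmax(scores))
    return reference_labels[best], scores[best]


# Toy contig: three consecutive genes, one embedding per row (made-up values).
contig = np.array([
    [0.9, 0.1, 0.0],
    [0.3, 0.2, 0.6],  # query gene: its own embedding is weakly informative
    [1.0, 0.0, 0.1],
])

# Toy reference set of annotated vectors (hypothetical labels).
reference_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
reference_labels = ["transport", "toxin"]

# Prediction from the gene's own embedding only.
gene_only_label, gene_only_score = transfer_label(
    contig[1], reference_vecs, reference_labels
)

# Prediction from the gene together with its immediate neighbors.
context_label, context_score = transfer_label(
    context_vector(contig, index=1, window=1), reference_vecs, reference_labels
)

print(gene_only_label, round(gene_only_score, 3))
print(context_label, round(context_score, 3))
```

In this toy case both queries return the same label, but the context-averaged vector matches its reference more closely than the single-gene vector does, which is the intuition behind using conserved gene neighborhoods as additional evidence.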

AVAILABILITY AND IMPLEMENTATION

https://github.com/AbeelLab/safpred.
