Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length.

Affiliation

Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, 40126 Bologna, Italy.

Publication information

Molecules. 2024 Sep 29;29(19):4616. doi: 10.3390/molecules29194616.

Abstract

To shed light on the use of alignment procedures based on protein language models, we attempted to classify Glutathione S-transferases (GST; EC 2.5.1.18) and compared our results with the ARBA/UNI rule-based annotation in UniProt. GST is a protein superfamily involved in cellular detoxification from harmful xenobiotics and endobiotics, and it is widely distributed in prokaryotes and eukaryotes. Of particular interest, the superfamily is divided into different classes, comprising proteins from different taxa that can act in different cell locations (cytosolic, mitochondrial and microsomal compartments), with different folds and different levels of sequence identity with remote homologs. For this reason, annotating a GST as a member of a specific class is problematic: unless a structure is released, the protein can be classified only on the basis of sequence similarity, which excludes the annotation of remote homologs. Here, we adopt an embedding-based alignment to classify 15,061 GST proteins automatically annotated by the UniProt-ARBA/UNI rules. Embeddings are computed with the Meta ESM2-15b protein language model. The embedding-based alignment reaches a perfect-match rate above 99% with the UniProt automatic procedure. Data analysis indicates that 46% of the proteins automatically classified by UniProt do not conserve the typical length of canonical GSTs, whose structure is known. Therefore, 46% of the classified proteins do not conserve the structure of the template(s) required for their family classification. Our approach finds that 41% of the 64,207 GST UniProt proteins not yet assigned to any class can be classified consistently with the structural template length.
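As a rough illustration of the two ideas in the abstract (aligning proteins through their embeddings, and checking that a query is consistent in length with a structural template), the sketch below scores a global alignment over a cosine-similarity matrix of per-residue embeddings. It is not the authors' pipeline: the Needleman-Wunsch scoring, the function names, the gap penalty, and the 20% length tolerance are all assumptions chosen for the example, and the random arrays stand in for real ESM2 per-residue embeddings.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T


def embedding_alignment_score(emb_query, emb_template, gap_penalty=0.5):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    Residue-residue substitution scores are cosine similarities between
    per-residue embeddings. gap_penalty is a hypothetical value.
    """
    sim = cosine_similarity_matrix(emb_query, emb_template)
    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))
    dp[1:, 0] = -gap_penalty * np.arange(1, n + 1)
    dp[0, 1:] = -gap_penalty * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                dp[i - 1, j - 1] + sim[i - 1, j - 1],  # align residue i with j
                dp[i - 1, j] - gap_penalty,            # gap in the template
                dp[i, j - 1] - gap_penalty,            # gap in the query
            )
    return float(dp[n, m])


def length_consistent(query_len, template_len, tolerance=0.2):
    """True if the query length is within a relative tolerance of the template.

    The 20% default is an assumption for illustration only.
    """
    return abs(query_len - template_len) <= tolerance * template_len


def classify(emb_query, templates, gap_penalty=0.5, tolerance=0.2):
    """Return the best-scoring class among length-consistent templates, or None.

    templates maps a class name to the per-residue embedding of its template.
    """
    best_class, best_score = None, -np.inf
    for name, emb_template in templates.items():
        if not length_consistent(len(emb_query), len(emb_template), tolerance):
            continue
        score = embedding_alignment_score(emb_query, emb_template, gap_penalty)
        if score > best_score:
            best_class, best_score = name, score
    return best_class


# Toy data: random arrays standing in for real per-residue embeddings.
rng = np.random.default_rng(0)
template_a = rng.normal(size=(10, 8))
template_b = rng.normal(size=(30, 8))
query = template_a.copy()
predicted = classify(query, {"class_A": template_a, "class_B": template_b})
```

In this toy run the query is a copy of the first template, so it is assigned to `class_A`, while the second template is skipped because its length differs too much from the query.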

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7eb4/11478096/e302064ada4b/molecules-29-04616-g001.jpg
