Application of Protein Structure Encodings and Sequence Embeddings for Transporter Substrate Prediction.

Authors

Andreas Denger, Volkhard Helms

Affiliation

Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany.

Publication details

Molecules. 2025 Aug 1;30(15):3226. doi: 10.3390/molecules30153226.

DOI:10.3390/molecules30153226

PMID:40807401

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12348419/

Abstract

Membrane transporters play a crucial role in any cell. Identifying the substrates they translocate across membranes is important for many fields of research, such as metabolomics, pharmacology, and biotechnology. In this study, we leverage recent advances in deep learning, such as amino acid sequence embeddings with protein language models (pLMs), highly accurate 3D structure predictions with AlphaFold 2, and structure-encoding 3Di sequences from FoldSeek, for predicting substrates of membrane transporters. We test new deep learning features derived from both sequence and structure, and compare them to the previously best-performing protein encodings, which were made up of amino acid k-mer frequencies and evolutionary information from PSSMs. Furthermore, we compare the performance of these features either using a previously developed SVM model, or with a regularized feedforward neural network (FNN). When evaluating these models on sugar and amino acid carriers in , as well as on three types of ion channels in human, we found that both the DL-based features and the FNN model led to a better and more consistent classification performance compared to previous methods. Direct encodings of 3D structures with Foldseek, as well as structural embeddings with ProstT5, matched the performance of state-of-the-art amino acid sequence embeddings calculated with the ProtT5-XL model when used as input for the FNN classifier.
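The baseline encodings mentioned in the abstract include amino acid k-mer frequencies. As a rough illustration of what such a feature vector looks like, the sketch below counts overlapping k-mers over the 20 standard residues and normalizes them to relative frequencies. The function name `kmer_frequencies`, the default `k = 2`, the handling of non-standard residues, and the example sequence are all assumptions made for this illustration; the abstract does not state the exact k or normalization the authors used.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence: str, k: int = 2) -> list[float]:
    """Return relative frequencies of all 20**k standard amino acid k-mers.

    Illustrative helper only (hypothetical name and defaults).
    k-mers containing non-standard residues (e.g. X, U) are ignored.
    """
    sequence = sequence.upper()
    # Fixed vocabulary so every protein maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalize by the number of k-mers made up of standard residues only.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Made-up example sequence, not taken from the study.
features = kmer_frequencies("MKTAYIAKQR", k=2)
```

With `k = 2` this yields a 400-dimensional vector per protein. A fixed-length vector of this kind can be passed directly to a classifier such as an SVM or a feedforward network, which is the role these encodings play in the comparison described above.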
