Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada.
Centre for Structural and Functional Genomics, Concordia University, Montréal, Québec, Canada.
Proteins. 2024 Aug;92(8):998-1055. doi: 10.1002/prot.26694. Epub 2024 Apr 24.
This study introduces TooT-PLM-ionCT, a framework that consolidates three distinct systems, each tailored to one of the following tasks: distinguishing ion channels (ICs) from membrane proteins (MPs), separating ion transporters (ITs) from MPs, and differentiating ICs from ITs. Drawing on six Protein Language Models (PLMs), including ProtBERT, ProtBERT-BFD, ESM-1b, ESM-2 (650M parameters), and ESM-2 (15B parameters), TooT-PLM-ionCT combines traditional classifiers with deep learning models for protein classification. When first validated on an existing dataset from previous researchers, our systems demonstrated superior performance in identifying ITs from MPs and distinguishing ICs from ITs, with the IC-MP discrimination achieving state-of-the-art results. Following recommendations for additional validation, we introduced a new dataset, which significantly enhanced the robustness and generalization of our models. Evaluation on this new dataset showed that TooT-PLM-ionCT adapts to novel data while maintaining high classification accuracy. Furthermore, this study explores critical factors affecting classification accuracy: dataset balancing, the use of frozen versus fine-tuned PLM representations, and the difference between half and full precision in floating-point computations. To facilitate broader application and accessibility, a web server (https://tootsuite.encs.concordia.ca/service/TooT-PLM-ionCT) has been developed, allowing users to evaluate unknown protein sequences with our specialized systems for the IC-MP, IT-MP, and IC-IT classification tasks.
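One of the factors named in the abstract, half versus full floating-point precision, can be illustrated with a minimal sketch. It assumes that "half" means IEEE 754 binary16 and "full" means binary32. The toy embedding and weight vectors, and the helper names `to_half`, `to_single`, and `dot`, are hypothetical and are not taken from the study; the sketch only shows how rounding error builds up in a dot product and does not reproduce any TooT-PLM-ionCT model.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, cast):
    """Dot product in which every input, product, and partial sum is rounded by `cast`."""
    total = cast(0.0)
    for u, v in zip(a, b):
        total = cast(total + cast(cast(u) * cast(v)))
    return total


# Hypothetical toy values standing in for a PLM embedding and a linear
# classifier's weights; they are NOT data from the paper.
embedding = [0.1 * (i % 7 - 3) for i in range(64)]
weights = [0.01 * (i % 5 - 2) for i in range(64)]

score_half = dot(embedding, weights, to_half)
score_single = dot(embedding, weights, to_single)
# Python floats are binary64, used here as the reference value.
score_ref = sum(u * v for u, v in zip(embedding, weights))

half_error = abs(score_half - score_ref)
single_error = abs(score_single - score_ref)

print(f"half precision error:   {half_error:.3e}")
print(f"single precision error: {single_error:.3e}")
```

In this toy case the half-precision score deviates from the reference by more than the single-precision score does. Whether such deviations change a predicted class depends on how close a score lies to the decision threshold, which this sketch does not model.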