Institut Teknologi Bandung, School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl. Teuku Nyak Arief, Jakarta Selatan, DKI Jakarta, Indonesia.
Institut Teknologi Bandung, School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya, Bukit Beruntung, Sei Panas, Batam, Kepulauan Riau 29456, Indonesia.
Comput Biol Chem. 2024 Oct;112:108163. doi: 10.1016/j.compbiolchem.2024.108163. Epub 2024 Jul 26.
The growing demand for eco-friendly technologies in biotechnology calls for effective and sustainable catalysts. Acidophilic proteins, which function optimally in highly acidic environments, hold promise for applications including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses that gap with in silico methods based on machine learning. We propose a novel approach that predicts acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs for understanding and harnessing acidophilic proteins in scientific and industrial settings. We introduce the ACE model, which combines a simple Logistic Regression classifier with embeddings derived from protein sequences by the ProtT5 PLM. On an independent test set, the model achieves an accuracy of 0.91, an F1-score of 0.93, and a Matthews correlation coefficient of 0.76. To our knowledge, this is the first application of pre-trained PLM embeddings to acidophilic protein classification. The ACE model serves as a tool for exploring protein acidophilicity and supports future work in protein design and engineering.
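The abstract reports three metrics for the binary acidophilic versus non-acidophilic task: accuracy, F1-score, and Matthews correlation coefficient (MCC). As a reminder of how these follow from confusion-matrix counts, here is a minimal sketch in plain Python. The function name `binary_metrics` and the toy labels are hypothetical illustrations; they are not the paper's code or data, and the printed values do not reproduce the reported results.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (accuracy, f1, mcc) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)

    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells, so it stays informative under class imbalance.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return accuracy, f1, mcc


# Hypothetical toy labels (1 = acidophilic, 0 = non-acidophilic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

accuracy, f1, mcc = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f}  F1={f1:.2f}  MCC={mcc:.2f}")
```

Because F1 ignores true negatives while MCC accounts for every cell of the confusion matrix, the two can diverge on the same predictions, which is why reporting both alongside accuracy gives a fuller picture of classifier quality.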