
PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.

Affiliations

Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China.

Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China.

Publication information

Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae404.

Abstract

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. As sequencing and gene synthesis costs fall, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly used to develop novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, yet accurate prediction of solubility remains challenging. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract more of the intrinsic information in protein sequences. Leveraging these models, along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank, holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 6.4% increase in accuracy, a 9.0% increase in F1_score, and an 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed good performance for PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
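The three metrics named in the abstract (accuracy, F1 score, and Matthews correlation coefficient) can all be derived from the confusion counts of a binary soluble/insoluble classifier. The following is a minimal sketch of their standard definitions, not the authors' evaluation code; the function name `binary_metrics`, the label convention (1 = soluble, 0 = insoluble), and the toy labels are assumptions made for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, F1 and MCC for binary labels.

    Assumed convention (illustrative only): 1 = soluble, 0 = insoluble.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four counts; it is set to 0 here when a margin is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy labels, invented for this example (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Unlike accuracy, the Matthews correlation coefficient accounts for all four confusion counts, which is why it is often reported alongside accuracy and F1 when the soluble and insoluble classes are imbalanced.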
