Leveraging a large language model to predict protein phase transition: A physical, multiscale, and interpretable approach.

Author information

Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520.

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06510.

Publication information

Proc Natl Acad Sci U S A. 2024 Aug 13;121(33):e2320510121. doi: 10.1073/pnas.2320510121. Epub 2024 Aug 7.

Abstract

Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune an LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8350/11331094/ebd1f45449c7/pnas.2320510121fig01.jpg
