Expert-guided protein language models enable accurate and blazingly fast fitness prediction.

Affiliations

School of Computation, Information, and Technology, Department of Informatics, Bioinformatics and Computational Biology, Technical University of Munich, Garching/Munich 85748, Germany.

Laboratory of Computational and Quantitative Biology, UMR 7238, Sorbonne Université, CNRS, IBPS, Paris 75005, France.

Publication information

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae621.

Abstract

MOTIVATION

Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein language model (pLM) embeddings as input to a minimal deep learning model.
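As a rough illustration of the data flow described above (per-residue pLM embeddings fed into a small network that scores substitutions), the following sketch may help. It is not VespaG's actual architecture: the layer sizes, the ReLU activation, the 20-output layout, the untrained random weights, and all function names are assumptions made for illustration, and real pLM embeddings have far more dimensions than the toy value used here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one dense layer (untrained, illustrative)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, vector):
    """Apply one dense layer: weights @ vector + biases."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def predict_residue(embedding, hidden_layer, output_layer):
    """Map one residue embedding to 20 scores, one per amino acid (assumed layout)."""
    hidden = [max(0.0, h) for h in dense(hidden_layer, embedding)]  # ReLU
    return dense(output_layer, hidden)


def predict_landscape(embeddings, sequence, hidden_layer, output_layer):
    """Return {"M1A": score, ...} for the 19 substitutions at every position."""
    scores = {}
    for position, (wild_type, embedding) in enumerate(zip(sequence, embeddings), start=1):
        per_aa = predict_residue(embedding, hidden_layer, output_layer)
        for mutant, score in zip(AMINO_ACIDS, per_aa):
            if mutant != wild_type:
                scores[f"{wild_type}{position}{mutant}"] = score
    return scores


rng = random.Random(0)
embed_dim, hidden_dim = 8, 4  # toy sizes, chosen only to keep the example small
sequence = "MKV"
# Stand-in for pLM embeddings: one random vector per residue.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in sequence]
hidden_layer = init_layer(rng, embed_dim, hidden_dim)
output_layer = init_layer(rng, hidden_dim, len(AMINO_ACIDS))
landscape = predict_landscape(embeddings, sequence, hidden_layer, output_layer)
print(len(landscape), "substitution scores")
```

Because each residue is scored with a single forward pass over an embedding that is computed once per protein, the cost per variant stays small in a design of this kind.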

RESULTS

To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVE, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 ± 0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 min on a consumer laptop (12-core CPU, 16 GB RAM).
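The benchmark figure above is a mean of per-assay Spearman correlations. A minimal sketch of that kind of summary, using only the Python standard library, might look as follows. The toy assays, the function names, and the choice of the standard error as the "±" term are assumptions; the abstract does not state what the ± 0.02 denotes.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def benchmark_summary(assays):
    """assays maps assay name -> (predicted scores, measured scores)."""
    rhos = [spearman(pred, meas) for pred, meas in assays.values()]
    standard_error = stdev(rhos) / sqrt(len(rhos))  # assumed meaning of "±"
    return mean(rhos), standard_error


# Hypothetical toy data standing in for two assays.
toy_assays = {
    "assay_A": ([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]),
    "assay_B": ([0.9, 0.2, 0.5, 0.7], [4.0, 1.0, 2.0, 3.0]),
}
mean_rho, standard_error = benchmark_summary(toy_assays)
print(f"mean Spearman = {mean_rho:.2f} ± {standard_error:.2f}")
```

Averaging per assay rather than pooling all variants keeps assays with many variants from dominating the summary, and rank correlation avoids assuming that predicted scores and measured fitness share a scale.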

AVAILABILITY AND IMPLEMENTATION

VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.
