Assessing structure and disorder prediction tools for emerged proteins in the age of machine learning.

Affiliations

Institute for Evolution and Biodiversity, University of Muenster, Muenster, 48149, Germany.

Department Protein Evolution, Max Planck-Institute for Biology, Tuebingen, 72076, Germany.

Publication information

F1000Res. 2023 Mar 29;12:347. doi: 10.12688/f1000research.130443.1. eCollection 2023.

Abstract

De novo protein-coding genes emerge from scratch in the non-coding regions of the genome and have, by definition, no homology to other genes. Therefore, their encoded de novo proteins belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder and the limited number of structures result in low-confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability to de novo emerged proteins. Since AlphaFold2 is based on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure predictions, potentially making them more suitable for de novo proteins than AlphaFold2. We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (OmegaFold, ESMFold, RGN2) on the other, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as to the existing experimental evidence. Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from those of flDPnn, which was recently found to outperform most other predictors in a comparative assessment study. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins. We suggest that, while in some cases protein language model-based approaches might be more accurate than AlphaFold2, predicting the structure of de novo emerged proteins remains a difficult task for any predictor, be it of disorder or of structure.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3347/10126731/a78c8b55b78d/f1000research-12-143204-g0000.jpg
