Assessing structure and disorder prediction tools for emerged proteins in the age of machine learning.

Affiliations

Institute for Evolution and Biodiversity, University of Muenster, Muenster, 48149, Germany.

Department of Protein Evolution, Max Planck Institute for Biology, Tuebingen, 72076, Germany.

Publication information

F1000Res. 2023 Mar 29;12:347. doi: 10.12688/f1000research.130443.1. eCollection 2023.

DOI:10.12688/f1000research.130443.1

PMID:37113259

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10126731/

Abstract

De novo protein coding genes emerge from scratch in the non-coding regions of the genome and have, by definition, no homology to other genes. Therefore, their encoded proteins belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder and limited structures result in low-confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability for de novo emerged proteins. Since AlphaFold2 is based on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure predictions, potentially making them more suitable for de novo proteins than AlphaFold2. We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (OmegaFold, ESMFold, RGN2) on the other hand, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as to the existing experimental evidence. Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from those of flDPnn, which was found to outperform most other predictors in a recent comparative assessment study. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins. We suggest that, while in some cases protein language model based approaches might be more accurate than AlphaFold2, the structure prediction of de novo emerged proteins remains a difficult task for any predictor, be it disorder or structure.
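The comparison between disorder predictors described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the 0.5 cutoff, and the scores are assumptions made up for illustration, and real predictors may recommend different thresholds. It only shows one simple way to turn per-residue scores into binary calls and measure how often two predictors agree.

```python
from typing import Sequence


def disorder_calls(scores: Sequence[float], threshold: float) -> list[bool]:
    """Turn per-residue disorder scores into binary calls (True = disordered)."""
    return [s >= threshold for s in scores]


def agreement(calls_a: Sequence[bool], calls_b: Sequence[bool]) -> float:
    """Fraction of residues on which two predictors make the same call."""
    if len(calls_a) != len(calls_b):
        raise ValueError("predictions must cover the same residues")
    if not calls_a:
        raise ValueError("empty prediction")
    same = sum(a == b for a, b in zip(calls_a, calls_b))
    return same / len(calls_a)


# Made-up scores for a hypothetical 8-residue peptide (not data from the study).
predictor_a = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.9]
predictor_b = [0.7, 0.6, 0.2, 0.1, 0.1, 0.4, 0.5, 0.8]

# The 0.5 cutoff is an assumption; each tool may define its own threshold.
calls_a = disorder_calls(predictor_a, threshold=0.5)
calls_b = disorder_calls(predictor_b, threshold=0.5)

print(f"fraction disordered (A): {sum(calls_a) / len(calls_a):.2f}")
print(f"fraction disordered (B): {sum(calls_b) / len(calls_b):.2f}")
print(f"per-residue agreement:   {agreement(calls_a, calls_b):.2f}")
```

Passing the threshold as a parameter keeps the sketch honest about the abstract's point that results depend heavily on parameter choice: changing the cutoff for one predictor changes both its disorder fraction and its agreement with the other.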

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3347/10126731/a78c8b55b78d/f1000research-12-143204-g0000.jpg
