OpenProteinSet: Training data for structural biology at scale.

Authors

Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed AlQuraishi

Affiliations

Harvard University.

Laboratory of Systems Pharmacology, Harvard Medical School.

Publication information

ArXiv. 2023 Aug 10:arXiv:2308.05326v1.

PMID:37608940

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10441447/

Abstract

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
