OpenProteinSet: Training data for structural biology at scale.

Authors

Ahdritz Gustaf, Bouatta Nazim, Kadyan Sachin, Jarosch Lukas, Berenberg Daniel, Fisk Ian, Watkins Andrew M, Ra Stephen, Bonneau Richard, AlQuraishi Mohammed

Affiliations

Harvard University.

Laboratory of Systems Pharmacology, Harvard Medical School.

Publication

ArXiv. 2023 Aug 10:arXiv:2308.05326v1.

Abstract

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
