Ahdritz Gustaf, Bouatta Nazim, Kadyan Sachin, Jarosch Lukas, Berenberg Daniel, Fisk Ian, Watkins Andrew M, Ra Stephen, Bonneau Richard, AlQuraishi Mohammed
Harvard University.
Laboratory of Systems Pharmacology, Harvard Medical School.
ArXiv. 2023 Aug 10:arXiv:2308.05326v1.
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.