FAIR, Meta AI, New York, NY, USA.
New York University, New York, NY, USA.
Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.