Facebook AI Research, New York, NY 10003;
Department of Computer Science, New York University, New York, NY 10012.
Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2016239118.
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.