Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

Affiliations

Facebook AI Research, New York, NY 10003;

Department of Computer Science, New York University, New York, NY 10012.

Publication information

Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2016239118.

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
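The abstract's claim that structural information "can be identified by linear projections" can be illustrated with a linear probe: a single linear layer fitted on top of fixed representations. The sketch below is only an illustration under stated assumptions. It uses synthetic vectors as a hypothetical stand-in for per-residue representations, a made-up binary label (for example, helix versus non-helix), and plain logistic regression. It does not use the paper's model, data, or evaluation protocol, and all function names are invented for this example.

```python
import math
import random


def make_synthetic_embeddings(n, dim, seed=0):
    """Hypothetical stand-in for per-residue representations.

    Points with label 1 are shifted along a fixed direction, so the label
    is linearly recoverable by construction. Real representations would
    come from a trained language model, which is not reproduced here.
    """
    rng = random.Random(seed)
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, lr=0.1, epochs=50):
    """Fit weights w and bias b by full-batch gradient descent.

    Only the linear map is trained; the input vectors stay fixed.
    """
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of points whose sign of the linear projection matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


train_x, train_y = make_synthetic_embeddings(200, 8, seed=0)
test_x, test_y = make_synthetic_embeddings(100, 8, seed=1)
w, b = train_linear_probe(train_x, train_y)
print(f"held-out probe accuracy: {accuracy(w, b, test_x, test_y):.2f}")
```

Because the probe has no hidden layers, high held-out accuracy suggests the label is already linearly encoded in the input vectors rather than computed by the probe itself. That is the sense in which a linear projection can "identify" information in a representation.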

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41cc/8053943/837e62287634/pnas.2016239118fig01.jpg
