Hamamsy Tymor, Barot Meet, Morton James T, Steinegger Martin, Bonneau Richard, Cho Kyunghyun
Center for Data Science, New York University, New York, NY, USA.
Mythos Scientific, NJ, USA.
bioRxiv. 2023 Nov 26:2023.11.26.568742. doi: 10.1101/2023.11.26.568742.
The sequence-structure-function relationships that ultimately generate the diversity of extant observed proteins are complex, as proteins bridge the gap between multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that fewer than 1% of proteins have experimentally verified functions, so computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences and can provide the foundation for sensitive, sequence-structure-function-aware protein sequence search and annotation. Based on this model, we introduce Protein-Vec, a multi-aspect information retrieval system for proteins that covers sequence, structure, and function aspects and enables computational protein annotation and function prediction at tree-of-life scales.