Heinzinger Michael, Littmann Maria, Sillitoe Ian, Bordin Nicola, Orengo Christine, Rost Burkhard
TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany.
Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.
NAR Genom Bioinform. 2022 Jun 11;4(2):lqac043. doi: 10.1093/nargab/lqac043. eCollection 2022 Jun.
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by the hierarchical classification of protein 3D structures defined by the CATH resource. The approach recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques, which achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
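The two ideas named in the abstract, embedding-based annotation transfer and contrastive learning, can be illustrated with a minimal sketch. It is not the code from the Rostlab/EAT repository. It assumes Euclidean distance for the nearest-neighbour lookup and a triplet margin loss as the contrastive objective, neither of which the abstract specifies. The function names, the 2-dimensional toy embeddings and the CATH-style labels are hypothetical and chosen only for illustration.

```python
import numpy as np


def eat_lookup(query_embs, lookup_embs, lookup_labels):
    """Assign each query the label of its nearest lookup embedding.

    Assumption: Euclidean distance in embedding space.
    """
    # Pairwise distances via broadcasting, shape (n_query, n_lookup).
    diffs = query_embs[:, None, :] - lookup_embs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    nearest = dists.argmin(axis=1)
    labels = [lookup_labels[i] for i in nearest]
    # Return the distance to the hit too, so callers can apply a cutoff.
    return labels, dists[np.arange(len(query_embs)), nearest]


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss that pulls positives closer to the anchor than negatives.

    Assumption: one common contrastive objective, used here for illustration.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())


# Toy lookup set: three annotated proteins (hypothetical embeddings and labels).
lookup_embs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
lookup_labels = ["3.40.50.300", "1.10.10.10", "2.60.40.10"]

# Two unannotated queries.
query_embs = np.array([[0.1, 0.0], [4.0, 5.0]])
labels, hit_dists = eat_lookup(query_embs, lookup_embs, lookup_labels)

# A triplet that already satisfies the margin, and one that violates it.
satisfied = triplet_margin_loss(
    np.array([[0.0, 0.0]]), np.array([[0.0, 1.0]]), np.array([[3.0, 4.0]])
)
violated = triplet_margin_loss(
    np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]]), np.array([[0.0, 1.0]])
)
```

Returning the distance alongside the transferred label lets a caller reject hits that are too far away, which matters when a query has no close relative in the lookup set. In a hierarchical setting such as CATH, positives and negatives for the loss would presumably be sampled according to the classification level they share with the anchor, but that sampling is not sketched here.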