Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA.
School of Computing, University of Georgia, 30602, Georgia, USA.
Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac599.
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application to estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
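The general idea of scoring conservation from per-residue embeddings can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); it only assumes that each residue already has an embedding vector, as a protein language model such as ESM2 would supply, and that some regression head maps each vector to a score between 0 and 1. The function names, the toy embeddings, and the weights are all hypothetical placeholders chosen for illustration.

```python
from typing import List, Sequence, Tuple


def conservation_scores(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Apply a linear head to each residue embedding and clip to [0, 1].

    Hypothetical helper: real weights would have to be fitted against
    reference conservation values; here they are supplied by hand.
    """
    scores = []
    for vec in embeddings:
        if len(vec) != len(weights):
            raise ValueError("embedding and weight dimensions differ")
        raw = sum(x * w for x, w in zip(vec, weights)) + bias
        scores.append(min(1.0, max(0.0, raw)))
    return scores


def top_sites(
    sequence: str, scores: Sequence[float], threshold: float = 0.8
) -> List[Tuple[int, str, float]]:
    """Return (1-based position, residue, score) for residues at or above threshold."""
    return [
        (i + 1, aa, s)
        for i, (aa, s) in enumerate(zip(sequence, scores))
        if s >= threshold
    ]


# Toy inputs (assumptions): a 5-residue sequence with made-up 3-dimensional
# embeddings. Real ESM2 embeddings have hundreds to thousands of dimensions.
toy_sequence = "GKSTA"
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.2],
    [0.2, 0.5, 0.1],
    [0.1, 0.1, 0.9],
    [0.0, 0.3, 0.2],
]
toy_weights = [1.0, 0.0, -0.5]  # hand-picked, not learned

toy_scores = conservation_scores(toy_embeddings, toy_weights)
candidates = top_sites(toy_sequence, toy_scores)
```

Because every residue is scored from its own embedding and no multiple sequence alignment is built, a full-length multidomain sequence can be passed through in one run, and candidate functional sites are simply the positions whose score clears a chosen threshold.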