Protein sequence profile prediction using ProtAlbert transformer.

Author information

Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran.

Publication information

Comput Biol Chem. 2022 Aug;99:107717. doi: 10.1016/j.compbiolchem.2022.107717. Epub 2022 Jun 26.

DOI:10.1016/j.compbiolchem.2022.107717

Abstract

Profiles are used to model protein families and domains. They are built from multiple sequence alignments: a query sequence is searched against a database, and a profile is generated from the aligned sequences using a substitution scoring matrix. Profile applications therefore depend heavily on the alignment algorithm and on the scoring system for amino acid substitution. Sometimes, however, the database contains no sequences that are similar to the query under the scoring scheme, and in these cases no profile can be built. This paper proposes a method named PA_SPP, based on the pre-trained ProtAlbert transformer, to predict the profile of a single protein sequence without alignment. Transformers perform impressively on natural languages, and because protein sequences can be viewed as a language, these models can be applied to them as well. We analyze the attention heads in different layers of ProtAlbert to show that the transformer can capture five essential protein characteristics of a single sequence. This assessment shows that ProtAlbert considers some protein properties when suggesting amino acids for each position in the sequence. In other words, transformers can be considered an appropriate alternative to alignment and scoring schemes for predicting a profile. We evaluate PA_SPP on the CASP13 dataset, which includes 55 proteins. In addition, one thermophilic and two mesophilic proteins are used as case studies. The results show high similarity between the predicted profiles and HSSP profiles.
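The idea of predicting a profile from a single sequence can be illustrated with a minimal sketch. It assumes, without confirmation from the abstract, that each position is masked in turn and that the model's scores for the 20 standard amino acids at that position form one profile row. The names `predict_profile`, `toy_predictor`, and `mean_cosine_similarity` are hypothetical, `toy_predictor` is a stand-in for a real masked language model such as ProtAlbert, and cosine similarity is only one possible way to compare a predicted profile with a reference profile; the abstract does not state which measure PA_SPP uses.

```python
import math
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "[MASK]"  # placeholder token; a real tokenizer defines its own

# Hypothetical predictor signature: (tokens, masked_position) -> score per residue.
Predictor = Callable[[List[str], int], Dict[str, float]]


def predict_profile(sequence: str, predictor: Predictor) -> List[List[float]]:
    """Mask each position in turn and collect a normalized 20-column row."""
    profile = []
    for i in range(len(sequence)):
        tokens = list(sequence)
        tokens[i] = MASK
        scores = predictor(tokens, i)
        total = sum(scores.get(aa, 0.0) for aa in AMINO_ACIDS)
        if total <= 0:
            raise ValueError(f"predictor returned no positive scores at position {i}")
        profile.append([scores.get(aa, 0.0) / total for aa in AMINO_ACIDS])
    return profile


def toy_predictor(tokens: List[str], position: int) -> Dict[str, float]:
    """Stand-in for a masked language model (assumption, not ProtAlbert):
    uniform scores, with extra weight on the residue left of the mask."""
    scores = {aa: 1.0 for aa in AMINO_ACIDS}
    if position > 0 and tokens[position - 1] in scores:
        scores[tokens[position - 1]] += 4.0
    return scores


def mean_cosine_similarity(a: List[List[float]], b: List[List[float]]) -> float:
    """Average per-position cosine similarity between two equal-length profiles."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    sims = []
    for row_a, row_b in zip(a, b):
        dot = sum(x * y for x, y in zip(row_a, row_b))
        norm = math.sqrt(sum(x * x for x in row_a)) * math.sqrt(sum(y * y for y in row_b))
        sims.append(dot / norm if norm > 0 else 0.0)
    return sum(sims) / len(sims)


if __name__ == "__main__":
    predicted = predict_profile("MKVL", toy_predictor)
    print(len(predicted), "positions x", len(predicted[0]), "amino acids")
```

In practice the stand-in predictor would be replaced by a call to a pre-trained model, and the reference rows would come from an alignment-derived profile such as HSSP.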
