Univ Rennes, Inria, CNRS, IRISA, Rennes, France.
BMC Bioinformatics. 2021 Jun 10;22(1):317. doi: 10.1186/s12859-021-04222-4.
To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Because of non-local dependencies, aligning Potts models is a hard problem, and it remains the main computational bottleneck for their use.
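To make the two ingredients of a Potts model concrete, the following minimal sketch scores a sequence as the sum of per-position terms (positional composition) and pairwise terms (direct couplings). The function name `potts_score`, the dictionary layout, and the two-letter toy alphabet are assumptions made for illustration only; they are not taken from PPalign.

```python
from itertools import combinations

def potts_score(seq, fields, couplings):
    """Unnormalised log-probability of `seq` under a toy Potts model.

    fields[i][a]          : weight of letter a at position i
    couplings[(i, j)][a][b]: weight of letters (a, b) at positions i < j
    """
    # Positional composition: one independent term per position.
    single = sum(fields[i][a] for i, a in enumerate(seq))
    # Direct couplings: one term per pair of positions, local or not.
    pair = sum(
        couplings[(i, j)][seq[i]][seq[j]]
        for i, j in combinations(range(len(seq)), 2)
    )
    return single + pair

# Hypothetical two-position model over the alphabet {A, B}.
fields = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 0.5}]
couplings = {(0, 1): {"A": {"A": 0.0, "B": 2.0},
                      "B": {"A": 0.0, "B": 0.0}}}

score_ab = potts_score("AB", fields, couplings)  # 1.0 + 0.5 + 2.0
score_ba = potts_score("BA", fields, couplings)  # no term contributes
```

A profile-style model corresponds to dropping the `pair` term; the coupling term is what lets the score depend on co-occurring residues at distant positions.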
We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark which have the lowest sequence identity (between [Formula: see text] and [Formula: see text]) and allow reliable Potts models to be built for each sequence to be aligned. These experiments confirm that Potts models can be aligned in reasonable time ([Formula: see text] on average on these alignments). The contribution of couplings is evaluated by comparison with HHalign and with independent-site PPalign. Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean [Formula: see text] score and, in some cases, finds significantly better alignments than HHalign and PPalign without couplings.
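The role of couplings in the alignment objective can be illustrated on a toy instance. The sketch below is not the Integer Linear Programming formulation used by PPalign: it enumerates every order-preserving alignment exhaustively, which is only feasible for a handful of positions. The scalar-product similarity, the absence of gap penalties, and all names (`alignment_score`, `best_alignment`, the toy parameters) are assumptions made for illustration.

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alignment_score(pairs, va, vb, wa, wb):
    """Score a list of aligned position pairs (i in A, k in B)."""
    # Independent-site part: similarity of the aligned field vectors.
    s = sum(dot(va[i], vb[k]) for i, k in pairs)
    # Coupling part: every two aligned pairs contribute jointly,
    # which is what makes the objective non-local.
    for (i, k), (j, l) in combinations(pairs, 2):
        s += dot(wa[(i, j)], wb[(k, l)])
    return s

def best_alignment(va, vb, wa, wb):
    """Exhaustive search over order-preserving alignments (toy sizes only)."""
    best = (0.0, [])
    for size in range(1, min(len(va), len(vb)) + 1):
        for rows in combinations(range(len(va)), size):
            for cols in combinations(range(len(vb)), size):
                pairs = list(zip(rows, cols))
                score = alignment_score(pairs, va, vb, wa, wb)
                if score > best[0]:
                    best = (score, pairs)
    return best

# Hypothetical models: A has 2 positions, B has 3. Positions 1 and 2
# of B have identical fields, so fields alone cannot tell them apart.
va = [[1.0, 0.0], [0.0, 1.0]]
vb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
wa = {(0, 1): [1.0]}
wb = {(0, 1): [0.0], (0, 2): [2.0], (1, 2): [0.0]}

best_score, best_pairs = best_alignment(va, vb, wa, wb)
# Same search with all couplings of A set to zero (independent sites).
nocoupling_score, _ = best_alignment(va, vb, {(0, 1): [0.0]}, wb)
```

In this toy case the field terms score the alignments A1-B1 and A1-B2 equally, and only the coupling term favours aligning A's coupled pair with the coupled pair (0, 2) of B.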
These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments also suggest, however, that new research on the inference of Potts models is now needed to make them more comparable and better suited to homology search. We think that PPalign's guaranteed optimality will be a powerful asset for unbiased investigations in this direction.