Basic Science Institute, Changwon National University, Changwon 51140, Korea.
Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea.
Molecules. 2022 Jun 9;27(12):3711. doi: 10.3390/molecules27123711.
Sequence-structure alignment is an important step in the template-based modeling of protein 3D structures. Building a reliable sequence-structure alignment is challenging, especially for remote homologue target proteins. We built a sequence-structure alignment method called CRFalign, which improves upon a base alignment model built on HMM-HMM comparison by employing pairwise conditional random fields in combination with nonlinear scoring functions of structural and sequence features. The nonlinear scoring part is implemented by a set of gradient boosted regression trees. In addition to sequence profile features, various position-dependent structural features are employed, including secondary structures and solvent accessibilities. Training is performed on reference alignments at the superfamily or twilight-zone level chosen from the SABmark benchmark set. We found that CRFalign yields a relative improvement in average alignment accuracy on validation sets from the SABmark benchmark. We also tested CRFalign on 51 sequence-structure pairs involving 15 FM target domains of CASP14, where it improved the average modeling accuracy on these hard targets (TM-CRFalign ≃ 42.94%) compared with HHalign (TM-HHalign ≃ 39.05%) and MRFalign (TM-MRFalign ≃ 36.93%). CRFalign was incorporated into our template search framework, CRFpred, and tested on a random set of 300 target proteins spanning Easy, Medium and Hard sets, where it showed reasonable template search performance.
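To make the idea of position-dependent scoring concrete, the following is a minimal sketch and not the CRFalign implementation. It assumes a plain global dynamic-programming alignment with a linear gap penalty in place of the pairwise conditional random field, and a hand-written toy scoring function in place of the gradient boosted regression trees. The names `align` and `make_toy_score`, the score weights, and the gap penalty are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple


def make_toy_score(seq: str, tmpl: str, seq_ss: str, tmpl_ss: str) -> Callable[[int, int], float]:
    """Toy stand-in for a learned scoring function (weights are assumptions).

    Combines a sequence feature (residue identity) with a structural
    feature (secondary-structure agreement) for positions i and j.
    """
    def score(i: int, j: int) -> float:
        s = 2.0 if seq[i] == tmpl[j] else -1.0       # sequence term
        s += 1.0 if seq_ss[i] == tmpl_ss[j] else 0.0  # structure term
        return s
    return score


def align(seq: str, tmpl: str,
          match_score: Callable[[int, int], float],
          gap: float = -1.0) -> Tuple[float, List[Tuple[int, int]]]:
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns the best total score and the list of aligned index pairs
    (position in seq, position in tmpl).
    """
    n, m = len(seq), len(tmpl)
    # best[i][j]: best score for aligning seq[:i] with tmpl[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + gap
        back[i][0] = "I"  # residue of seq aligned to a gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + gap
        back[0][j] = "D"  # residue of tmpl aligned to a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (best[i - 1][j - 1] + match_score(i - 1, j - 1), "M"),
                (best[i - 1][j] + gap, "I"),
                (best[i][j - 1] + gap, "D"),
            ]
            best[i][j], back[i][j] = max(candidates)
    # Trace back from the bottom-right cell to recover aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        state = back[i][j]
        if state == "M":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif state == "I":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Usage: align a 4-residue target to a 3-residue template (toy data).
score_fn = make_toy_score("ACDE", "ACE", "HHHH", "HHH")
total, pairs = align("ACDE", "ACE", score_fn)
print(total, pairs)
```

In this sketch the score of matching two positions depends on both residues and their structural context, which is the role the learned nonlinear scoring functions play in the method described above; a trained model would replace `make_toy_score` with a learned function of the profile and structural features.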