Halpin Jackson C, Keating Amy E
MIT Department of Biology, 77 Massachusetts Ave., Cambridge, MA 02139.
MIT Department of Biological Engineering, 77 Massachusetts Ave., Cambridge, MA 02139.
bioRxiv. 2024 Jul 24:2024.07.23.604860. doi: 10.1101/2024.07.23.604860.
Protein-protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. The ability to predict domain-SLiM interactions would allow researchers to map protein interaction networks, predict the effects of perturbations to those networks, and develop biologically meaningful hypotheses. Unfortunately, sequence database searches for SLiMs yield mostly biologically irrelevant motif matches (false positives). To improve the prediction of novel SLiM interactions, researchers apply filters that discriminate between biologically relevant and improbable motif matches. One promising criterion for identifying biologically relevant SLiMs is the sequence conservation of the motif, which exploits the fact that functional motifs are more likely to be conserved than spurious motif matches. However, the difficulty of aligning disordered regions has significantly hampered the utility of this approach. We present PairK (pairwise k-mer alignment), an MSA-free method to quantify motif conservation in disordered regions. PairK outperforms both standard MSA-based conservation scores and a modern LLM-based conservation score predictor on the task of identifying biologically important motif instances. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that SLiMs may be more conserved than is implied by MSA-based metrics. PairK is available as open-source code at https://github.com/jacksonh1/pairk.
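The abstract does not describe the algorithm in detail. As an illustration of the general idea of an MSA-free, pairwise k-mer comparison, the following minimal sketch finds, for a query k-mer, its best gapless match in each homologous sequence and then summarizes per-position agreement. The function names, the simple identity scoring, and the toy sequences are all assumptions made for this example; this is not the PairK implementation or the API of the `pairk` package.

```python
# Illustrative sketch only: hypothetical helper names, identity scoring
# assumed as a stand-in for whatever scoring scheme a real tool would use.

def best_kmer_match(query_kmer, target_seq):
    """Return (score, start, kmer) for the best gapless match, or None
    if target_seq is shorter than the query k-mer."""
    k = len(query_kmer)
    best = None
    for start in range(len(target_seq) - k + 1):
        window = target_seq[start:start + k]
        # Assumed scoring: number of identical positions.
        score = sum(a == b for a, b in zip(query_kmer, window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


def pairwise_kmer_matches(query_kmer, homolog_seqs):
    """Match the query k-mer against each homolog independently (no MSA)."""
    return {name: best_kmer_match(query_kmer, seq)
            for name, seq in homolog_seqs.items()}


def column_identity(query_kmer, matches):
    """Fraction of matched homolog k-mers that agree with the query
    at each position; homologs without a match are skipped."""
    kmers = [m[2] for m in matches.values() if m is not None]
    if not kmers:
        return []
    return [sum(kmer[i] == q for kmer in kmers) / len(kmers)
            for i, q in enumerate(query_kmer)]


# Toy, made-up sequences for demonstration.
homologs = {"a": "GGPPLPGG", "b": "AAPPVPAA", "c": "MK"}
matches = pairwise_kmer_matches("PPLP", homologs)
conservation = column_identity("PPLP", matches)
print(matches)
print(conservation)
```

Because each homolog is searched independently, the matched k-mers do not need to occupy the same column of a global alignment, which is the property that makes such an approach attractive for disordered regions.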