Sorbonne Université, CNRS, Laboratoire Jean Perrin (UMR 8237), F-75005 Paris, France.
Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.
PLoS Comput Biol. 2019 Oct 14;15(10):e1007179. doi: 10.1371/journal.pcbi.1007179. eCollection 2019 Oct.
Determining which proteins interact is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins also possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful for predicting interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We further demonstrate the ability of these methods to correctly predict pairings among real paralogous proteins with genome proximity but no known direct physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We finally discuss how to distinguish physically interacting proteins from proteins that only share a common evolutionary history.
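To make the idea of scoring candidate partners from sequence data concrete, the following is a minimal sketch of a mutual-information-style pairing step. It is not the authors' implementation: the two-letter alphabet, the pseudocount value, the function names (`pmi_table`, `pairing_score`, `best_pairing`), and the brute-force search over permutations are all illustrative assumptions. It covers only a single scoring-and-pairing step with a known training set, not the iterative procedure of the IPA, and it assumes each species has the same number of paralogs in both families.

```python
import math
from collections import Counter
from itertools import permutations


def pmi_table(training_pairs, alphabet="AB", pseudocount=0.5):
    """Pointwise mutual information for every inter-protein column pair (i, j),
    estimated from known (seq_a, seq_b) partner pairs. Hypothetical helper."""
    n = len(training_pairs)
    len_a = len(training_pairs[0][0])
    len_b = len(training_pairs[0][1])
    q = len(alphabet)
    table = {}
    for i in range(len_a):
        for j in range(len_b):
            counts = Counter((a[i], b[j]) for a, b in training_pairs)
            # Joint frequencies, regularized by a uniform pseudocount.
            joint = {
                (x, y): (counts[(x, y)] + pseudocount / q**2) / (n + pseudocount)
                for x in alphabet
                for y in alphabet
            }
            # Marginals derived from the regularized joint, so they stay consistent.
            marg_a = {x: sum(joint[(x, y)] for y in alphabet) for x in alphabet}
            marg_b = {y: sum(joint[(x, y)] for x in alphabet) for y in alphabet}
            for (x, y), f in joint.items():
                table[(i, j, x, y)] = math.log(f / (marg_a[x] * marg_b[y]))
    return table


def pairing_score(seq_a, seq_b, table):
    """Sum of pointwise mutual information over all inter-protein column pairs."""
    return sum(
        table[(i, j, x, y)]
        for i, x in enumerate(seq_a)
        for j, y in enumerate(seq_b)
    )


def best_pairing(seqs_a, seqs_b, table):
    """One-to-one pairing within a species that maximizes the total score.
    Brute force over permutations: an assumption that only suits tiny examples."""
    best_perm, best_total = None, -math.inf
    for perm in permutations(range(len(seqs_b))):
        total = sum(
            pairing_score(a, seqs_b[k], table) for a, k in zip(seqs_a, perm)
        )
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm)


# Toy training set in which each column of A matches the same column of B.
training = [("AA", "AA"), ("BB", "BB"), ("AB", "AB"), ("BA", "BA")]
table = pmi_table(training)
# Index k at position m means seqs_a[m] is paired with seqs_b[k].
pairing = best_pairing(["AA", "BB"], ["BB", "AA"], table)
```

In this toy case the statistical signal comes from matching columns in the training pairs; with real alignments, such a score would pick up both coevolution between contacting residues and correlations due to shared phylogeny, which is the distinction the abstract examines.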