Wilkins Gary R, Lugo-Martinez Jose, Murphy Robert F
Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University.
bioRxiv. 2024 Oct 25:2024.10.25.620244. doi: 10.1101/2024.10.25.620244.
Proteins interact to form complexes that play a crucial role in cell function. Data on pairwise protein-protein interactions (PPI) typically come from a combination of sample separation and mass spectrometry. Since 2010, several extensive, high-throughput, mass spectrometry-based experimental studies have dramatically expanded public repositories of PPI data and, by extension, our knowledge of protein complexes. Unfortunately, limited overlap between experiments, modality-oriented biases, and the prohibitive cost of reproducing experiments continue to limit coverage of the human protein assembly map, which both underscores the need for computational approaches and spurs their development. Here, we present a new method for predicting the strength of protein interactions. It addresses two important issues that have limited past PPI prediction approaches: incomplete feature sets and incomplete proteome coverage. For a given collection of protein pairs, we fused data from heterogeneous sources into a feature matrix and identified the minimal set of feature partitions for which a non-empty set of protein pairs had complete values. For each such feature partition, we trained a classifier to predict PPI probabilities. We then calculated an overall prediction for a given protein pair by weighting the probabilities from all models that applied to that pair. Our approach accurately identified known and highly probable PPI, far exceeding the performance of current approaches and providing more complete proteome coverage. We then used the predicted probabilities to assemble complexes using previously described graph-based tools and clustering algorithms and again obtained improved results. Lastly, we used features for three human cell lines to predict PPI and complex scores and identified complexes predicted to differ between those cell lines.
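The partition-and-weight idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal pure-Python illustration under several assumptions: each distinct pattern of observed features in the toy data stands in for a "feature partition", a plain gradient-descent logistic regression stands in for the classifier, and each model is weighted by the number of training pairs it saw. The function names (`train_partition_models`, `predict_pair`) and the toy feature values are hypothetical.

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme scores.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Stand-in classifier: batch gradient-descent logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def observed(row):
    """Indices of the features that are present (not None) for one pair."""
    return frozenset(j for j, v in enumerate(row) if v is not None)

def train_partition_models(rows, labels):
    """Train one model per feature subset that has complete rows (assumed scheme)."""
    models = []
    for subset in {observed(r) for r in rows}:
        if not subset:
            continue
        cols = sorted(subset)
        # Every pair whose observed features cover this subset is usable.
        idx = [i for i, r in enumerate(rows) if subset <= observed(r)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # both classes are needed to fit a classifier
        X = [[rows[i][j] for j in cols] for i in idx]
        w, b = fit_logistic(X, y)
        # Assumption: weight each model by its number of training pairs.
        models.append({"cols": cols, "w": w, "b": b, "weight": len(idx)})
    return models

def predict_pair(models, row):
    """Weighted average over all models whose features this pair has."""
    avail = observed(row)
    num = den = 0.0
    for m in models:
        if set(m["cols"]) <= avail:
            z = m["b"] + sum(w * row[j] for w, j in zip(m["w"], m["cols"]))
            num += m["weight"] * sigmoid(z)
            den += m["weight"]
    return num / den if den else None  # None: no model applies

# Toy feature matrix: 3 hypothetical features per protein pair, None = missing.
rows = [
    [0.9, 0.8, None], [0.8, None, 0.9], [0.9, 0.9, 0.8], [None, 0.9, 0.8],
    [0.1, 0.2, None], [0.2, None, 0.1], [0.1, 0.1, 0.2], [None, 0.1, 0.2],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = interacting, 0 = non-interacting

models = train_partition_models(rows, labels)
high = predict_pair(models, [0.85, None, 0.9])   # only features 0 and 2 known
low = predict_pair(models, [0.15, 0.1, None])    # only features 0 and 1 known
none = predict_pair(models, [None, 0.5, None])   # no trained model applies
```

Because a pair only needs to match at least one trained feature subset, pairs with partially missing features still receive a score, which is the intuition behind the improved proteome coverage claimed above. The choice of weights here is purely illustrative; the abstract does not specify how the model probabilities are weighted.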