Zubek Julian, Tatjewski Marcin, Boniecki Adam, Mnich Maciej, Basu Subhadip, Plewczynski Dariusz
Centre of New Technologies, University of Warsaw, Warsaw, Poland; Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland.
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.
PeerJ. 2015 Jul 2;3:e1041. doi: 10.7717/peerj.1041. eCollection 2015.
Accurate identification of protein-protein interactions (PPI) is a key step in understanding proteins' biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for the prediction of protein-protein interactions. We start with carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Second, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein-protein interaction is made using a 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level, prediction. The level-II predictor improves the results further by means of a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods score below 0.6 on recall, precision, and AUC. Our results demonstrate that a multi-scale sequence feature aggregation procedure is able to improve the machine learning results by more than 10% compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).
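The all-against-all segment matrix described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window width, the function names (`segments`, `toy_segment_score`, `interaction_matrix`, `matrix_features`), and the charge-complementarity score are all assumptions made for illustration. In particular, `toy_segment_score` is only a stand-in for a trained level-I classifier, and the summary statistics stand in for whatever features a level-II predictor would actually consume.

```python
from typing import Callable, Dict, List

# Hypothetical residue groups used only by the toy scorer below.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def segments(seq: str, width: int) -> List[str]:
    """Return all overlapping windows of `width` residues from `seq`."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def toy_segment_score(a: str, b: str) -> float:
    """Stand-in for a trained level-I classifier (an assumption, not the
    paper's model): fraction of aligned positions with opposite charges."""
    hits = sum(
        1
        for x, y in zip(a, b)
        if (x in POSITIVE and y in NEGATIVE) or (x in NEGATIVE and y in POSITIVE)
    )
    return hits / len(a)


def interaction_matrix(
    seq_a: str,
    seq_b: str,
    width: int,
    score: Callable[[str, str], float],
) -> List[List[float]]:
    """Score every segment of `seq_a` against every segment of `seq_b`.

    Rows index segments of the first protein, columns index segments of
    the second, giving the 2D all-against-all representation.
    """
    rows = segments(seq_a, width)
    cols = segments(seq_b, width)
    if not rows or not cols:
        raise ValueError("width exceeds the length of at least one sequence")
    return [[score(sa, sb) for sb in cols] for sa in rows]


def matrix_features(matrix: List[List[float]]) -> Dict[str, float]:
    """Collapse the matrix into simple summary features that a
    protein-level (level-II) classifier could take as input."""
    flat = [value for row in matrix for value in row]
    return {"max": max(flat), "mean": sum(flat) / len(flat)}
```

Calling `interaction_matrix("KKRAG", "DEEGA", 3, toy_segment_score)` yields a 3 x 3 matrix, one row per window of the first sequence and one column per window of the second. Replacing `toy_segment_score` with the probability output of a trained classifier would bring the sketch closer to the two-stage setup the abstract describes.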