Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa068.
In recent years, high-throughput experimental techniques have significantly enhanced the accuracy and coverage of protein-protein interaction identification, including human-pathogen protein-protein interactions (HP-PPIs). Despite this progress, experimental methods are generally costly in both time and labour, especially given the enormous number of potential interacting protein partners. Developing computational methods to predict interactions between humans and bacterial pathogens has thus become critical, both for facilitating the detection of interactions and for mining incomplete interaction maps. In this paper, we present a systematic evaluation of machine learning-based computational methods for predicting human-bacterium protein-protein interactions (HB-PPIs). We first reviewed a large number of publicly available HP-PPI databases and critically evaluated their availability. Taking advantage of the well-structured nature of the data, we then preprocessed them and identified six bacterial pathogens, each with a human host, that could be used to study HB-PPIs. Additionally, we thoroughly reviewed the literature on 'host-pathogen interactions' and summarized the existing models, which we then used to jointly study the impact of different feature representation algorithms and to evaluate the performance of existing machine learning models. Owing to the abundance of sequence information and the limited scale of other protein-related information, we adopted the primary protocol from the literature and dedicated our analysis to a comprehensive assessment of sequence information and machine learning models. A systematic evaluation of machine learning models and a wide range of sequence-based feature representation algorithms is presented as a comparative survey of HB-PPI prediction performance.
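To make the idea of a sequence-based feature representation concrete, the following is a minimal sketch of one common encoding, amino acid composition (AAC), applied to a human-bacterium protein pair. The abstract does not name the specific encodings that were evaluated, so the choice of AAC, the function names `aac` and `pair_features`, and the example sequences are all illustrative assumptions rather than the protocol used in the paper.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code (fixed order defines the vector layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> List[float]:
    """Amino acid composition: the fraction of each standard residue.

    Non-standard symbols (e.g. 'X') are ignored. Hypothetical helper,
    not taken from the paper.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues: return an all-zero vector instead of dividing by zero.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def pair_features(human_seq: str, bacterial_seq: str) -> List[float]:
    """Represent a candidate interaction by concatenating both AAC vectors (20 + 20 values)."""
    return aac(human_seq) + aac(bacterial_seq)


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    features = pair_features("MKTAYIAKQR", "MSDKLLVEG")
    print(len(features))
```

A vector of this kind, one per candidate protein pair, could then be passed to any standard classifier. Concatenation is only one way to combine the two proteins' features, and it is used here for simplicity.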