NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, 28 Medical Drive, Singapore 117456.
BMC Genomics. 2011 Nov 30;12 Suppl 3(Suppl 3):S20. doi: 10.1186/1471-2164-12-S3-S20.
M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the functions and relationships of proteins in various strains of M. tuberculosis. Protein-protein interaction (PPI) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear, which hampers the effectiveness of research that relies on them. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first is the high-throughput bacterial two-hybrid (B2H) PPI dataset from Wang et al.'s recent paper in the Journal of Proteome Research. The second is from the STRING database, version 8.3, and consists entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions.
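The level of agreement between two PPI datasets can be illustrated with a minimal Python sketch. It assumes each dataset is a list of protein identifier pairs, treats interactions as undirected, and keeps only scored pairs at or above a threshold (770 here, matching the STRING cutoff used in this abstract). The function names, the toy identifiers, and the choice of Jaccard index as the agreement measure are illustrative assumptions, not the measures used in the paper.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = FrozenSet[str]


def normalize(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Treat interactions as undirected and drop self-interactions."""
    return {frozenset((a, b)) for a, b in pairs if a != b}


def filter_by_score(
    scored: Iterable[Tuple[str, str, int]], min_score: int = 770
) -> List[Tuple[str, str]]:
    """Keep only scored pairs whose score is at least min_score."""
    return [(a, b) for a, b, score in scored if score >= min_score]


def agreement(x: Set[Pair], y: Set[Pair]) -> Dict[str, float]:
    """Overlap statistics between two sets of undirected pairs."""
    shared = x & y
    union = x | y
    return {
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "fraction_of_x": len(shared) / len(x) if x else 0.0,
        "fraction_of_y": len(shared) / len(y) if y else 0.0,
    }


# Toy data only: hypothetical identifiers, not real H37Rv interactions.
b2h_pairs = normalize([("A", "B"), ("C", "B"), ("D", "D"), ("E", "F")])
string_scored = [("B", "A", 900), ("B", "C", 500), ("E", "G", 800), ("F", "E", 770)]
string_pairs = normalize(filter_by_score(string_scored, min_score=770))

result = agreement(b2h_pairs, string_pairs)
print(result)
```

Normalizing each pair to an unordered set before comparison matters because the same interaction may be listed as (A, B) in one dataset and (B, A) in the other.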
To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe that a significantly greater proportion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) than in the H37Rv B2H PPI dataset have correlated gene expression profiles and coherent informative GO term annotations in both interaction partners. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than to the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset, and this similarity is much lower than that between the S. aureus MRSA252 predicted physical interologs from IntAct and the S. aureus MRSA252 pull-down PPIs. Comparative analysis with several representative two-hybrid PPI datasets in other species further confirms that the H37Rv B2H PPI dataset is of low quality. Next, to test the possibility that the H37Rv STRING PPIs are not purely direct physical interactions, we compare M. tuberculosis H37Rv protein pairs that catalyze adjacent steps in enzymatic reactions to the B2H PPIs and to the predicted PPIs in STRING; these protein pairs show much lower similarity with the B2H PPIs than with the STRING PPIs. This result strongly suggests that the H37Rv STRING PPIs are more likely to correspond to indirect relationships between protein pairs than to B2H PPIs. For more precise support, we turn to S. cerevisiae, whose interactome has been comprehensively studied. We compare S. cerevisiae predicted PPIs in STRING to three independent protein relationship datasets, which respectively comprise PPIs reported in Y2H assays, protein pairs reported to be in the same protein complexes, and protein pairs that catalyze successive steps in enzymatic reactions. Our analysis reveals that S. cerevisiae predicted STRING PPIs have much higher similarity to the latter two types of protein pairs than to two-hybrid PPIs. As the H37Rv STRING PPIs are predicted using methods similar to those used for the S. cerevisiae STRING PPIs, this suggests that the H37Rv STRING PPIs are likewise more likely to correspond to the latter two types of protein pairs than to two-hybrid PPIs.
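One of the quality checks above, the proportion of PPIs whose two partners have correlated gene expression profiles, can be sketched as follows. This is only an illustration under stated assumptions: it uses the Pearson correlation coefficient, a hypothetical cutoff of 0.5, and made-up expression vectors. The abstract does not specify the correlation measure or threshold actually used.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def fraction_coexpressed(
    pairs: Iterable[Tuple[str, str]],
    expression: Dict[str, Sequence[float]],
    min_r: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> float:
    """Fraction of pairs with profiles for both partners and r >= min_r."""
    scored = 0
    hits = 0
    for a, b in pairs:
        if a not in expression or b not in expression:
            continue  # skip pairs lacking expression data
        scored += 1
        if pearson(expression[a], expression[b]) >= min_r:
            hits += 1
    return hits / scored if scored else 0.0


# Toy data only: hypothetical proteins and expression values.
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, 3.0, 2.0, 4.0],
}
dataset_pairs = [("A", "B"), ("A", "C"), ("A", "E")]
fraction = fraction_coexpressed(dataset_pairs, expression)
print(fraction)
```

Computing this fraction separately for each PPI dataset gives two proportions that can then be compared; pairs without expression data for both partners are excluded from the denominator so that missing measurements do not count against a dataset.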
The H37Rv B2H PPI dataset is of low quality and should not be used as the gold standard to assess the quality of other (possibly predicted) H37Rv PPI datasets. The H37Rv STRING PPI dataset as a whole is also of low quality; nevertheless, the subset of STRING PPIs with score ≥ 770 has satisfactory quality. However, these STRING "PPIs" should be interpreted as functional associations, which include a substantial portion of indirect protein interactions, rather than as direct physical interactions. These two factors cause the strikingly low similarity between the two main H37Rv PPI datasets. The results and conclusions of this comparative analysis provide valuable guidance for using these M. tuberculosis H37Rv PPI datasets in subsequent studies for a wide range of purposes.