Bock Joel R, Gough David A
Department of Bioengineering, University of California San Diego, 9500 Gilman Drive, La Jolla 92093-0412, USA.
Bioinformatics. 2003 Jan;19(1):125-34. doi: 10.1093/bioinformatics/19.1.125.
A major post-genomic scientific and technological pursuit is to describe the functions performed by the proteins encoded by the genome. One strategy is to first identify the protein-protein interactions in a proteome, then determine the pathways and overall structure relating these interactions, and finally statistically infer the functional roles of individual proteins. Although huge amounts of genomic data are at hand, current experimental protein interaction assays must overcome technical problems to scale up for high-throughput analysis. In the meantime, bioinformatics approaches may help bridge the information gap required for inference of protein function. In this paper, a previously described data mining approach to prediction of protein-protein interactions (Bock and Gough, 2001, Bioinformatics, 17, 455-460) is extended to interaction mining on a proteome-wide scale. An algorithm (the phylogenetic bootstrap) is introduced, which suggests traversal of a phenogram, interleaving rounds of computation and experiment, to develop a knowledge base of protein interactions in genetically similar organisms.
The interaction mining approach was demonstrated by building a learning system based on 1,039 experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori. An estimate of the generalization performance of the classifier was derived from 10-fold cross-validation, which indicated expected upper bounds on precision of 80% and sensitivity of 69% when applied to related organisms. One such organism is the enteric pathogen Campylobacter jejuni, in which comprehensive machine learning prediction of all possible pairwise protein-protein interactions was performed. The resulting network of interactions shares an average protein connectivity characteristic with previous investigations reported in the literature, offering strong evidence supporting the biological feasibility of the hypothesized map. For inferences about complete proteomes, in which the number of pairwise non-interactions is expected to be much larger than the number of actual interactions, we anticipate that the sensitivity will remain the same but precision may decrease. We present specific biological examples of two subnetworks of protein-protein interactions in C. jejuni resulting from the application of this approach, including elements of a two-component signal transduction system for thermoregulation, and a ferritin uptake network.
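The expectation that sensitivity holds while precision falls follows from how the two metrics are defined: sensitivity depends only on the true interactions, whereas precision also depends on how many non-interacting pairs are screened. The short Python sketch below illustrates this arithmetic. It is not the authors' method: the function names are hypothetical, and it assumes that the cross-validation figures (precision 80%, sensitivity 69%) came from a test set with one non-interacting pair per interacting pair, a ratio the abstract does not state.

```python
def implied_false_positive_rate(precision, sensitivity, neg_per_pos=1.0):
    """Solve precision = s / (s + fpr * r) for fpr.

    s is the sensitivity and r is the number of negative pairs per
    positive pair in the evaluation set (assumed, not from the paper).
    """
    return sensitivity * (1.0 - precision) / (precision * neg_per_pos)


def precision_at_ratio(sensitivity, fpr, neg_per_pos):
    """Precision when sensitivity and false positive rate stay fixed
    but the data contain neg_per_pos negatives for every positive."""
    return sensitivity / (sensitivity + fpr * neg_per_pos)


# Figures quoted in the abstract (upper bounds from 10-fold cross-validation).
SENSITIVITY = 0.69
CV_PRECISION = 0.80

# ASSUMPTION: a 1:1 class ratio in cross-validation.
fpr = implied_false_positive_rate(CV_PRECISION, SENSITIVITY, neg_per_pos=1.0)

# Sensitivity is unchanged by the ratio; precision shrinks as negatives dominate.
for ratio in (1, 10, 100, 1000):
    print(ratio, round(precision_at_ratio(SENSITIVITY, fpr, ratio), 4))
```

Under these assumptions the implied false positive rate is fixed once, and precision then falls steadily as the proportion of non-interacting pairs grows, which is the qualitative behavior the abstract anticipates for proteome-wide prediction.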