Bader Joel S, Chaudhuri Amitabha, Rothberg Jonathan M, Chant John
Department of Biomedical Engineering, 201C Clark Hall, Johns Hopkins University, 3400 N. Charles St., Baltimore, Maryland 21218, USA.
Nat Biotechnol. 2004 Jan;22(1):78-85. doi: 10.1038/nbt924. Epub 2003 Dec 14.
Although genome-scale technologies have benefited from statistical measures of data quality, extracting biologically relevant pathways from high-throughput proteomics data remains a challenge. Here we develop a quantitative method for evaluating proteomics data. We present a logistic regression approach that uses statistical and topological descriptors to predict the biological relevance of protein-protein interactions obtained from high-throughput screens for yeast. Other sources of information, including mRNA expression, genetic interactions and database annotations, are subsequently used to validate the model predictions without bias or cross-pollution. Novel topological statistics show hierarchical organization of the network of high-confidence interactions: protein complex interactions extend one to two links, and genetic interactions represent an even finer scale of organization. Knowledge of the maximum number of links that indicates a significant correlation between protein pairs (correlation distance) enables the integrated analysis of proteomics data with data from genetics and gene expression. The type of analysis presented will be essential for analyzing the growing amount of genomic and proteomics data in model organisms and humans.
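The abstract's notion of a correlation distance rests on counting the links that separate two proteins in the interaction network. The sketch below is a minimal illustration of that counting step only, using a breadth-first search over an undirected edge list. It is not the authors' implementation; the function names `link_distances` and `pairs_within`, and the toy network with placeholder protein names, are assumptions made for illustration and do not come from the paper.

```python
from collections import deque


def link_distances(edges, source):
    """Return the shortest-path link count from `source` to each reachable protein."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Standard breadth-first search: the first visit gives the shortest distance.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def pairs_within(edges, source, max_links):
    """Return the proteins that lie 1..max_links links away from `source`, sorted by name."""
    dist = link_distances(edges, source)
    return sorted(p for p, d in dist.items() if 0 < d <= max_links)


# Hypothetical toy network; names are placeholders, not data from the paper.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
distances = link_distances(toy_edges, "A")
close_partners = pairs_within(toy_edges, "A", 2)
```

With distances like these in hand, an external measurement such as mRNA expression correlation could be grouped by link count to see how far along the network a significant correlation persists. That grouping step is left out here because the abstract does not specify how it was done.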