Proteome coverage prediction for integrated proteomics datasets.

Author information

Claassen Manfred, Aebersold Ruedi, Buhmann Joachim M

Affiliation

Department of Computer Science, ETH Zurich, Zurich, Switzerland.

Publication information

J Comput Biol. 2011 Mar;18(3):283-93. doi: 10.1089/cmb.2010.0261.

DOI:10.1089/cmb.2010.0261

PMID:21385034

Abstract

Comprehensive characterization of a proteome is a fundamental goal in proteomics. To maximize proteome coverage for a complex protein mixture, i.e., to identify as many proteins as possible, several fractionation experiments are typically performed and the individual fractions are subjected to mass spectrometric analysis. The resulting data are integrated into large and heterogeneous datasets. Proteome coverage prediction is the task of extrapolating the number of protein discoveries expected from future measurements, conditioned on a sequence of measurements already performed. Proteome coverage prediction at an early stage enables experimentalists to design and plan efficient proteomics studies. To date, no method reliably predicts proteome coverage from integrated datasets. We present a generalized hierarchical Pitman-Yor process model that explicitly captures the redundancy within integrated datasets. The accuracy of our approach for proteome coverage prediction is assessed by applying it to an integrated proteomics dataset for the bacterium L. interrogans. The proposed procedure outperforms ad hoc extrapolation methods and prediction methods designed for non-integrated datasets. Furthermore, the maximally achievable proteome coverage is estimated for the experimental setup underlying the L. interrogans dataset. We discuss the implications of our results for determining rational stop criteria and their influence on the design of efficient and reliable proteomics studies.
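
Illustrative note: the extrapolation task described in the abstract can be sketched with a plain, single-level Pitman-Yor process. This is not the authors' generalized hierarchical model, and it ignores the redundancy structure of integrated datasets. It relies only on the standard Pitman-Yor predictive rule: after n observations covering k distinct types, the next observation is a new type with probability (concentration + discount * k) / (concentration + n). The function names, the parameter values, and the counts in the usage example are all hypothetical; in practice the discount and concentration would have to be fitted to the data.

```python
import random


def new_type_probability(n_obs, n_types, discount, concentration):
    """Probability that the next observation is a previously unseen type
    under a single-level Pitman-Yor process (standard predictive rule)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    if concentration + n_obs <= 0:
        raise ValueError("concentration + n_obs must be positive")
    return (concentration + discount * n_types) / (concentration + n_obs)


def simulate_discoveries(n_obs, n_types, n_future, discount, concentration, rng):
    """Simulate one future trajectory of the number of distinct types.

    Only (n_obs, n_types) is tracked, because the new-type probability
    does not depend on the individual per-type counts."""
    curve = []
    for _ in range(n_future):
        p_new = new_type_probability(n_obs, n_types, discount, concentration)
        if rng.random() < p_new:
            n_types += 1
        n_obs += 1
        curve.append(n_types)
    return curve


def predict_coverage(n_obs, n_types, n_future, discount, concentration,
                     n_runs=200, seed=0):
    """Monte Carlo estimate of the expected number of distinct types
    after each of the next n_future observations."""
    rng = random.Random(seed)
    totals = [0.0] * n_future
    for _ in range(n_runs):
        curve = simulate_discoveries(n_obs, n_types, n_future,
                                     discount, concentration, rng)
        for i, k in enumerate(curve):
            totals[i] += k
    return [t / n_runs for t in totals]


if __name__ == "__main__":
    # Hypothetical numbers: 5000 identifications so far, 800 distinct proteins.
    # discount and concentration are assumed values, not fitted ones.
    expected = predict_coverage(5000, 800, 2000, discount=0.5, concentration=50.0)
    print("expected distinct proteins after 2000 more identifications:",
          round(expected[-1], 1))
```

The simulated curve flattens as the new-type probability shrinks, which is the kind of saturation behaviour a stop criterion would monitor. A hierarchical model for integrated datasets would additionally share types across the individual experiments, which this sketch does not attempt.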
