Gilchrist Michael A, Salter Laura A, Wagner Andreas
Department of Biology, University of New Mexico, Albuquerque 87106, USA.
Bioinformatics. 2004 Mar 22;20(5):689-700. doi: 10.1093/bioinformatics/btg469. Epub 2004 Jan 22.
Accurately identifying protein function on a proteome-wide scale requires integrating data within and between high-throughput experiments. High-throughput proteomic datasets often have high error rates and thus yield incomplete and contradictory information. In this study, we develop a simple statistical framework using Bayes' law to interpret such data and to combine information from different high-throughput experiments. To illustrate our approach, we apply it to two protein complex purification datasets.
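As a rough illustration of how Bayes' law can turn a noisy observation into a probability, the sketch below computes the posterior probability that two proteins share a complex, given that they were observed together in a purification. The function name, the prior, and both likelihoods are hypothetical values chosen for illustration; they are not taken from the study, and this is not the authors' implementation.

```python
def posterior_same_complex(prior, p_obs_given_same, p_obs_given_diff):
    """Bayes' law: P(same complex | proteins observed together).

    prior            -- assumed P(same complex) before seeing any data
    p_obs_given_same -- assumed P(observed together | same complex)
    p_obs_given_diff -- assumed P(observed together | different complexes),
                        i.e. a false positive rate
    """
    numerator = p_obs_given_same * prior
    evidence = numerator + p_obs_given_diff * (1.0 - prior)
    return numerator / evidence


# Hypothetical inputs: a rare true association, a noisy assay.
p = posterior_same_complex(prior=0.01,
                           p_obs_given_same=0.5,
                           p_obs_given_diff=0.001)
print(f"P(same complex | observed together) = {p:.3f}")
```

Even with a low prior, a small false positive rate relative to the true positive rate makes a single co-purification fairly informative under these assumed numbers.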
Our approach shows how to use high-throughput data to accurately calculate the probability that two proteins are part of the same complex. Importantly, it does not need a reference set of verified protein interactions to determine the false positive and false negative error rates of protein association. We also demonstrate how to combine information from two separate protein purification datasets into a single dataset that has greater coverage and accuracy than either dataset alone. In addition, we provide a technique for estimating the total number of proteins that can be detected using a particular experimental technique.
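One simple way to picture the combination of two datasets is to multiply likelihood ratios in odds form. The sketch below assumes that the two experiments err independently once the true status of a protein pair is fixed; that independence assumption, the function names, and all error rates are illustrative choices, not values or procedures from the study. In particular, it takes the false positive and false negative rates as given and does not show how they would be estimated without a reference set.

```python
def likelihood_ratio(observed, fn_rate, fp_rate):
    """P(outcome | same complex) / P(outcome | different complexes)."""
    if observed:
        return (1.0 - fn_rate) / fp_rate
    return fn_rate / (1.0 - fp_rate)


def combined_posterior(prior, observations):
    """Combine datasets assumed conditionally independent given the truth.

    observations -- iterable of (observed, fn_rate, fp_rate) tuples,
                    one per dataset (all rates are hypothetical inputs)
    """
    odds = prior / (1.0 - prior)
    for observed, fn_rate, fp_rate in observations:
        odds *= likelihood_ratio(observed, fn_rate, fp_rate)
    return odds / (1.0 + odds)


PRIOR = 0.01                     # assumed prior P(same complex)
DATASET_A = (0.5, 0.001)         # assumed (fn_rate, fp_rate) for dataset A
DATASET_B = (0.6, 0.002)         # assumed (fn_rate, fp_rate) for dataset B

a_only = combined_posterior(PRIOR, [(True, *DATASET_A)])
both = combined_posterior(PRIOR, [(True, *DATASET_A), (True, *DATASET_B)])
a_not_b = combined_posterior(PRIOR, [(True, *DATASET_A), (False, *DATASET_B)])

print(f"seen in A only (B not considered): {a_only:.3f}")
print(f"seen in A and in B:                {both:.3f}")
print(f"seen in A, absent from B:          {a_not_b:.3f}")
```

Under these assumed rates, agreement between the datasets raises the posterior above what either gives alone, while a disagreement lowers it without discarding the pair outright.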
A suite of simple programs to accomplish some of the above tasks is available at www.unm.edu/~compbio/software/DatasetAssess