Bioinformatics & High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington 98101, USA.
OMICS. 2010 Jun;14(3):309-14. doi: 10.1089/omi.2010.0034.
Large amounts of mass spectrometry (MS) proteomics data are now publicly available; however, little attention has been given to how best to combine these data and assess the error rates for protein identification. The objective of this article is to show how variation in the type and amount of data included with each study affects coverage of the yeast proteome and estimation of the false discovery rate (FDR). Our analysis of a subset of the publicly available yeast data showed that failing to reevaluate the FDR when combining protein identifications (IDs) from different experiments underestimated the FDR approximately threefold. A worst-case approximation of the FDR was only slightly larger than the FDR estimated from randomized database matches. Using a weighted model to emphasize the most informative experimental data increased the number of IDs at a 1% FDR compared with other meta-analysis approaches. Also, using an FDR higher than 1% results in a very high rate of false discoveries for IDs above the 1% threshold. Ideally, raw MS data will be made publicly available for complete and consistent reanalysis. When raw data are not available, determining a combined FDR on the basis of the worst-case estimation provides a reasonable approximation of the FDR. When combining experimental results, adding experiments yields diminishing and in some cases negative returns on protein identifications. It may be beneficial to include only those experiments that generate the most unique identifications owing to solid experimental design and sensitive instrumentation.
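The abstract does not spell out how the worst-case FDR is computed, so the following Python sketch shows one plausible reading rather than the authors' actual procedure. It assumes that each experiment reports a list of protein IDs at a stated FDR, that the expected false IDs from different experiments never overlap, and that true IDs may overlap freely. Under those assumptions the combined FDR is the summed expected false IDs divided by the number of unique IDs in the union. The function name `worst_case_combined_fdr` and the toy data are hypothetical.

```python
def worst_case_combined_fdr(experiments):
    """Worst-case FDR for the union of several protein ID lists.

    experiments: iterable of (protein_ids, reported_fdr) pairs.
    Assumption (not from the source): false IDs from different
    experiments are all distinct, so their expected counts add up,
    while the denominator is the number of unique IDs in the union.
    """
    union = set()
    expected_false = 0.0
    for protein_ids, reported_fdr in experiments:
        ids = set(protein_ids)
        union |= ids
        # Expected number of false IDs contributed by this experiment.
        expected_false += reported_fdr * len(ids)
    if not union:
        return 0.0
    # An FDR cannot exceed 1, so cap the estimate.
    return min(1.0, expected_false / len(union))


# Toy data: three overlapping experiments, 100 IDs each, all at 1% FDR.
exp_a = ({f"P{i}" for i in range(0, 100)}, 0.01)
exp_b = ({f"P{i}" for i in range(10, 110)}, 0.01)
exp_c = ({f"P{i}" for i in range(20, 120)}, 0.01)

combined = worst_case_combined_fdr([exp_a, exp_b, exp_c])
print(f"worst-case combined FDR: {combined:.3f}")
```

In this toy case the union holds 120 unique IDs and about 3 expected false IDs, so the worst-case estimate is 2.5% even though each list was reported at 1%. That illustrates why reusing the per-experiment FDR for a combined list can understate the error rate.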