Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden.
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
J Proteome Res. 2018 May 4;17(5):1879-1886. doi: 10.1021/acs.jproteome.7b00899. Epub 2018 Apr 16.
A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the proteins themselves. Because some proteins share proteolytic peptides, more than one set of proteins may explain a given set of peptides, so mechanisms are needed that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins deliberately selected to produce tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, a realistic benchmark of protein inference procedures requires samples of known content in which the present proteins share peptides with known absent proteins. Here, we present such a standard, based on human protein fragments expressed in E. coli. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures that exclude shared peptides provide more accurate estimates of errors than methods that include information from shared peptides, while still giving reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins that share tryptic peptides can give a false sense of accuracy for many protein inference methods.
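The shared-peptide problem described in the abstract can be illustrated with a small toy sketch. Everything in it is hypothetical: the peptide strings, the protein names `P1` to `P4`, and the two deliberately naive rules (`use_shared=False` keeps only peptides that map to a single protein, `use_shared=True` accepts every protein that any detected peptide maps to). Neither rule is one of the inference procedures benchmarked in the paper, which are not specified in the abstract; the sketch only shows why a standard with known absent proteins that share peptides with present ones can expose errors.

```python
def infer_proteins(detected_peptides, peptide_to_proteins, use_shared):
    """Return the set of proteins supported by the detected peptides.

    use_shared=False: only peptides unique to one protein count as evidence.
    use_shared=True:  every protein matching any detected peptide is accepted
                      (a naive rule used here purely for illustration).
    """
    inferred = set()
    for peptide in detected_peptides:
        proteins = peptide_to_proteins.get(peptide, set())
        if len(proteins) == 1 or (use_shared and proteins):
            inferred |= proteins
    return inferred


def observed_error_rate(inferred, absent):
    """Fraction of inferred proteins that are known to be absent from the sample."""
    if not inferred:
        return 0.0
    return len(inferred & absent) / len(inferred)


# Hypothetical standard: P1 and P3 are present, P2 and P4 are known to be absent.
peptide_to_proteins = {
    "PEPA": {"P1"},          # unique to a present protein
    "PEPB": {"P1", "P2"},    # shared between a present and an absent protein
    "PEPC": {"P3"},          # unique to a present protein
    "PEPD": {"P3", "P4"},    # shared between a present and an absent protein
}
absent = {"P2", "P4"}
detected = ["PEPA", "PEPB", "PEPC", "PEPD"]

unique_only = infer_proteins(detected, peptide_to_proteins, use_shared=False)
with_shared = infer_proteins(detected, peptide_to_proteins, use_shared=True)

print(sorted(unique_only), observed_error_rate(unique_only, absent))
print(sorted(with_shared), observed_error_rate(with_shared, absent))
```

In this toy case the unique-peptide rule reports only the two present proteins, while the naive shared-peptide rule also reports both absent ones. If the standard contained no shared peptides at all, the two rules would give identical results and the difference would go unnoticed.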