Li Honglan, Park Jonghun, Kim Hyunwoo, Hwang Kyu-Baek, Paek Eunok
School of Computer Science and Engineering, Soongsil University, Seoul 06978, Republic of Korea.
Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea.
J Proteome Res. 2017 Jun 2;16(6):2231-2239. doi: 10.1021/acs.jproteome.7b00033. Epub 2017 May 5.
Proteogenomic searches are useful for novel peptide identification from tandem mass spectra. Usually, separate and multistage approaches are adopted to accurately control the false discovery rate (FDR) in proteogenomic searches. Their performance on novel peptide identification has not been thoroughly evaluated, however, mainly because it is difficult to confirm the existence of identified novel peptides. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods on novel peptide identification: global, separate, and multistage FDR estimation, each coupled to either a target-decoy search or a mixture model-based method. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivities than the others at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied with a reduced set of decoy sequences or with none at all. Considering the different prior probabilities for novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.
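The difference between global and separate FDR estimation in a target-decoy search can be illustrated with a minimal sketch. This is a toy example, not the authors' implementation: the `PSM` record, the function names, the scores, and the simple decoys-over-targets estimator are all assumptions chosen for illustration, and the multistage and mixture model-based methods are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical peptide-spectrum match record (illustrative only)."""
    score: float     # assumed search-engine score; higher is better
    is_decoy: bool   # True if the match is to a decoy sequence
    is_novel: bool   # True if the match is to the novel (non-reference) part


def estimate_fdr(psms, threshold):
    """Assumed target-decoy estimate: decoys / targets scoring >= threshold."""
    targets = sum(1 for p in psms if p.score >= threshold and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= threshold and p.is_decoy)
    return decoys / targets if targets else 0.0


def global_fdr(psms, threshold):
    """Global estimation: known and novel matches are pooled together."""
    return estimate_fdr(psms, threshold)


def separate_fdr(psms, threshold):
    """Separate estimation: one estimate per class (known vs. novel)."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return {
        "known": estimate_fdr(known, threshold),
        "novel": estimate_fdr(novel, threshold),
    }


# Made-up toy data: many confident known matches, few novel ones.
psms = (
    [PSM(20.0, False, False)] * 8    # known targets above the threshold
    + [PSM(15.0, False, True)] * 2   # novel targets above the threshold
    + [PSM(12.0, True, True)]        # novel decoy above the threshold
    + [PSM(5.0, True, False)]        # known decoy below the threshold
)

print(global_fdr(psms, 10.0))
print(separate_fdr(psms, 10.0))
```

In this toy data the pooled estimate is dominated by the large number of known matches, so it looks much lower than the estimate computed on the novel class alone. This is the kind of effect that motivates estimating the FDR for novel peptides separately.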