Li Honglan, Joh Yoon Sung, Kim Hyunwoo, Paek Eunok, Lee Sang-Won, Hwang Kyu-Baek
School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Republic of Korea.
Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea.
BMC Genomics. 2016 Dec 22;17(Suppl 13):1031. doi: 10.1186/s12864-016-3327-5.
Proteogenomics is a promising approach for tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such database inflation has the potential to identify novel peptides. However, it also raises concerns about the sensitivity and reliability of peptide identification. Spurious peptides included in target databases may lead to an underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification because of the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have been largely lacking.
To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with several approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. We also examined the effect of filtering known and novel peptides separately. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability of proteogenomic searches. However, no single method consistently identified the largest (or the smallest) number of novel peptides in real proteogenomic searches.
We propose using a set of search result validation methods together with separate filtering for sensitive and reliable identification of peptides in proteogenomic search.
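To illustrate what separate filtering means in a target-decoy setting, the following is a minimal sketch and not the implementation used in the study. It assumes that each peptide-spectrum match (PSM) carries a score where higher is better, that the FDR at a score cutoff is estimated as the number of decoy hits divided by the number of target hits, and that decoy entries are labeled as known or novel according to the target class they were generated from. The names `PSM`, `accept_at_fdr`, `combined_filter`, and `separate_filter` are hypothetical.

```python
from typing import List, NamedTuple


class PSM(NamedTuple):
    score: float    # assumed convention: higher score = better match
    is_decoy: bool  # True if the hit comes from the decoy database
    is_novel: bool  # True if the hit belongs to the novel-peptide class


def accept_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Return target PSMs in the longest score-ranked prefix whose
    estimated FDR (decoys / targets, an assumed estimator) is <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = 0
    decoys = 0
    best_cut = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p.is_decoy]


def combined_filter(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Apply one FDR cutoff to known and novel hits pooled together."""
    return accept_at_fdr(psms, max_fdr)


def separate_filter(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Apply the FDR cutoff to known and novel hits independently."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return accept_at_fdr(known, max_fdr) + accept_at_fdr(novel, max_fdr)
```

In a pooled analysis, the many confident known-peptide hits can dilute the decoy ratio and let weaker novel hits pass the cutoff. Filtering each class on its own prevents that. This is one way to read the concern raised above about underestimated FDR for novel peptides.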