Aggarwal Suruchi, Raj Anurag, Kumar Dhirendra, Dash Debasis, Yadav Amit Kumar
Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India.
GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac163.
Proteogenomics is the integrated analysis of the genome and proteome. It leverages mass spectrometry (MS)-based proteomics data to improve genome annotations, to understand gene expression control through proteoforms and to find sequence variants, yielding novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control error rates, proteogenomics depends on the target-decoy search strategy, the de facto method for false discovery rate (FDR) estimation in proteomics. Proteogenomic databases constructed by three- or six-frame translation of nucleotide databases not only increase the search space and compute time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to more stringent score thresholds at a given FDR. Understanding these factors and applying modified strategies, such as two-pass database search or peptide-class-specific FDR, can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret proteogenomics results appropriately and control false positives and false negatives in a more informed manner. In this review, we first briefly discuss proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
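To make the contrast between pooled and peptide-class-specific FDR concrete, the following is a minimal sketch and not the procedure of any particular tool. It assumes a concatenated target-decoy search, a score where higher is better, and the common estimator FDR ≈ decoys / targets among matches at or above a cutoff. The `PSM` record, the class labels ("known", "novel") and both function names are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class PSM(NamedTuple):
    """One peptide-spectrum match (hypothetical record for this sketch)."""
    score: float        # search-engine score; higher is assumed to be better
    is_decoy: bool      # True if the match came from the decoy database
    peptide_class: str  # illustrative label, e.g. "known" or "novel"


def fdr_cutoff(psms: Iterable[PSM], fdr: float) -> Optional[float]:
    """Return the lowest score at which decoys/targets <= fdr, or None.

    Assumes the decoys/targets estimator for a concatenated search.
    Tied scores are processed in input order; a production tool would
    treat ties as a group.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = None
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted list while the estimate stays in bounds.
        if targets and decoys / targets <= fdr:
            cutoff = psm.score
    return cutoff


def class_specific_cutoffs(
    psms: Iterable[PSM], fdr: float
) -> Dict[str, Optional[float]]:
    """Estimate a separate cutoff for each peptide class."""
    by_class = defaultdict(list)
    for psm in psms:
        by_class[psm.peptide_class].append(psm)
    return {cls: fdr_cutoff(group, fdr) for cls, group in by_class.items()}
```

Calling `fdr_cutoff` on all matches gives a single pooled cutoff, whereas `class_specific_cutoffs` estimates one per class. When novel peptides score closer to decoys than known peptides do, the novel-class cutoff will typically be stricter than the pooled one, which is the motivation for class-specific FDR described above.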