UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia.
Clin Chim Acta. 2019 Nov;498:38-46. doi: 10.1016/j.cca.2019.08.010. Epub 2019 Aug 14.
One of the best-established areas within multi-omics is proteogenomics, whose underpinning technologies are next-generation sequencing (NGS) and mass spectrometry (MS). Proteogenomics has contributed significantly to genome (re-)annotation, in which novel coding sequences (CDS) are identified and confirmed. By incorporating in-silico translated genome variants into the protein database, single amino acid variants (SAAVs) and splice proteoforms can be identified and quantified at the peptide level. The application of proteogenomics in cancer research potentially enables the identification of patient-specific proteoforms, as well as the association of the efficacy of, or resistance to, cancer therapy with different mutations. Here, we discuss how NGS/TGS data are analyzed and incorporated into the proteogenomic framework. These sequence data mainly originate from whole genome sequencing (WGS), whole exome sequencing (WES) and RNA-Seq. We explain two major strategies for sequence analysis, i.e., de novo assembly and read mapping, followed by the construction of customized protein databases from such data. We also elaborate on the procedures for spectrum-to-peptide sequence matching in proteogenomics, and on the relationship between database size and the false discovery rate (FDR). Finally, we discuss the latest developments in proteogenomics-assisted precision oncology, as well as the challenges and opportunities in proteogenomics research.
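As an illustration of how an in-silico translated variant can yield a peptide-level SAAV signature, the following is a minimal sketch and not the workflow of any particular tool. It assumes a toy coding sequence, a single hypothetical nucleotide substitution, the standard genetic code, and a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages). All function names and sequences are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def apply_snv(cds, pos, ref, alt):
    """Return the CDS with a single-nucleotide variant at 0-based position pos."""
    if cds[pos] != ref:
        raise ValueError("reference base does not match the CDS")
    return cds[:pos] + alt + cds[pos + 1:]


def translate(cds):
    """Translate a CDS in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def tryptic_peptides(protein):
    """Simplified tryptic digest: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Toy reference CDS (hypothetical) and a hypothetical A>C substitution at position 19.
reference_cds = "ATGGCTGGAAAACTGGTGGAACGTTCTACCAAATAA"
variant_cds = apply_snv(reference_cds, pos=19, ref="A", alt="C")

reference_protein = translate(reference_cds)
variant_protein = translate(variant_cds)

# Peptides present only in the variant proteoform are the candidates that a
# customized database adds on top of the reference database.
variant_only = sorted(
    set(tryptic_peptides(variant_protein)) - set(tryptic_peptides(reference_protein))
)
print(reference_protein, variant_protein, variant_only)
```

In this toy case the substitution changes one residue, so only one tryptic peptide distinguishes the variant proteoform from the reference. Every variant entry appended to a database enlarges the search space, which is why database size has to be considered together with FDR control.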