Woo Sunghee, Cha Seong Won, Na Seungjin, Guest Clark, Liu Tao, Smith Richard D, Rodland Karin D, Payne Samuel, Bafna Vineet
Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA.
Proteomics. 2014 Dec;14(23-24):2719-30. doi: 10.1002/pmic.201400206. Epub 2014 Nov 17.
Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular subtyping of cancers, understanding cancer progression, and the discovery of novel biomarkers. Advances in genomics technologies (whole-genome, exome, and transcript sequencing, collectively referred to as next-generation sequencing, NGS) have fueled recent studies on somatic mutation discovery. However, this vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome-translated portion of aberrant genes using only genomic approaches. Combinations of proteomic and genomic technologies are therefore increasingly being employed, and various strategies have been used to make large-scale NGS data usable in conventional MS/MS searches. This paper discusses different strategies for large database search and for false discovery rate (FDR)-based error control, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any MS sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database that contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and nonsample-recruited mutations, which emphasize the strength of our approach.
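The abstract does not say how the 1% FDR threshold was computed. As an illustration only, the following minimal Python sketch assumes a standard target-decoy estimate, in which the FDR at a score cutoff is approximated by the number of decoy matches divided by the number of target matches at or above that cutoff. The function name `score_cutoff_at_fdr` and the `(score, is_decoy)` tuple layout are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Optional, Tuple


def score_cutoff_at_fdr(
    psms: Iterable[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    Assumption (not from the paper): each peptide-spectrum match is a
    (score, is_decoy) pair, higher scores are better, and the FDR at a
    cutoff is estimated as decoys / targets among matches scoring at or
    above that cutoff. Returns None if no cutoff satisfies the bound.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Count all matches that share this score before evaluating the FDR,
        # so that tied scores are accepted or rejected together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy input: three target matches and one decoy match.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
strict_cutoff = score_cutoff_at_fdr(example, max_fdr=0.01)
```

One conservative variant, again an assumption rather than the paper's stated procedure, would be to call such a function separately on the matches to novel peptides and on the matches to known peptides, so that the much larger set of known identifications does not dilute the error estimate for the novel ones.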