Moscow Institute of Physics and Technology (State University), Dolgoprudny 141700, Moscow Region, Russia.
V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia.
J Proteome Res. 2018 May 4;17(5):1801-1811. doi: 10.1021/acs.jproteome.7b00841. Epub 2018 Apr 16.
The identification of genetically encoded variants at the proteome level is an important problem in cancer proteogenomics. The generation of customized protein databases from DNA or RNA sequencing data is a crucial stage of the identification workflow. Genomic data filtering applied at this stage may significantly modify variant search results, yet its effect is generally left outside the scope of proteogenomic studies. In this work, we examined this effect using exome sequencing data and LC-MS/MS analyses of six replicates for eight melanoma cell lines, processed by a proteogenomics workflow. The main objectives were to identify variant peptides and to reveal the role of genomic data filtering in variant identification. A series of six confidence thresholds for single nucleotide polymorphisms and indels from the exome data was applied to generate customized sequence databases of different stringency. In the searches against unfiltered databases, between 100 and 160 variant peptides were identified for each of the cell lines using the X!Tandem and MS-GF+ search engines. The recovery rate for variant peptides was ∼1%, which is approximately three times lower than that of the wild-type peptides. Using unfiltered genomic databases for variant searches resulted in higher sensitivity and selectivity of the proteogenomic workflow and positively affected the ability to distinguish the cell lines based on variant peptide signatures.
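The database-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Variant` record, its `quality` score, the toy protein sequences, and the six threshold values are all hypothetical and chosen only for illustration. The sketch handles single amino acid substitutions only, not indels. A threshold of 0 plays the role of the unfiltered database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    protein_id: str
    position: int   # 1-based residue position in the protein
    ref: str        # wild-type amino acid
    alt: str        # variant amino acid
    quality: float  # hypothetical confidence score from the variant caller


def filter_variants(variants, min_quality):
    """Keep variants whose confidence score is at least min_quality."""
    return [v for v in variants if v.quality >= min_quality]


def apply_variant(sequence, variant):
    """Return the protein sequence with one amino acid substitution applied."""
    idx = variant.position - 1
    if sequence[idx] != variant.ref:
        raise ValueError("reference residue does not match the sequence")
    return sequence[:idx] + variant.alt + sequence[idx + 1:]


def build_databases(proteins, variants, thresholds):
    """Map each threshold to a dict of variant entry name -> variant sequence."""
    databases = {}
    for t in thresholds:
        entries = {}
        for v in filter_variants(variants, t):
            name = f"{v.protein_id}_{v.ref}{v.position}{v.alt}"
            entries[name] = apply_variant(proteins[v.protein_id], v)
        databases[t] = entries
    return databases


# Toy inputs (assumed, not taken from the study).
proteins = {"P1": "MKTAYIAKQR", "P2": "MSLLNDRGEK"}
variants = [
    Variant("P1", 4, "A", "V", quality=15.0),
    Variant("P1", 8, "K", "E", quality=60.0),
    Variant("P2", 5, "N", "S", quality=250.0),
]
thresholds = [0, 20, 50, 100, 200, 500]  # six hypothetical stringency levels

databases = build_databases(proteins, variants, thresholds)
for t in thresholds:
    print(f"threshold {t}: {len(databases[t])} variant entries")
```

Each resulting database would then be appended to the wild-type sequences and searched separately, so that the number of identified variant peptides can be compared across stringency levels.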