Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States.
Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, New York University Langone Health, New York City, New York 10016, United States.
J Proteome Res. 2018 Nov 2;17(11):3681-3692. doi: 10.1021/acs.jproteome.8b00295. Epub 2018 Oct 19.
Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analysis of specific specimens is currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-accuracy mass spectrometry proteomics. This enables the assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target-decoy database matching calibrated using sample-specific controls. Its current implementation includes automatic integration with MaxQuant mass spectrometry proteomics algorithms. We applied this method for the proteogenomic analysis of splicing factor SRSF2 mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for current state-of-the-art implementations of SEQUEST HT, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow within a Singularity container for one-step installation in diverse computing environments, thereby enabling open, scalable, and facile discovery of sample-specific, non-canonical, and neomorphic biological proteomes.
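To illustrate the target-decoy matching mentioned in the abstract, the following is a minimal, generic sketch of how a false discovery rate can be estimated from decoy hits. It is not the ProteomeGenerator or MaxQuant implementation, and it does not model the sample-specific calibration the authors describe. The `PSM` class, the `q_values` and `accept_targets` functions, and the 1% default threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """A peptide-spectrum match (hypothetical container for this sketch)."""
    peptide: str
    score: float      # assumption: a higher score means a better match
    is_decoy: bool    # True if the match came from the decoy database


def q_values(psms):
    """Return (psm, q_value) pairs, sorted by descending score.

    The FDR at each score threshold is estimated as decoys / targets
    among all matches scoring at or above that threshold.
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    fdrs = []
    targets = decoys = 0
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q-value is the smallest FDR at this threshold or any looser one,
    # which makes it monotonic in the score ranking.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()
    return list(zip(ranked, qs))


def accept_targets(psms, max_fdr=0.01):
    """Keep target matches whose q-value is within the FDR threshold."""
    return [p for p, q in q_values(psms) if not p.is_decoy and q <= max_fdr]
```

For example, given five matches with scores 10, 9, 8, 7, and 6, where the third and fifth are decoys, `accept_targets` keeps only the first two at the default threshold, because admitting the fourth match would also admit a decoy scoring above it.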