Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data.

Author information

Woo Sunghee, Cha Seong Won, Na Seungjin, Guest Clark, Liu Tao, Smith Richard D, Rodland Karin D, Payne Samuel, Bafna Vineet

Affiliations

Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA.

Publication information

Proteomics. 2014 Dec;14(23-24):2719-30. doi: 10.1002/pmic.201400206. Epub 2014 Nov 17.

Abstract

Cancer is driven by the acquisition of somatic DNA lesions. Distinguishing the early driver mutations from subsequent passenger mutations is key to molecular subtyping of cancers, understanding cancer progression, and the discovery of novel biomarkers. Advances in genomics technologies (whole-genome, exome, and transcript sequencing, collectively referred to as next-generation sequencing (NGS)) have fueled recent studies on somatic mutation discovery. However, this vision is challenged by the complexity, redundancy, and errors in genomic data, and by the difficulty of investigating the proteome-translated portion of aberrant genes using only genomic approaches. Combinations of proteomic and genomic technologies are increasingly being employed. Various strategies have been employed to allow the use of large-scale NGS data in conventional MS/MS searches. This paper discusses different strategies for large database search and false discovery rate (FDR)-based error control, and their implications for cancer proteogenomics. Moreover, it extends and develops the idea of a unified genomic variant database that can be searched by any MS sample. A total of 879 BAM files downloaded from the TCGA repository were used to create a 4.34 GB unified FASTA database that contained 2,787,062 novel splice junctions, 38,464 deletions, 1,105 insertions, and 182,302 substitutions. Proteomic data from a single ovarian carcinoma sample (439,858 spectra) were searched against the database. By applying the most conservative FDR measure, we identified 524 novel peptides and 65,578 known peptides at a 1% FDR threshold. The novel peptides include interesting examples of doubly mutated peptides, frame-shifts, and nonsample-recruited mutations, which emphasize the strength of our approach.
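
The abstract does not specify which FDR measure the authors consider "most conservative". As a generic illustration only, the sketch below shows a common target-decoy estimate, in which the FDR at a score cutoff is approximated as the number of decoy matches divided by the number of target matches at or above that cutoff. The function name `cutoff_at_fdr` and the `(score, is_decoy)` input format are hypothetical and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def cutoff_at_fdr(
    psms: List[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[Tuple[float, int]]:
    """Find the most permissive score cutoff with estimated FDR <= max_fdr.

    psms: (score, is_decoy) pairs, one per peptide-spectrum match;
          a higher score is assumed to mean a better match (assumption).
    Returns (cutoff_score, accepted_target_count), or None if no cutoff
    satisfies the threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    # Walk down the ranked list one distinct score at a time, so that
    # tied scores are always accepted or rejected together.
    for score, group in groupby(ranked, key=lambda p: p[0]):
        for _, is_decoy in group:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
        # Simple target-decoy estimate: FDR ~= decoys / targets.
        if targets > 0 and decoys / targets <= max_fdr:
            best = (score, targets)
    return best


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(cutoff_at_fdr(example, max_fdr=0.0))
```

In a proteogenomic setting, one conservative option is to run an estimate like this separately on novel and known peptide matches, since the novel class is usually much smaller and more error-prone; whether the paper does exactly this is not stated in the abstract.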




