Proteogenomic database construction driven from large scale RNA-seq data.

Authors

Woo Sunghee, Cha Seong Won, Merrihew Gennifer, He Yupeng, Castellana Natalie, Guest Clark, MacCoss Michael, Bafna Vineet

Affiliations

Department of Electrical and Computing Engineering, ¶Department of Bioinformatics and Systems Biology, and §Department of Computer Science, University of California, San Diego, La Jolla, California 92093, United States.

Publication information

J Proteome Res. 2014 Jan 3;13(1):21-8. doi: 10.1021/pr400294c. Epub 2013 Jul 17.

DOI:10.1021/pr400294c

PMID:23802565

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4034692/

Abstract

The advent of inexpensive RNA-seq technologies and other deep sequencing technologies for RNA has the promise to radically improve genomic annotation, providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, the integration of large amounts of redundant RNA-seq data and mass spectrometry data poses a challenging problem. Our paper addresses this by construction of a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to 410 MB of splice graph database written in FASTA format. This corresponds to 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study using the custom data set, using a completely automated pipeline, and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse strands, and 42 translated UTRs. Our results highlight the usefulness of transcript + proteomic integration for improved genome annotations.
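
The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the authors' pipeline: it assumes decimal units (1 GB = 1000 MB), which the abstract does not specify, and the variable names are hypothetical.

```python
# Figures quoted in the abstract.
sam_size_gb = 496.2   # aligned RNA-seq SAM files
db_size_mb = 410.0    # splice graph database in FASTA format

# Assumption: decimal units (1 GB = 1000 MB); the abstract does not say.
compression_ratio = sam_size_gb * 1000 / db_size_mb

# Novel event categories and counts as listed in the abstract.
novel_events = {
    "novel genes": 215,
    "novel exons": 808,
    "alternative splicings": 12,
    "gene-boundary corrections": 618,
    "exon-boundary changes": 245,
    "frame shifts": 938,
    "reverse strands": 1166,
    "translated UTRs": 42,
}
total_events = sum(novel_events.values())

print(f"compression ratio: about {compression_ratio:.0f}x")
print(f"total novel events: {total_events}")
```

Under that unit assumption the ratio comes out slightly above 1200, consistent with the roughly 1000× compression stated in the abstract, and the eight category counts add up to the reported total of 4044.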

Similar articles

Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys.

BMC Genomics. 2017 Nov 13;18(1):877. doi: 10.1186/s12864-017-4279-0.

NextSearch: A Search Engine for Mass Spectrometry Data against a Compact Nucleotide Exon Graph.

J Proteome Res. 2015 Jul 2;14(7):2784-91. doi: 10.1021/acs.jproteome.5b00047. Epub 2015 Jun 9.

Identification of novel alternative splicing biomarkers for breast cancer with LC/MS/MS and RNA-Seq.

BMC Bioinformatics. 2020 Dec 3;21(Suppl 9):541. doi: 10.1186/s12859-020-03824-8.

C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression.

Nat Genet. 2003 May;34(1):35-41. doi: 10.1038/ng1140.

Quantitative RNA-seq meta-analysis of alternative exon usage in C. elegans.

Genome Res. 2017 Dec;27(12):2120-2128. doi: 10.1101/gr.224626.117. Epub 2017 Oct 31.

Improving Gene Annotation of the Peanut Genome by Integrated Proteogenomics Workflow.

J Proteome Res. 2020 Jun 5;19(6):2226-2235. doi: 10.1021/acs.jproteome.9b00723. Epub 2020 May 15.

Identification of new protein coding sequences and signal peptidase cleavage sites of Helicobacter pylori strain 26695 by proteogenomics.

J Proteomics. 2013 Jun 28;86:27-42. doi: 10.1016/j.jprot.2013.04.036. Epub 2013 May 9.

Identification of Differentially Expressed Splice Variants by the Proteogenomic Pipeline Splicify.

Mol Cell Proteomics. 2017 Oct;16(10):1850-1863. doi: 10.1074/mcp.TIR117.000056. Epub 2017 Jul 26.

Improving Silkworm Genome Annotation Using a Proteogenomics Approach.

J Proteome Res. 2019 Aug 2;18(8):3009-3019. doi: 10.1021/acs.jproteome.8b00965. Epub 2019 Jul 2.

Cited by

Identification of non-canonical peptides with moPepGen.

Nat Biotechnol. 2025 Jun 16. doi: 10.1038/s41587-025-02701-0.

Evaluation of Eukaryotic mRNA Coding Potential.

Methods Mol Biol. 2025;2859:319-331. doi: 10.1007/978-1-0716-4152-1_18.

Moving Toward Metaproteogenomics: A Computational Perspective on Analyzing Microbial Samples via Proteogenomics.

Methods Mol Biol. 2025;2859:297-318. doi: 10.1007/978-1-0716-4152-1_17.

Proteogenomic Approaches for Diseasome Studies.

Methods Mol Biol. 2025;2859:253-264. doi: 10.1007/978-1-0716-4152-1_14.

moPepGen: Rapid and Comprehensive Identification of Non-canonical Peptides.

bioRxiv. 2024 Nov 5:2024.03.28.587261. doi: 10.1101/2024.03.28.587261.

Identification of Novel Genes and Proteoforms in through a Proteogenomic Approach.

Pathogens. 2022 Oct 31;11(11):1273. doi: 10.3390/pathogens11111273.

Micropeptides translated from putative long non-coding RNAs.

Acta Biochim Biophys Sin (Shanghai). 2022 Mar 25;54(3):292-300. doi: 10.3724/abbs.2022010.

Large-scale discovery of non-conventional peptides in grape (Vitis vinifera L.) through peptidogenomics.

Hortic Res. 2022 May 2;9:uhac023. doi: 10.1093/hr/uhac023. eCollection 2022.

A Practical Guide to Small Protein Discovery and Characterization Using Mass Spectrometry.

J Bacteriol. 2022 Jan 18;204(1):e0035321. doi: 10.1128/JB.00353-21. Epub 2021 Nov 8.

Proteoform Identification by Combining RNA-Seq and Top-Down Mass Spectrometry.

J Proteome Res. 2021 Jan 1;20(1):261-269. doi: 10.1021/acs.jproteome.0c00369. Epub 2020 Nov 12.

References

Protein identification using customized protein sequence databases derived from RNA-Seq data.

J Proteome Res. 2012 Feb 3;11(2):1009-17. doi: 10.1021/pr200766z. Epub 2011 Dec 14.

Regulation of Caenorhabditis elegans vitellogenesis by DAF-2/IIS through separable transcriptional and posttranscriptional mechanisms.

BMC Physiol. 2011 Jul 12;11:11. doi: 10.1186/1472-6793-11-11.

Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Nat Biotechnol. 2011 May 15;29(7):644-52. doi: 10.1038/nbt.1883.

A bioinformatics workflow for variant peptide detection in shotgun proteomics.

Mol Cell Proteomics. 2011 May;10(5):M110.006536. doi: 10.1074/mcp.M110.006536. Epub 2011 Mar 9.

Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.

Science. 2010 Dec 24;330(6012):1775-87. doi: 10.1126/science.1196914. Epub 2010 Dec 22.

The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search.

Mol Cell Proteomics. 2010 Dec;9(12):2840-52. doi: 10.1074/mcp.M110.003731. Epub 2010 Sep 9.

Proteogenomics to discover the full coding content of genomes: a computational perspective.

J Proteomics. 2010 Oct 10;73(11):2124-35. doi: 10.1016/j.jprot.2010.06.007. Epub 2010 Jul 8.

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Nat Biotechnol. 2010 May;28(5):511-5. doi: 10.1038/nbt.1621. Epub 2010 May 2.

The 1000 Genomes Project: new opportunities for research and social challenges.

Genome Med. 2010 Jan 21;2(1):3. doi: 10.1186/gm124.

Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1.

Cancer Cell. 2010 Jan 19;17(1):98-110. doi: 10.1016/j.ccr.2009.12.020.
