Department of Chemistry, University of Wisconsin-Madison, 1101 University Ave., Madison, Wisconsin 53706, USA.
Mol Cell Proteomics. 2013 Aug;12(8):2341-53. doi: 10.1074/mcp.O113.028142. Epub 2013 Apr 29.
Human proteomic databases required for MS peptide identification are frequently updated and carefully curated, yet they remain incomplete because it has been challenging to acquire every protein sequence from the diverse assemblage of proteoforms expressed in every tissue and cell type. In particular, alternative splicing has been shown to be a major source of this cell-specific proteomic variation. Many new alternative splice forms have been detected at the transcript level using next-generation sequencing methods, especially RNA-Seq, but it is not known how many of these transcripts are translated. Leveraging the capabilities of next-generation sequencing, we collected RNA-Seq and proteomics data from the same cell population (Jurkat cells) and created a bioinformatics pipeline that builds customized databases for the discovery of novel splice-junction peptides. Eighty million paired-end Illumina reads and ∼500,000 tandem mass spectra were used to identify 12,873 transcripts (19,320 including isoforms) and 6,810 proteins. We developed a bioinformatics workflow to retrieve high-confidence, novel splice-junction sequences from the RNA data, translate these sequences into the corresponding polypeptide sequences, and create a customized splice-junction database for MS searching. Based on the RefSeq gene models, we detected 136,123 annotated and 144,818 unannotated transcript junctions. Of the unannotated junctions, 24,834 passed various quality filters (e.g. minimum read depth); these entries were translated into 33,589 polypeptide sequences and used for database searching. We discovered 57 splice-junction peptides not present in the UniProt-TrEMBL proteomic database, comprising an array of different splicing events, including skipped exons, alternative donors and acceptors, and noncanonical transcriptional start sites. To our knowledge, this is the first example of using sample-specific RNA-Seq data to create a splice-junction database and discover new peptides resulting from alternative splicing.
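The workflow described above (filter unannotated junctions, then translate the junction-spanning nucleotide sequence into candidate polypeptides) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary layout of a junction record, the `min_reads` threshold, the three-frame translation, and the `min_len` cutoff are all assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def filter_junctions(junctions, annotated_ids, min_reads=4):
    """Keep unannotated junctions supported by at least min_reads reads.

    min_reads is a hypothetical threshold, not a value from the paper.
    """
    return [
        j for j in junctions
        if j["id"] not in annotated_ids and j["reads"] >= min_reads
    ]


def translate(seq):
    """Translate complete codons of seq; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def junction_peptides(upstream, downstream, min_len=6):
    """Translate a junction-spanning sequence in three forward frames.

    For each frame, return the stop-free polypeptide segment that
    contains residues encoded on both sides of the junction.
    """
    seq = (upstream + downstream).upper()
    junction = len(upstream)  # index of the first downstream nucleotide
    peptides = []
    for frame in range(3):
        protein = translate(seq[frame:])
        left = (junction - 1 - frame) // 3   # codon holding last upstream base
        right = (junction - frame) // 3      # codon holding first downstream base
        if left < 0 or right >= len(protein):
            continue
        start = protein.rfind("*", 0, right + 1) + 1
        if start > left:
            continue  # a stop codon sits at the junction itself
        end = protein.find("*", right)
        if end == -1:
            end = len(protein)
        peptide = protein[start:end]
        if len(peptide) >= min_len:
            peptides.append(peptide)
    return peptides
```

In a fuller sketch, each returned polypeptide would be written as a FASTA entry and appended to the search database; a single junction can contribute more than one entry because several reading frames may remain open across it.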