Wen Bo, Xu Shaohang, Zhou Ruo, Zhang Bing, Wang Xiaojing, Liu Xin, Xu Xun, Liu Siqi
BGI-Shenzhen, Shenzhen, 518083, China.
Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA.
BMC Bioinformatics. 2016 Jun 17;17(1):244. doi: 10.1186/s12859-016-1133-3.
Peptide identification based on mass spectrometry (MS) is generally achieved by comparing experimental mass spectra with theoretical peptides obtained by in silico digestion of a reference protein database. This strategy cannot identify peptide or protein sequences that are absent from the reference database. Customized protein databases built from RNA-Seq data have therefore been proposed to improve the identification of novel peptides. Accordingly, a comprehensive pipeline that provides an end-to-end solution for novel peptide detection with a customized protein database is needed.
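The limitation described above can be illustrated with a minimal sketch. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P) and compares peptide sequences directly, whereas real search engines score spectra against theoretical masses and fragment ions. The function name `tryptic_digest`, the toy protein, and the observed peptides are all hypothetical and are not part of PGA.

```python
import re

def tryptic_digest(protein, min_len=6):
    """In silico digestion under a simplified trypsin rule (assumption):
    cleave after K or R, but not when the next residue is P."""
    # Zero-width split; requires Python 3.7+ for splitting on empty matches.
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if len(p) >= min_len]

# Hypothetical one-protein reference database.
reference_db = {"toy_protein": "MASLTVKAAGEEFPLRDDSTNWYKLLPQER"}

# All theoretical peptides derivable from the reference database.
reference_peptides = {
    pep for seq in reference_db.values() for pep in tryptic_digest(seq)
}

# Two hypothetical observed peptides: the second carries an E->D substitution.
observed = ["AAGEEFPLR", "AAGDEFPLR"]
matches = {pep: pep in reference_peptides for pep in observed}

for pep, found in matches.items():
    print(pep, "found in reference" if found else "absent from reference")
```

The variant peptide has no counterpart among the theoretical peptides, so a search restricted to the reference database has nothing to match it against.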
PGA, a pipeline implemented as an R package, was developed to automate the processing of tandem mass spectrometry (MS/MS) data acquired on different MS platforms and the construction of customized protein databases from RNA-Seq data, with or without a reference genome as a guide. PGA identifies novel peptides and generates an HTML-based report with a visual interface. Applied to a published dataset, PGA identified 636 novel peptides, comprising 510 single amino acid polymorphism (SAP) peptides, 2 insertion/deletion (INDEL) peptides, 49 splice junction peptides, and 75 novel transcript-derived peptides. The software is freely available from http://bioconductor.org/packages/PGA/, and example reports are available at http://wenbostar.github.io/PGA/.
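As a rough illustration of how an SAP peptide can enter a customized database, the sketch below applies one amino acid substitution to a protein sequence and extracts the tryptic peptide that spans the variant site. It is written in Python for illustration only and is not PGA's R implementation or API. The helper names, the toy protein, and the variant (E to D at position 11) are hypothetical, and the cleavage rule (after K or R, not before P) is a simplifying assumption.

```python
def apply_sap(protein, position, ref_aa, alt_aa):
    """Substitute one residue at a 1-based position, checking the reference residue."""
    if protein[position - 1] != ref_aa:
        raise ValueError(
            f"expected {ref_aa} at position {position}, found {protein[position - 1]}"
        )
    return protein[:position - 1] + alt_aa + protein[position:]

def is_cleavage_site(protein, i):
    """True if the bond after index i is cleaved (simplified trypsin rule, an assumption)."""
    return protein[i] in "KR" and protein[i + 1] != "P"

def spanning_tryptic_peptide(protein, position):
    """Return the tryptic peptide that covers a 1-based position."""
    idx = position - 1
    start = idx
    # Walk left until the preceding bond is a cleavage site or the N-terminus is reached.
    while start > 0 and not is_cleavage_site(protein, start - 1):
        start -= 1
    end = idx
    # Walk right until the bond after `end` is a cleavage site or the C-terminus is reached.
    while end < len(protein) - 1 and not is_cleavage_site(protein, end):
        end += 1
    return protein[start:end + 1]

# Hypothetical reference protein and a hypothetical variant called from RNA-Seq data.
reference_protein = "MASLTVKAAGEEFPLRDDSTNWYKLLPQER"
variant_protein = apply_sap(reference_protein, 11, "E", "D")

reference_peptide = spanning_tryptic_peptide(reference_protein, 11)
variant_peptide = spanning_tryptic_peptide(variant_protein, 11)

print(reference_peptide, "->", variant_peptide)
```

Adding the variant protein (or only its spanning peptide) to the search database makes the substituted peptide available for matching. INDEL, splice junction, and novel transcript peptides would require different handling that this sketch does not cover.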
PGA, designed to be platform-independent and easy to use, was shown to identify novel peptides by searching a customized protein database derived from RNA-Seq data.