Krasnov George Sergeevich, Dmitriev Alexey Alexandrovich, Kudryavtseva Anna Viktorovna, Shargunov Alexander Valerievich, Karpov Dmitry Sergeevich, Uroshlev Leonid Andreevich, Melnikova Natalya Vladimirovna, Blinov Vladimir Mikhailovich, Poverennaya Ekaterina Vladimirovna, Archakov Alexander Ivanovich, Lisitsa Andrey Valerievich, Ponomarenko Elena Alexandrovna
Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 111991 Russia.
Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119121 Russia.
J Proteome Res. 2015 Sep 4;14(9):3729-37. doi: 10.1021/acs.jproteome.5b00490. Epub 2015 Aug 3.
The fundamental mission of the Chromosome-Centric Human Proteome Project (C-HPP) is to study human proteome diversity, including rare variants. Liver tissue, HepG2 cells, and plasma were selected as major objects for C-HPP studies. The proteogenomic approach, a recently introduced technique, is a powerful method for predicting and validating proteoforms that arise from alternative splicing, mutations, and transcript editing. We developed PPLine, a Python-based proteogenomic pipeline that automates the discovery of single-amino-acid polymorphisms (SAPs), indels, and alternatively spliced variants from raw transcriptome and exome sequence data, single-nucleotide polymorphism (SNP) annotation and filtration, and the prediction of proteotypic peptides (available at https://sourceforge.net/projects/ppline). In this work, we performed deep transcriptome sequencing of HepG2 cells and liver tissues using two platforms: Illumina HiSeq and Applied Biosystems SOLiD. Using PPLine, we revealed 7756 SAPs and indels for HepG2 cells and liver (including 659 variants not annotated in dbSNP). We found 17 indels in transcripts associated with the translation of alternate reading frames (ARFs) longer than 300 bp. The ARF products of two genes, SLMO1 and TMEM8A, demonstrate signatures of a caspase-binding domain and a Gcn5-related N-acetyltransferase. Alternative splicing analysis predicted novel proteoforms encoded by 203 (liver) and 475 (HepG2) genes according to both Illumina and SOLiD data. The results of the present work provide a basis for subsequent proteomic studies by the C-HPP consortium.
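As a rough illustration of what SAP discovery from a coding SNP involves, the following minimal sketch translates the affected codon before and after a single-nucleotide substitution and reports the amino-acid change. It is not PPLine's code: the names `translate` and `call_sap`, the 0-based position convention, and the toy coding sequence are assumptions made for this example only, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence; trailing incomplete codons are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def call_sap(cds, pos, alt):
    """Hypothetical helper: return a SAP label such as 'K2E' for a
    substitution at 0-based CDS position `pos`, or None if synonymous."""
    mutated = cds[:pos] + alt + cds[pos + 1:]
    start = (pos // 3) * 3                    # first base of affected codon
    ref_aa = CODON_TABLE[cds[start:start + 3]]
    alt_aa = CODON_TABLE[mutated[start:start + 3]]
    if ref_aa == alt_aa:
        return None                           # synonymous SNP, no SAP
    return f"{ref_aa}{pos // 3 + 1}{alt_aa}"  # 1-based residue number


toy_cds = "ATGAAAGGCTAA"                      # toy sequence: M K G stop
print(translate(toy_cds))
print(call_sap(toy_cds, 3, "G"))              # AAA -> GAA
print(call_sap(toy_cds, 5, "G"))              # AAA -> AAG
```

A real pipeline would additionally have to map genomic SNP coordinates onto transcript coordinates, handle strand and splicing, and filter variant calls by quality; none of that is shown here.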