Université Paris Saclay, CEA, INRAE, Département Médicaments et Technologies pour la Santé, SPI, 30200, Bagnols-sur-Cèze, France.
INRAE, UR RiverLY Laboratoire d'écotoxicologie, Centre de Lyon-Villeurbanne, Villeurbanne, F-69625, France.
Proteomics. 2020 May;20(10):e1900261. doi: 10.1002/pmic.201900261. Epub 2020 May 18.
Proteogenomics is gaining momentum now that genomics, transcriptomics, and proteomics can be readily performed on any new species. This approach allows key alterations to molecular pathways to be identified when comparing conditions. For animals and plants, RNA-seq-informed proteomics is the most popular means of interpreting tandem mass spectrometry spectra acquired for species whose genome has not yet been sequenced. It relies on high-performance de novo RNA-seq assembly and optimized translation strategies. Here, several pre-treatments of Illumina RNA-seq reads before assembly are explored, with the aim of translating the resulting contigs into useful polypeptide sequences. Experimental transcriptomics and proteomics datasets acquired for individual Gammarus fossarum freshwater crustaceans are used; the most relevant procedure is defined by the ratio of MS/MS spectra assigned to peptide sequences. Removing reads with a mean quality score of less than 17 (which represents a single probable nucleotide error on 150-bp reads) prior to assembly increases the proteomics outcome. The best translation using TransDecoder is achieved with a minimal open reading frame (ORF) length of 50 amino acids and systematic selection of ORFs longer than 900 nucleotides. Using these parameters, transcriptome assembly and translation informed by proteomics pave the way to further improvements in proteogenomics.
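The read pre-treatment described in the abstract (discarding reads whose mean quality score is below 17 before assembly) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the Phred+33 encoding, the use of an arithmetic mean over per-base scores, and the strict four-line FASTQ layout are all assumptions made for illustration. Paired-end data would additionally need both mates filtered in step, which this sketch does not handle.

```python
from statistics import mean

MIN_MEAN_QUALITY = 17  # threshold reported in the abstract
PHRED_OFFSET = 33      # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def mean_quality(quality_line, offset=PHRED_OFFSET):
    """Arithmetic mean of the per-base Phred scores of one read (assumed metric)."""
    return mean(ord(ch) - offset for ch in quality_line)


def filter_reads(fastq_lines, min_mean_quality=MIN_MEAN_QUALITY):
    """Yield (header, sequence, quality) for reads at or above the threshold.

    Hypothetical helper: assumes strict four-line FASTQ records
    (header, sequence, '+' separator, quality) with no wrapped lines.
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Reads with a mean score strictly below the threshold are dropped.
        if qual and mean_quality(qual) >= min_mean_quality:
            yield header, seq, qual


if __name__ == "__main__":
    example = [
        "@read_high", "ACGT", "+", "IIII",  # 'I' encodes Phred 40
        "@read_low", "ACGT", "+", "####",   # '#' encodes Phred 2
    ]
    for header, seq, qual in filter_reads(example):
        print(header, seq, round(mean_quality(qual), 1))
```

In practice this step would normally be delegated to an established read-trimming tool; the sketch only makes the filtering criterion explicit.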