Epithelial Systems Biology Laboratory, Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland.
Physiol Genomics. 2020 Oct 1;52(10):485-491. doi: 10.1152/physiolgenomics.00048.2020. Epub 2020 Aug 31.
Long noncoding RNAs (lncRNAs) are intracellular transcripts that are longer than 200 nucleotides and lack protein-coding information. A subclass of lncRNAs, the long intergenic noncoding RNAs (lincRNAs), is transcribed from genomic regions that do not overlap annotated protein-coding genes. Increasing evidence shows that some annotated lincRNA transcripts do in fact contain open reading frames (ORFs) that encode functional short peptides in the cell. Few robust methods for identifying lincRNA-encoded peptides have been reported, and the tissue-specific expression of these peptides is largely unexplored. Here we propose an integrative workflow for lincRNA-encoded peptide discovery and test it on the mouse kidney inner medulla (IM). In brief, low-molecular-weight protein fractions were enriched from IM homogenates and digested with trypsin into shorter peptides, which were sequenced by high-resolution liquid chromatography-tandem mass spectrometry (LC-MS/MS). To build a database of hypothetical lincRNA-encoded peptides for peptide-spectrum matching, we performed RNA-Seq on IMs, computationally removed reads overlapping annotated protein-coding genes, and remapped the remaining reads to a database of mouse noncoding transcripts to infer lincRNA expression. Expressed lincRNAs were searched for ORFs with an existing rule-based algorithm, and the translated ORFs were used for peptide-spectrum matching. Peptides identified by LC-MS/MS were further evaluated using several quality control criteria and bioinformatics methods. We discovered three novel lincRNA-encoded peptides, which are conserved in mouse, rat, and human. The workflow can be adapted to discover small protein-coding genes in any species or tissue for which noncoding transcriptome information is available.
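The database-building step (ORF search, translation, and in silico tryptic digestion) can be illustrated with a minimal Python sketch. It is not the rule-based algorithm used in the study, which the abstract does not name. The function names `find_orfs` and `tryptic_digest`, the forward-strand-only scan, the requirement of an ATG start and an in-frame stop codon, the length cutoffs, and the "cleave after K or R unless followed by P" trypsin rule are all assumptions made for illustration.

```python
import re

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_orfs(transcript, min_aa=10):
    """Translate ATG-to-stop ORFs in the three forward frames (assumed rule).

    Only ORFs terminated by an in-frame stop codon and at least `min_aa`
    residues long are returned. Both thresholds are illustrative.
    """
    seq = transcript.upper().replace("U", "T")
    peptides = []
    for frame in range(3):
        current = None  # residues of the ORF being extended, or None
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X = ambiguous codon
            if current is None:
                if aa == "M":
                    current = ["M"]  # first ATG after the previous stop
            elif aa == "*":
                if len(current) >= min_aa:
                    peptides.append("".join(current))
                current = None
            else:
                current.append(aa)
    return peptides


def tryptic_digest(protein, min_len=6):
    """Split after K or R, except before P; drop very short fragments."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]


if __name__ == "__main__":
    # Hypothetical toy transcript, not a real lincRNA sequence.
    toy = "CCATGGCTGCTAAACGTCCTCTGCTGAAAGGTGGTTGGCGTTAA"
    for orf in find_orfs(toy):
        print(orf, tryptic_digest(orf, min_len=4))
```

In a real analysis, the translated ORFs would be written to a FASTA file and passed to a peptide-spectrum matching search engine, which performs the digestion and scoring itself. The digestion function here only shows which tryptic peptides a candidate ORF could contribute.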