Institute for Molecular Bioscience, The University of Queensland, St Lucia, Brisbane, Queensland 4072, Australia.
Bioinformatics. 2012 Dec 1;28(23):3042-50. doi: 10.1093/bioinformatics/bts582. Epub 2012 Oct 7.
Comparing transcriptomic data with proteomic data to identify protein-coding sequences is a long-standing challenge in molecular biology, one that is exacerbated by the increasing size of high-throughput datasets. To address this challenge, and thereby to improve the quality of genome annotation and understanding of genome biology, we have developed an integrated suite of programs, called Pinstripe. We demonstrate its application, utility and discovery power using transcriptomic and proteomic data from publicly available datasets.
To demonstrate the efficacy of Pinstripe for large-scale analysis, we applied Pinstripe's reverse peptide mapping pipeline to a transcript library including de novo assembled transcriptomes from the human Illumina Body Atlas (IBA2) and GENCODE v10 gene annotations, and the EBI Proteomics Identifications Database (PRIDE) peptide database. This analysis identified 736 canonical open reading frames (ORFs) that are supported by three or more PRIDE peptide fragments and positioned outside any known coding DNA sequence (CDS). Because of the unfiltered nature of the PRIDE database and the high probability of false discovery, we further refined this list using independent evidence for translation, including the presence of a Kozak sequence or functional domains, synonymous/non-synonymous substitution ratios and ORF length. Using this integrative approach, we observed evidence of translation from a previously unknown let7e primary transcript, the archetypical lncRNA H19, and a homolog of RD3. Reciprocally, by excluding transcripts with mapped peptides or significant ORFs (>80 codons), we identified 32 187 loci with RNAs longer than 2000 nt that are unlikely to encode proteins.
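The idea of reverse peptide mapping can be illustrated with a minimal Python sketch. This is not Pinstripe's implementation (Pinstripe is written in C#); the function names `find_orfs` and `supported_orfs` are hypothetical, and the sketch assumes forward-strand ATG-to-stop ORFs, exact substring matching of peptides, and an ORF length counted in sense codons (excluding the stop codon). It omits the other evidence described above, such as Kozak sequences, functional domains and substitution ratios.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(transcript, min_codons=1):
    """Yield (start, end, protein) for complete ATG..stop ORFs.

    Assumptions of this sketch: forward strand only, the first ATG after the
    previous stop opens the ORF, and ORFs lacking a stop codon are ignored.
    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    seq = transcript.upper()
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = idx
            elif start is not None and CODON_TABLE.get(codon, "X") == "*":
                protein = "".join(
                    CODON_TABLE.get(c, "X") for c in codons[start:idx]
                )
                if len(protein) >= min_codons:
                    yield frame + 3 * start, frame + 3 * (idx + 1), protein
                start = None


def supported_orfs(transcript, peptides, min_peptides=3, min_codons=1):
    """Return ORFs whose translation contains at least `min_peptides`
    distinct peptides (exact substring match; a simplification)."""
    hits = []
    for start, end, protein in find_orfs(transcript, min_codons):
        matched = {p for p in peptides if p in protein}
        if len(matched) >= min_peptides:
            hits.append((start, end, protein, sorted(matched)))
    return hits
```

Under these assumptions, `supported_orfs(transcript, peptides)` mirrors the "three or more peptide fragments" criterion, while `list(find_orfs(transcript, min_codons=81))` being empty corresponds to a transcript with no ORF longer than 80 codons. A real pipeline would also need to restrict hits to regions outside annotated CDSs and control for false discoveries.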
Pinstripe (pinstripe.matticklab.com) is freely available as source code or as a Mono binary. Pinstripe is written in C# and runs under the Mono framework on Linux or Mac OS X, and under either Mono or .NET on Windows.
m.dinger@garvan.org.au or j.mattick@garvan.org.au
Supplementary data are available at Bioinformatics online.