Woo Sunghee, Cha Seong Won, Merrihew Gennifer, He Yupeng, Castellana Natalie, Guest Clark, MacCoss Michael, Bafna Vineet
Department of Electrical and Computing Engineering, Department of Bioinformatics and Systems Biology, and Department of Computer Science, University of California, San Diego, La Jolla, California 92093, United States.
J Proteome Res. 2014 Jan 3;13(1):21-8. doi: 10.1021/pr400294c. Epub 2013 Jul 17.
The advent of inexpensive RNA-seq and other deep sequencing technologies for RNA promises to radically improve genomic annotation by providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, integrating large amounts of redundant RNA-seq data with mass spectrometry data poses a challenging problem. Our paper addresses this by constructing a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to a 410 MB splice graph database written in FASTA format. This corresponds to roughly 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study on the custom data set using a completely automated pipeline and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicing events, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse-strand events, and 42 translated UTRs. Our results highlight the usefulness of integrating transcriptomic and proteomic data for improved genome annotations.
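The redundancy that makes this kind of compression possible can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`blocks_and_junctions`, `merge_intervals`, `collapse`) and the toy reads are hypothetical, and the sketch assumes all alignments lie on a single chromosome and strand. It parses SAM-style CIGAR strings, merges overlapping exonic blocks, and keeps each splice junction once with a support count, so many reads collapse into a few graph elements.

```python
import re
from collections import Counter

# SAM CIGAR operations: a length followed by a one-letter operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def blocks_and_junctions(pos, cigar):
    """Split one alignment into exonic blocks and intron junctions.

    pos is the 1-based leftmost reference position, as in SAM.
    All intervals returned are 1-based and inclusive.
    """
    blocks, junctions = [], []
    ref = pos            # next reference position to be consumed
    block_start = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            ref += length                      # consumes reference inside an exon block
        elif op == "N":
            if ref > block_start:
                blocks.append((block_start, ref - 1))
            junctions.append((ref, ref + length - 1))  # skipped (intron) interval
            ref += length
            block_start = ref
        # I, S, H and P do not consume reference positions
    if ref > block_start:
        blocks.append((block_start, ref - 1))
    return blocks, junctions


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def collapse(alignments):
    """Collapse (pos, cigar) alignments into merged exons and counted junctions."""
    all_blocks = []
    junction_support = Counter()
    for pos, cigar in alignments:
        blocks, junctions = blocks_and_junctions(pos, cigar)
        all_blocks.extend(blocks)
        junction_support.update(junctions)
    return merge_intervals(all_blocks), junction_support


if __name__ == "__main__":
    # Hypothetical toy reads: one unspliced, three spanning the same intron.
    reads = [(100, "50M"), (120, "30M200N20M"),
             (120, "30M200N20M"), (130, "20M200N30M")]
    exons, junctions = collapse(reads)
    print(exons)
    print(dict(junctions))
```

In this toy input, four alignments reduce to two exonic intervals and one junction. A real splice graph database would additionally need to handle multiple chromosomes and strands and to emit the translated sequences in FASTA format.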