Penn Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA.
Bioinformatics. 2011 Sep 15;27(18):2518-28. doi: 10.1093/bioinformatics/btr427. Epub 2011 Jul 19.
A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. Several RNA-Seq alignment algorithms are available, each claiming to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics, RNA-Seq data can be simulated accurately enough to enable meaningful benchmarking of alignment algorithms. A rigorous comparison of all viable published RNA-Seq alignment algorithms has not previously been performed.
We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription-polymerase chain reaction (RT-PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
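Because simulated reads come with known true positions, accuracy at the base and junction levels can be scored directly. The following Python sketch illustrates one plausible way to do this; it is not the scoring code used in the article. The function names, the input layout (read ID mapped to one genomic coordinate per read base) and the choice of reporting junction precision and recall are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical junction key: (chromosome, last exon base, first base of next exon).
Junction = Tuple[str, int, int]


def base_level_accuracy(
    truth: Dict[str, List[int]],
    aligned: Dict[str, List[Optional[int]]],
) -> Dict[str, float]:
    """Fraction of simulated bases aligned correctly, incorrectly, or not at all.

    `truth` maps read ID -> true genomic coordinate of each base.
    `aligned` maps read ID -> reported coordinate of each base (None if that
    base was left unaligned). Reads missing from `aligned` are unaligned.
    """
    total = correct = wrong = 0
    for read_id, true_positions in truth.items():
        total += len(true_positions)
        called = aligned.get(read_id, [])
        for true_pos, called_pos in zip(true_positions, called):
            if called_pos is None:
                continue
            if called_pos == true_pos:
                correct += 1
            else:
                wrong += 1
    if total == 0:
        return {"correct": 0.0, "wrong": 0.0, "unaligned": 0.0}
    unaligned = total - correct - wrong
    return {
        "correct": correct / total,
        "wrong": wrong / total,
        "unaligned": unaligned / total,
    }


def junction_level_accuracy(
    true_junctions: Set[Junction],
    called_junctions: Set[Junction],
) -> Dict[str, float]:
    """Precision and recall of called splice junctions against the simulated truth."""
    true_positives = len(true_junctions & called_junctions)
    precision = true_positives / len(called_junctions) if called_junctions else 0.0
    recall = true_positives / len(true_junctions) if true_junctions else 0.0
    return {"precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy data: read r1 spans a junction and its last base is misplaced; r2 is unaligned.
    truth = {"r1": [100, 101, 102, 203], "r2": [50, 51, 52, 53]}
    aligned = {"r1": [100, 101, 102, 204]}
    print(base_level_accuracy(truth, aligned))

    true_j = {("chr1", 102, 203), ("chr1", 500, 800)}
    called_j = {("chr1", 102, 203), ("chr1", 102, 204)}
    print(junction_level_accuracy(true_j, called_j))
```

Scoring bases and junctions separately matters because an aligner can place most bases correctly while still reporting a junction that is off by one position, as in the toy data above.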
The RUM pipeline is distributed via the Amazon Cloud and for computing clusters that use the Sun Grid Engine (http://cbil.upenn.edu/RUM).
ggrant@pcbi.upenn.edu; epierce@mail.med.upenn.edu
The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.