Lee Soohyun, Seo Chae Hwa, Alver Burak Han, Lee Sanghyuk, Park Peter J
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Emerging Technology Center, DNA link, Seoul, South Korea.
BMC Bioinformatics. 2015 Sep 3;16:278. doi: 10.1186/s12859-015-0704-z.
RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consist of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to trade accuracy against computational cost.
We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods.
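The grouping and estimation steps described above can be illustrated with a small sketch. This is not EMSAR's implementation: the function names (`group_reads`, `poisson_em`), the input format, and the use of a plain EM iteration are assumptions made for illustration. The sketch assumes each segment `s` (a set of transcripts sharing the same mappability pattern) has a known effective length `seg_len[s]` and that its read count follows a Poisson distribution with mean `seg_len[s]` times the sum of the abundances of the transcripts in `s`.

```python
from collections import Counter


def group_reads(read_hits):
    """Count reads by the set of transcripts each read maps to.

    read_hits: iterable of transcript-ID collections, one per read
    (hypothetical input format). Unmapped reads are skipped.
    """
    return Counter(frozenset(hits) for hits in read_hits if hits)


def poisson_em(counts, seg_len, n_iter=200):
    """Maximum likelihood abundances under a joint Poisson model, via EM.

    counts:  {frozenset of transcripts: read count}
    seg_len: {frozenset of transcripts: effective segment length (> 0)}
    Returns {transcript: expected reads per unit of segment length}.
    """
    transcripts = sorted(set().union(*seg_len))
    # Total segment length in which each transcript can generate reads.
    cover = {t: sum(l for s, l in seg_len.items() if t in s) for t in transcripts}
    theta = {t: 1.0 for t in transcripts}

    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        for s in seg_len:
            n = counts.get(s, 0)
            if n == 0:
                continue
            total = sum(theta[t] for t in s)
            # E-step: split the segment's reads in proportion to abundance.
            for t in s:
                assigned[t] += n * theta[t] / total
        # M-step: reads attributed to a transcript per unit of covered length.
        theta = {t: assigned[t] / cover[t] for t in transcripts}
    return theta


# Toy example: two transcripts with one unique segment each and one shared segment.
reads = [["A"]] * 100 + [["B"]] * 200 + [["A", "B"]] * 300
counts = group_reads(reads)
seg_len = {
    frozenset({"A"}): 100,
    frozenset({"B"}): 100,
    frozenset({"A", "B"}): 100,
}
theta = poisson_em(counts, seg_len)
print(theta)  # converges toward A ≈ 1.0, B ≈ 2.0
```

Reads from the shared segment are not discarded here; they are split between the two transcripts according to the current abundance estimates, which is one simple way to use multi-mapped reads in a likelihood-based estimate.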
EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar.