Klepikova Anna V, Kasianov Artem S, Chesnokov Mikhail S, Lazarevich Natalia L, Penin Aleksey A, Logacheva Maria
Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.
A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.
PeerJ. 2017 Mar 16;5:e3091. doi: 10.7717/peerj.3091. eCollection 2017.
RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads.
To assess how different methods of removing duplicated reads affect the estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four data analysis protocols were applied to each sample: processing without deduplication, deduplication using the method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single-read or paired-end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is most pronounced for highly expressed genes.
The use of unique molecular identifiers greatly improves the accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects the results of differential gene expression analysis, producing a high proportion of false negative results. Omitting duplicate read removal, in contrast, biases the results towards false positives. Where using MI is not possible, we recommend a paired-end sequencing layout.
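The difference between position-based and MI-based deduplication can be illustrated with a minimal sketch. This is not the authors' scripts and not the SAMtools algorithm; the `Read` record, its fields, and both function names are hypothetical, and the reads are made-up toy data. It assumes that a duplicate is defined either by mapping coordinates alone or by coordinates plus the molecular index.

```python
from collections import namedtuple

# Hypothetical minimal read record: mapping coordinates plus a molecular index (MI).
Read = namedtuple("Read", ["chrom", "pos", "strand", "mi"])


def dedup_by_position(reads):
    """Keep one read per (chrom, pos, strand); the MI is ignored."""
    seen = set()
    kept = []
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def dedup_by_mi(reads):
    """Keep one read per (chrom, pos, strand, MI)."""
    seen = set()
    kept = []
    for r in reads:
        key = (r.chrom, r.pos, r.strand, r.mi)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


# Toy data: three reads share coordinates, but only two share an MI.
reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # same coordinates and MI: a PCR duplicate
    Read("chr1", 100, "+", "TTGA"),  # same coordinates, different MI: a distinct molecule
    Read("chr1", 250, "+", "ACGT"),
]

n_raw = len(reads)
n_position = len(dedup_by_position(reads))
n_mi = len(dedup_by_mi(reads))
print(n_raw, n_position, n_mi)
```

In this toy case, position-only deduplication collapses reads that originate from distinct molecules, which is the kind of undercounting the abstract reports for highly expressed genes, whereas keeping all reads counts the PCR duplicate as an extra molecule.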