Górczak Katarzyna, Claesen Jürgen, Burzykowski Tomasz
Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium.
Department of Mathematical and Statistical Methods, Poznań University of Life Sciences, Poznań, Poland.
J Comput Biol. 2020 Aug;27(8):1232-1247. doi: 10.1089/cmb.2019.0272. Epub 2019 Dec 31.
RNA sequencing (RNA-seq) is widely used to study gene, transcript, or exon expression. To quantify the expression level, millions of short sequenced reads must be mapped back to a reference genome or transcriptome. Read mapping finds a location whose sequence is identical or similar to that of the read. Based upon this alignment, expression summaries, that is, read counts, are generated. However, a read may match multiple locations. Such ambiguously mapped reads are often ignored in the analysis, which is a potential loss of information and may bias expression estimates. We present the general principles underlying multiread allocation and unbiased estimation of the expression level of genes, exons, or transcripts in the presence of multi-mapped reads. These principles are derived from a theoretical concept that identifies important sources of information, such as the number of uniquely mapped reads, the total target length, and the length of the shared target regions. We show with simulation studies that methods incorporating some or all of these sources of information estimate the expression levels of genes, exons, and/or transcripts with higher precision and accuracy than methods that do not use this information. Methods that estimate the abundance of genes, exons, and/or transcripts should therefore take these sources of information into account to achieve good precision and accuracy.
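To make the role of the three information sources concrete, the following minimal Python sketch splits a pool of multireads among targets in proportion to each target's density of uniquely mapped reads. It is a generic proportional ("rescue"-style) heuristic written for illustration, not necessarily the estimator studied in the paper. The function name `allocate_multireads`, its parameters, and the example numbers are all hypothetical, and the sketch assumes every target shares one region of the same length.

```python
def allocate_multireads(unique_counts, total_lengths, shared_length, n_multireads):
    """Split multireads among targets in proportion to unique-read density.

    Illustrative assumption: all targets share a single region of
    `shared_length` bases, and the density of uniquely mapped reads on the
    non-shared part of a target is a proxy for its expression level.
    """
    densities = {}
    for name, count in unique_counts.items():
        # Length of the part of the target that can attract unique reads.
        unique_length = total_lengths[name] - shared_length
        if unique_length <= 0:
            raise ValueError(f"target {name!r} has no uniquely mappable region")
        densities[name] = count / unique_length

    total_density = sum(densities.values())
    if total_density == 0:
        # No unique reads anywhere: fall back to an equal split.
        equal_share = n_multireads / len(densities)
        return {name: equal_share for name in densities}

    return {
        name: n_multireads * density / total_density
        for name, density in densities.items()
    }


# Hypothetical example: two targets of 1200 bases sharing a 200-base region.
allocation = allocate_multireads(
    unique_counts={"A": 300, "B": 100},
    total_lengths={"A": 1200, "B": 1200},
    shared_length=200,
    n_multireads=80,
)
```

In this made-up example, target A has three times the unique-read density of target B, so it receives roughly 60 of the 80 multireads and B roughly 20. Discarding the multireads instead would underestimate the counts of both targets.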