Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
BMC Genomics. 2013 Aug 7;14:536. doi: 10.1186/1471-2164-14-536.
RNA-seq can be used to measure allele-specific expression (ASE) by assigning sequence reads to individual alleles; however, relative ASE is systematically biased when sequence reads are aligned to a single reference genome. Aligning sequence reads to both parental genomes can eliminate this bias, but this approach is not always practical, especially for non-model organisms. To improve the accuracy of ASE measured using a single reference genome, we identified the properties of differentiating sites that are responsible for biased measures of relative ASE.
We found that clusters of differentiating sites prevented sequence reads from an alternate allele from aligning to the reference genome, causing a bias in relative ASE favoring the reference allele. This bias increased with greater sequence divergence between alleles. Two changes together almost completely eliminated this systematic bias: increasing the number of mismatches allowed when aligning sequence reads to the reference genome, and restricting analysis to genomic regions with fewer differentiating sites than the number of mismatches allowed. Accuracy of allelic abundance was increased further by excluding differentiating sites within sequence reads that could not be aligned uniquely within the genome (imperfect mappability) and reads that overlapped one or more insertions or deletions (indels) between alleles.
After aligning sequence reads to a single reference genome, we excluded differentiating sites that had (i) at least as many neighboring differentiating sites as the number of mismatches allowed, (ii) imperfect mappability, and/or (iii) one or more indels nearby. The resulting measures of allelic abundance were comparable to those derived from aligning sequence reads to both parental genomes.