Deschamps-Francoeur Gabrielle, Simoneau Joël, Scott Michelle S
Département de Biochimie et Génomique Fonctionnelle, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC J1E 4K8, Canada.
Comput Struct Biotechnol J. 2020 Jun 12;18:1569-1576. doi: 10.1016/j.csbj.2020.06.014. eCollection 2020.
Many eukaryotic genomes harbour large numbers of duplicated sequences, of diverse biotypes, resulting from several mechanisms including recombination, whole genome duplication and transposition. Such repeated sequences complicate gene/transcript quantification during RNA-seq analysis due to reads mapping to more than one locus, sometimes involving genes embedded in other genes. Genes of different biotypes have dissimilar levels of sequence duplication, with long-noncoding RNAs and messenger RNAs sharing less sequence similarity to other genes than biotypes encoding shorter RNAs. Many strategies have been elaborated to handle these multi-mapped reads, resulting in increased accuracy in gene/transcript quantification, although separate tools are typically used to estimate the abundance of short and long genes due to their dissimilar characteristics. This review discusses the mechanisms leading to sequence duplication, the biotypes affected, the computational strategies employed to deal with multi-mapped reads and the challenges that still remain to be overcome.