Yang Andrian, Tang Joshua Y S, Troup Michael, Ho Joshua W K
Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia.
St. Vincent's Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia.
F1000Res. 2019 Sep 4;8:1587. doi: 10.12688/f1000research.19426.2. eCollection 2019.
Read alignment is an important step in RNA-seq analysis because its results form the basis for downstream analyses. However, recent studies have shown that published alignment tools vary in mapping sensitivity and do not necessarily align all the reads that should have been aligned, a problem we term the false-negative non-alignment problem. Here we present Scavenger, a Python-based bioinformatics pipeline for recovering unaligned reads using a novel mechanism in which a putative alignment location is discovered based on sequence similarity between aligned and unaligned reads. We show that Scavenger can recover unaligned reads in a range of simulated and real RNA-seq datasets, including single-cell RNA-seq data. Recovered reads tend to contain more genetic variants with respect to the reference genome than previously aligned reads, indicating that divergence between personal and reference genomes plays a role in the false-negative non-alignment problem. Even when the number of recovered reads is small relative to the total number of reads, adding these recovered reads can affect downstream analyses, especially estimates of expression and differential expression for lowly expressed genes such as pseudogenes.
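The recovery idea described above, borrowing a putative location from an already aligned read with a similar sequence, can be illustrated with a toy sketch. This is not Scavenger's implementation: the k-mer voting scheme, the function names (`kmers`, `build_index`, `putative_location`), the `min_shared` threshold, and the example reads are all assumptions chosen for illustration, and a real pipeline would still need to re-align the read at the proposed location.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(aligned, k):
    """Map each k-mer to the ids of the aligned reads that contain it."""
    index = defaultdict(set)
    for read_id, (seq, _chrom, _pos) in aligned.items():
        for km in kmers(seq, k):
            index[km].add(read_id)
    return index


def putative_location(read, aligned, index, k, min_shared=3):
    """Propose (chrom, pos) from the aligned read sharing the most k-mers.

    Returns None when no aligned read shares at least min_shared k-mers.
    The result is only a candidate region, not a verified alignment.
    """
    votes = defaultdict(int)
    for km in kmers(read, k):
        for read_id in index.get(km, ()):
            votes[read_id] += 1
    if not votes:
        return None
    # Sort ids first so that ties are broken deterministically.
    best = max(sorted(votes), key=votes.get)
    if votes[best] < min_shared:
        return None
    _seq, chrom, pos = aligned[best]
    return chrom, pos


# Hypothetical aligned reads: id -> (sequence, chromosome, position).
aligned = {
    "r1": ("ACGTACGGTCAGTTCA", "chr1", 1000),
    "r2": ("TTGACCATGGCAATCG", "chr2", 5000),
}
index = build_index(aligned, k=5)

# An unaligned read that differs from r1 by a single base (a variant).
variant_read = "ACGTACGGTCAGATCA"
print(putative_location(variant_read, aligned, index, k=5))
```

In this toy case the variant read still shares most of its 5-mers with `r1`, so it inherits `r1`'s location as a candidate, while a read with no shared k-mers gets no candidate at all. This mirrors the abstract's observation that recovered reads tend to carry more variants relative to the reference.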