School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore.
NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, 28 Medical Drive, 117456, Singapore.
Nucleic Acids Res. 2018 Nov 16;46(20):e122. doi: 10.1093/nar/gky685.
Transpositions transfer DNA segments between different loci within a genome; in particular, a transposition that is found in a sample but not in the reference genome is called a non-reference transposition. Transpositions are important structural variations that have clinical impact. They can be called by analyzing second-generation high-throughput sequencing datasets. Current methods follow either a database-based or a database-free approach. Database-based methods require a database of transposable elements. Some of them have good specificity; however, this approach cannot detect novel transpositions, and it requires a good database of transposable elements, which is not yet available for many species. Database-free methods perform de novo calling of transpositions, but their accuracy is low. We observe that this is due to the misalignment of reads: since reads are short and the human genome has many repeats, false alignments create false positive predictions, while missing alignments reduce the true positive rate. This paper proposes new techniques to improve database-free non-reference transposition calling: first, we propose a realignment strategy called one-end remapping that corrects the alignments of reads in interspersed repeats; second, we propose an SNV-aware filter that removes some incorrectly aligned reads. By combining these two techniques with other techniques such as clustering and a positive-to-negative ratio filter, our proposed transposition caller TranSurVeyor shows at least a 3.1-fold improvement in F1-score over existing database-free methods. More importantly, even though TranSurVeyor does not use databases of prior information, its performance is at least as good as that of existing database-based methods such as MELT, Mobster and Retroseq. We also show that TranSurVeyor can discover transpositions that are not known in the current database.
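To make the F1-score comparison concrete, the following minimal Python sketch shows how an F1-score and a fold improvement between two callers can be computed from true positive, false positive and false negative counts. The function names `f1_score` and `fold_improvement` and all counts are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the paper or from TranSurVeyor.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a set of calls."""
    if true_pos == 0:
        # No correct calls: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def fold_improvement(f1_new: float, f1_baseline: float) -> float:
    """Ratio of two F1-scores; 3.1 would mean a 3.1-fold improvement."""
    if f1_baseline <= 0:
        raise ValueError("baseline F1-score must be positive")
    return f1_new / f1_baseline


# Hypothetical counts (assumptions, not results from the paper).
baseline_f1 = f1_score(true_pos=20, false_pos=60, false_neg=80)
improved_f1 = f1_score(true_pos=80, false_pos=20, false_neg=20)
print(f"baseline F1 = {baseline_f1:.3f}, improved F1 = {improved_f1:.3f}")
print(f"fold improvement = {fold_improvement(improved_f1, baseline_f1):.2f}")
```

Because the F1-score penalizes both false positives (from false alignments) and false negatives (from missing alignments), a caller must reduce both kinds of error to achieve a large fold improvement.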