Eman Alzaid, Achraf El Allali
Computer Science Department, King Saud University, Riyadh, Saudi Arabia.
Department of Computer Science, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia.
Bioinform Biol Insights. 2020 Jan 20;14:1177932219892957. doi: 10.1177/1177932219892957. eCollection 2020.
Genomic structural variations are significant causes of genome diversity and complex diseases. With advances in sequencing technologies, many algorithms have been designed to identify structural differences using next-generation sequencing (NGS) data. Because of repetitive regions in the human genome and the short reads produced by NGS, state-of-the-art callers do not always detect structural variants (SVs) accurately. To improve performance, multiple SV callers are often used together to detect variants. However, most SV callers suffer from high false-positive rates, which reduce overall performance, especially in low-coverage genomes. In this article, we propose a classification-based post-processing algorithm that can be used to filter the structural variation predictions produced by SV callers. Novel features are defined for each putative SV prediction from the reads in the local regions around its breakpoints. Several classifiers are employed to classify the candidate predictions and remove false positives. We test our classifier models on simulated and real genomes and show that the proposed approach improves the performance of state-of-the-art algorithms.
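The filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Read` record, the two features (fraction of soft-clipped reads and fraction of discordant read pairs near a breakpoint), the 100 bp window, and the nearest-centroid classifier are all assumptions chosen for illustration, since the abstract does not specify the actual features or classifiers.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Features = Tuple[float, float]


@dataclass
class Read:
    """Hypothetical, simplified view of one aligned read."""
    start: int           # leftmost aligned position
    end: int             # rightmost aligned position (exclusive)
    soft_clipped: bool   # alignment has a clipped end
    discordant: bool     # mate has unexpected distance or orientation


def breakpoint_features(reads: Sequence[Read], breakpoint: int,
                        window: int = 100) -> Features:
    """Summarise the reads overlapping [breakpoint - window, breakpoint + window)."""
    lo, hi = breakpoint - window, breakpoint + window
    local = [r for r in reads if r.start < hi and r.end > lo]
    if not local:
        return (0.0, 0.0)
    n = len(local)
    clipped = sum(r.soft_clipped for r in local) / n
    discordant = sum(r.discordant for r in local) / n
    return (clipped, discordant)


def centroid(rows: List[Features]) -> Features:
    """Column-wise mean of a non-empty list of feature vectors."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def train(features: List[Features], labels: List[bool]) -> Tuple[Features, Features]:
    """Fit a nearest-centroid model: one centroid for true SVs, one for false positives."""
    pos = [f for f, y in zip(features, labels) if y]
    neg = [f for f, y in zip(features, labels) if not y]
    return centroid(pos), centroid(neg)


def keep_call(model: Tuple[Features, Features], feat: Features) -> bool:
    """Keep a candidate SV if it lies at least as close to the true-SV centroid."""
    pos_c, neg_c = model
    d_pos = sum((a - b) ** 2 for a, b in zip(feat, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(feat, neg_c))
    return d_pos <= d_neg


# Toy training set: labelled feature vectors for candidate calls (made-up values).
model = train([(0.6, 0.7), (0.5, 0.5), (0.0, 0.1), (0.1, 0.0)],
              [True, True, False, False])

# Toy candidate with a breakpoint at position 1000.
reads = [Read(900, 1000, True, False), Read(950, 1050, False, True),
         Read(980, 1080, True, True), Read(5000, 5100, False, False)]
candidate = breakpoint_features(reads, breakpoint=1000)
decision = keep_call(model, candidate)
```

In practice the reads would come from an alignment file and the labelled training calls from simulated genomes or a validated call set; any standard classifier could replace the nearest-centroid step.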