Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.
Department of Computer Science and Information Technology, Northeast Normal University, Changchun, China.
Bioinformatics. 2019 Apr 1;35(7):1108-1115. doi: 10.1093/bioinformatics/bty748.
The RNA-guided CRISPR/Cas9 system has been widely applied to genome editing and can edit on-target genes effectively. However, it has recently been demonstrated that many homologous off-target genomic sequences can also be mutated, leading to unexpected gene-editing outcomes. A plethora of tools have therefore been proposed for predicting the off-target activities of CRISPR/Cas9. Each computational tool has its own advantages and drawbacks under diverse conditions, and it is unlikely that any single tool is optimal for all of them. Hence, we explore the potential of ensemble learning to synergize multiple tools with genomic annotations and enhance their predictive ability.
We propose an ensemble learning framework that synergizes multiple tools, in different combinations, to predict the off-target activities of CRISPR/Cas9. The ensemble learned using AdaBoost outperformed the individual off-target predictive tools. We also investigated the effect of evolutionary conservation (PhyloP and PhastCons) and chromatin annotations (ChromHMM and Segway) and found that only PhyloP further enhances predictive capability. Case studies were conducted to reveal ensemble insights into off-target prediction, demonstrating how the current study can be applied in different genomic contexts. The best AdaBoost model achieved up to 0.9383 (AUC) and 0.2998 (PRC), outperforming the other classifiers. This is attributable to the fact that AdaBoost introduces a new weak classifier (i.e. a decision stump) in each iteration to learn the DNA sequences that were misclassified as off-targets, iterating until a small error rate is reached.
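The boosting loop described above can be illustrated with a minimal, standard-library sketch of discrete AdaBoost over decision stumps. It is not the authors' implementation (their code is in the GitHub repository); the function names, the toy feature rows (imagined as scores from three hypothetical off-target prediction tools) and the labels (+1 for off-target, -1 otherwise) are all assumptions made for illustration only.

```python
import math

def stump_predict(stump, row):
    """A stump votes `polarity` when the chosen feature is >= threshold, else -polarity."""
    feature, threshold, polarity, _ = stump
    return polarity if row[feature] >= threshold else -polarity

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                candidate = (feature, threshold, polarity, 0.0)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(candidate, xi) != yi)
                if best is None or err < best[3]:
                    best = (feature, threshold, polarity, err)
    return best

def adaboost_score(ensemble, row):
    """Weighted vote of all stumps; the sign gives the predicted class."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)

def adaboost_predict(ensemble, row):
    return 1 if adaboost_score(ensemble, row) >= 0 else -1

def adaboost_fit(X, y, n_rounds=10, target_error=0.0):
    """Add one stump per round until the training error is small enough."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[3], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        wrong = sum(adaboost_predict(ensemble, xi) != yi for xi, yi in zip(X, y))
        if wrong / n <= target_error:
            break
        # Up-weight misclassified samples so the next stump focuses on them.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

# Toy data (hypothetical): each row holds scores from three imagined tools.
TOY_X = [[0.9, 0.8, 0.2], [0.7, 0.3, 0.9], [0.2, 0.9, 0.8],
         [0.1, 0.2, 0.3], [0.6, 0.1, 0.2], [0.3, 0.4, 0.1]]
TOY_Y = [1, 1, 1, -1, -1, -1]

toy_model = adaboost_fit(TOY_X, TOY_Y, n_rounds=20)
print("stumps used:", len(toy_model))
print("training predictions:", [adaboost_predict(toy_model, r) for r in TOY_X])
```

In this sketch each stump thresholds a single tool's score, so the final vote is a weighted combination of the tools; the re-weighting step is what makes each new stump concentrate on the sequences the current ensemble still gets wrong. A real evaluation would use held-out data and metrics such as the AUC and PRC reported above rather than training predictions.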
The source code is freely available on GitHub at https://github.com/Alexzsx/CRISPR.
Supplementary data are available at Bioinformatics online.