College of Engineering, Shantou University, Shantou, 515063, China.
Comput Biol Med. 2024 Feb;169:107932. doi: 10.1016/j.compbiomed.2024.107932. Epub 2024 Jan 1.
Off-target effects of CRISPR/Cas9 can lead to suboptimal genome editing outcomes. Numerous deep learning-based approaches have achieved excellent performance in off-target prediction; however, few can predict off-target activity when the single guide RNA (sgRNA) and target DNA sequence pair contains both mismatches and indels. In addition, data imbalance is a common pitfall in off-target prediction. Moreover, because of the complexity of genomic contexts, building an interpretable model remains challenging. To address these issues, we first developed a BERT-based model, CRISPR-BERT, to improve the prediction of off-target activity involving both mismatches and indels. Second, we proposed an adaptive batch-wise class balancing strategy to combat the noise present in imbalanced off-target data. Finally, we applied a visualization approach to investigate the generalizable, nucleotide position-dependent patterns of sgRNA-DNA pairs that relate to off-target activity. In a comprehensive comparison with existing methods on five mismatch-only datasets and two mismatch-and-indel datasets, CRISPR-BERT achieved the best performance in terms of AUROC and PRAUC. In addition, the visualization analysis demonstrated how the implicit knowledge learned by CRISPR-BERT facilitates off-target prediction, which shows potential for model interpretability. Collectively, CRISPR-BERT provides an accurate and interpretable framework for off-target prediction and can further contribute to sgRNA optimization in practical use, improving target specificity in CRISPR/Cas9 genome editing. The source code is available at https://github.com/BrokenStringx/CRISPR-BERT.
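The abstract does not describe how the batch-wise class balancing works, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes binary labels (1 = off-target active, 0 = inactive) and a fixed share of positives per batch rather than an adaptive one; the function name `balanced_batches` and its parameters are hypothetical.

```python
import random
from typing import Iterator, List, Sequence


def balanced_batches(
    labels: Sequence[int],
    batch_size: int,
    pos_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield lists of sample indices with a fixed share of positives per batch.

    Hypothetical helper (an assumption, not taken from CRISPR-BERT):
    negatives are visited once per epoch without replacement, while the
    rarer positives are drawn with replacement to fill their share.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    n_pos = max(1, round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    if n_neg < 1:
        raise ValueError("batch must leave room for at least one negative")

    rng.shuffle(neg)
    for start in range(0, len(neg), n_neg):
        # Slice the next chunk of negatives, then oversample positives.
        batch = neg[start:start + n_neg] + rng.choices(pos, k=n_pos)
        rng.shuffle(batch)
        yield batch


# Toy data: 5 active sites among 100 candidate sgRNA-DNA pairs.
labels = [1] * 5 + [0] * 95
batches = list(balanced_batches(labels, batch_size=10))
```

Each batch in this toy run holds as many positives as negatives even though positives make up only 5% of the data. An adaptive variant would presumably adjust `pos_fraction` during training, but the abstract gives no details on that.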