School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel.
Department of Computer Science, Bar-Ilan University, Ramat Gan 5290002, Israel.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae481.
CRISPR/Cas9 technology has been revolutionizing the field of gene editing. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs, so computational methods have been developed to predict the editing efficiency of any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models for this task. However, these high-throughput datasets correlate poorly with functional and endogenous datasets, which are themselves too small to train accurate machine-learning models on.
We developed DeepCRISTL, a deep-learning model to predict editing efficiency in a specific cellular context. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA editing efficiency and then fine-tunes the model on functional or endogenous data to fit a specific cellular context. We tested two state-of-the-art models trained on high-throughput datasets for editing efficiency prediction, our newly improved DeepHF and CRISPRon, combined with various transfer-learning approaches. The combination of CRISPRon and fine-tuning all model weights was the overall best performer. DeepCRISTL outperformed state-of-the-art methods in predicting editing efficiency in a specific cellular context on functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by DeepCRISTL across cellular contexts. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging transfer learning to utilize both high-throughput datasets and smaller, more biologically relevant datasets.
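The pre-train-then-fine-tune idea described above can be illustrated with a toy sketch. This is not the DeepCRISTL implementation: a linear model on one-hot-encoded guides stands in for the deep networks (CRISPRon, DeepHF), all data are synthetic, and every name, dataset size, and hyperparameter below is an assumption made for illustration only. It shows only the mechanics of fine-tuning all weights, starting from weights learned on a large dataset, using a small context-specific dataset.

```python
import random

BASES = "ACGT"
GUIDE_LEN = 20  # assumed guide length for this toy example


def one_hot(seq):
    """Flatten a guide sequence into a one-hot vector, four entries per position."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def predict(weights, bias, x):
    """Linear stand-in for the deep network: bias plus weighted sum of features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train(data, weights, bias, lr, epochs):
    """Run SGD on squared error, updating every weight; returns (weights, bias)."""
    weights = list(weights)  # copy so the caller's weights stay untouched
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, bias, x) - y
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def mse(data, weights, bias):
    """Mean squared error of the model over a dataset."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)


def make_dataset(rng, n, weights, offset):
    """Synthetic (features, efficiency) pairs generated from a hidden linear rule."""
    data = []
    for _ in range(n):
        seq = "".join(rng.choice(BASES) for _ in range(GUIDE_LEN))
        x = one_hot(seq)
        data.append((x, predict(weights, offset, x)))
    return data


rng = random.Random(0)
n_features = GUIDE_LEN * len(BASES)

# Hidden rules (synthetic): the cellular context shares most of the general
# pattern but shifts the baseline and perturbs each feature slightly.
general_w = [rng.gauss(0.0, 0.1) for _ in range(n_features)]
context_w = [w + rng.gauss(0.0, 0.05) for w in general_w]

high_throughput = make_dataset(rng, 400, general_w, 0.0)  # large, generic
context_train = make_dataset(rng, 40, context_w, 0.3)     # small, specific
context_test = make_dataset(rng, 200, context_w, 0.3)     # held out

# Step 1: pre-train on the large dataset, starting from zero weights.
pre_w, pre_b = train(high_throughput, [0.0] * n_features, 0.0, lr=0.01, epochs=20)

# Step 2: fine-tune all weights on the small context-specific dataset,
# starting from the pre-trained weights instead of from zero.
tuned_w, tuned_b = train(context_train, pre_w, pre_b, lr=0.01, epochs=30)

pre_mse = mse(context_test, pre_w, pre_b)
tuned_mse = mse(context_test, tuned_w, tuned_b)
print(f"context test MSE, pre-trained only: {pre_mse:.4f}")
print(f"context test MSE, after fine-tuning: {tuned_mse:.4f}")
```

In this sketch the fine-tuned model is evaluated on held-out sequences from the same synthetic context, mirroring the idea of adapting a general model to a specific cellular context. Other transfer-learning variants, such as updating only part of the model, would correspond to restricting which entries of `weights` the `train` function is allowed to change.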
DeepCRISTL is available via https://github.com/OrensteinLab/DeepCRISTL.