Sari Orhan, Liu Ziying, Pan Youlian, Shao Xiaojian
Department of Mining and Materials Engineering, McGill University, Montreal, QC, H3A 2B1, Canada.
Digital Technologies Research Center, National Research Council Canada, Ottawa, ON, K1A 0R6, Canada.
Bioinform Adv. 2024 Dec 30;5(1):vbae184. doi: 10.1093/bioadv/vbae184. eCollection 2025.
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas9 system is a ground-breaking genome editing tool that has revolutionized cell and gene therapies. An essential factor in its success is the design of an optimal single-guide RNA (sgRNA) with high on-target cleavage efficiency and low off-target effects. This is challenging because many conditions need to be considered, and empirically testing every design is time-consuming and costly. In silico prediction using machine learning models provides a high-performance alternative.
We present CrisprBERT, a deep learning model that predicts the off-target effects of sgRNAs using only the sgRNAs and their paired DNA sequences. It combines a Bidirectional Encoder Representations from Transformers (BERT) architecture, which provides a high-dimensional embedding for paired sgRNA and DNA sequences, with Bidirectional Long Short-Term Memory networks for learning. We propose doublet stack encoding to capture the local energy configuration of Cas9 binding and apply the BERT model to learn contextual embeddings of the doublet pairs. The new model outperformed state-of-the-art deep learning models in single-split and leave-one-sgRNA-out cross-validation as well as in independent testing.
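As an illustration of what a doublet stack encoding might look like, the following is a minimal sketch and not the authors' implementation. It assumes that each sgRNA base is paired with the DNA base at the same position and that adjacent base pairs are stacked into overlapping four-letter tokens, giving a vocabulary of 4^4 = 256 doublets. The function names (`pair_tokens`, `doublet_tokens`, `build_vocab`, `encode`), the token layout, and the id ordering are all hypothetical.

```python
from itertools import product
from typing import Dict, List

BASES = "ACGT"  # assumption: sgRNA is written in the DNA alphabet (U mapped to T)


def pair_tokens(sgrna: str, dna: str) -> List[str]:
    """Pair each sgRNA base with the DNA base at the same position."""
    sgrna = sgrna.upper().replace("U", "T")
    dna = dna.upper()
    if len(sgrna) != len(dna):
        raise ValueError("sgRNA and DNA sequences must have equal length")
    return [s + d for s, d in zip(sgrna, dna)]


def doublet_tokens(sgrna: str, dna: str) -> List[str]:
    """Stack adjacent base pairs into overlapping doublet tokens."""
    pairs = pair_tokens(sgrna, dna)
    return [pairs[i] + pairs[i + 1] for i in range(len(pairs) - 1)]


def build_vocab() -> Dict[str, int]:
    """Assign an integer id to each of the 256 possible doublets."""
    return {"".join(t): i for i, t in enumerate(product(BASES, repeat=4))}


def encode(sgrna: str, dna: str, vocab: Dict[str, int]) -> List[int]:
    """Map a paired sgRNA/DNA sequence to doublet token ids.

    Raises KeyError for bases outside ACGT (e.g. N); handling those
    is left out of this sketch.
    """
    return [vocab[t] for t in doublet_tokens(sgrna, dna)]


if __name__ == "__main__":
    vocab = build_vocab()
    # Toy 4-nt example with one mismatch at position 3 (C paired with T).
    print(doublet_tokens("GACT", "GATT"))
    print(encode("GACT", "GATT", vocab))
```

Under these assumptions, a 23-nt sgRNA/DNA pair yields 22 doublet tokens. Such integer ids are the kind of input a BERT-style embedding layer would consume before a downstream recurrent layer.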
CrisprBERT is available on GitHub: https://github.com/OSsari/CrisprBERT.