School of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou, China.
College of Landscape Architecture, Beijing Forestry University, Beijing, China.
PLoS Comput Biol. 2024 Sep 3;20(9):e1012340. doi: 10.1371/journal.pcbi.1012340. eCollection 2024 Sep.
Off-target activity in the CRISPR-Cas9 system remains a formidable barrier to its broader application and development. Recent work has highlighted the potential of deep learning models for predicting these off-target effects, yet such models face significant hurdles, including imbalanced datasets and the complexity of choosing encoding schemes and model architectures. To address these challenges, our study introduces an Efficiency and Specificity-Based (ESB) class rebalancing strategy, devised specifically for datasets containing mismatches-only off-target instances, a pioneering approach in this area. Furthermore, through a systematic evaluation of various one-hot encoding schemes alongside numerous hybrid neural network models, we find that encodings and models of moderate complexity best balance performance and efficiency. On this foundation, we propose a novel hybrid model, CRISPR-MCA, which uses multi-feature extraction to improve predictive accuracy. The empirical results show that the ESB class rebalancing strategy surpasses five conventional methods in handling extreme dataset imbalance, demonstrating superior efficacy and broader applicability across diverse models. Notably, the CRISPR-MCA model excels in off-target effect prediction across four distinct mismatches-only datasets and significantly outperforms contemporary state-of-the-art models on datasets comprising both mismatches and indels. In summary, the CRISPR-MCA model, coupled with the ESB rebalancing strategy, offers useful insights and a robust framework for future work in this field.
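To make the idea of a one-hot encoding scheme for mismatches-only data concrete, the following is a minimal sketch that encodes a guide and a candidate off-target site of equal length as an 8-channel matrix (4 channels per sequence). It is an illustrative assumption, not the encoding evaluated in the paper: the function names `one_hot`, `encode_pair`, and `mismatch_positions`, the channel order, and the example sequences are all hypothetical.

```python
# Hypothetical sketch of a simple pairwise one-hot encoding.
# This is NOT the scheme used by CRISPR-MCA; it only illustrates the idea.

NUCLEOTIDES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Return one 4-element 0/1 row per base, in A, C, G, T order."""
    rows = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unsupported base: {base!r}")
        rows.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return rows


def encode_pair(guide, target):
    """Concatenate the one-hot rows of guide and target position by position.

    The result has len(guide) rows and 8 columns: columns 0-3 describe the
    guide base, columns 4-7 describe the target base.
    """
    if len(guide) != len(target):
        raise ValueError("mismatches-only encoding needs equal-length sequences")
    return [g + t for g, t in zip(one_hot(guide), one_hot(target))]


def mismatch_positions(guide, target):
    """Return the 0-based positions where guide and target differ."""
    pairs = zip(guide.upper(), target.upper())
    return [i for i, (g, t) in enumerate(pairs) if g != t]


# Illustrative 23-nt sequences (20-nt protospacer plus 3-nt PAM).
guide = "GAGTCCGAGCAGAAGAAGAAGGG"
target = "GAATCCGAGCTGAAGAAGAAGGG"

matrix = encode_pair(guide, target)
print(len(matrix), len(matrix[0]))
print(mismatch_positions(guide, target))
```

Indels cannot be represented this way because the two sequences must align position by position; handling them would need an extra gap symbol or channel, which this sketch deliberately leaves out.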