Hou Yalin, Li Yiming, Zheng Ruiqing, Zhang Fuhao, Guo Fei, Li Min, Zeng Min
School of Computer Science and Engineering, Central South University, Changsha 410083, China.
College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf385.
Accurate prediction of single-guide RNA (sgRNA) activity is crucial for optimizing the CRISPR/Cas9 gene-editing system, as it directly influences the efficiency and accuracy of genome modifications. However, existing prediction methods mainly rely on large-scale experimental data from a single Cas9 variant to build variant-specific sgRNA activity prediction models. This limits their generalization and prediction performance across different Cas9 proteins and variants, as well as their scalability to newly discovered variants.
In this study, we propose PLM-CRISPR, a novel deep learning-based model that leverages protein language models to capture representations of Cas9 proteins and their variants for cross-variant sgRNA activity prediction. PLM-CRISPR uses tailored feature extraction modules for both sgRNA and protein sequences, and combines a cross-variant training strategy with a dynamic feature fusion mechanism to model their interactions. Extensive experiments show that PLM-CRISPR outperforms existing methods on datasets spanning seven Cas9 proteins and variants in three real-world scenarios, including data-scarce settings with few or no samples for novel variants. Comparative analyses with traditional machine learning and deep learning models further confirm the effectiveness of PLM-CRISPR. Additionally, motif analysis reveals that PLM-CRISPR accurately identifies high-activity sgRNA sequence patterns across diverse Cas9 proteins and variants. Overall, PLM-CRISPR provides a robust, scalable, and generalizable solution for cross-variant sgRNA activity prediction.
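The data-scarce evaluation settings mentioned above (few or no samples for a novel variant) can be illustrated with a minimal data-splitting sketch. This is not the authors' code: the function name `split_for_novel_variant`, the record layout, the placeholder variant names, and the toy sequences and activity values are all assumptions made for illustration only.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (variant name, sgRNA sequence, measured activity).
Record = Tuple[str, str, float]


def split_for_novel_variant(
    records: List[Record], held_out: str, n_shots: int = 0, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Hold out one variant as the 'novel' one.

    n_shots = 0 gives a zero-shot split (no held-out samples in training);
    n_shots > 0 moves that many held-out samples into training (few-shot).
    """
    rng = random.Random(seed)
    source = [r for r in records if r[0] != held_out]
    target = [r for r in records if r[0] == held_out]
    rng.shuffle(target)  # pick the few-shot samples at random, reproducibly
    n_shots = min(n_shots, len(target))
    train = source + target[:n_shots]
    test = target[n_shots:]
    return train, test


# Toy data with placeholder variant names; values are made up, not from the paper.
toy_records: List[Record] = [
    ("variant_A", "ACGTACGTACGTACGTACGT", 0.81),
    ("variant_A", "TTGACCGATAGGCTAACGTA", 0.35),
    ("variant_B", "GGCATCGATCGGATTACGCA", 0.52),
    ("variant_B", "CATGCTAGCTAGGATCCATG", 0.67),
    ("variant_C", "AGCTTGCATGCCTGCAGGTC", 0.44),
    ("variant_C", "GATCCTCTAGAGTCGACCTG", 0.73),
    ("variant_C", "TGCAGGCATGCAAGCTTGGC", 0.29),
]

zero_train, zero_test = split_for_novel_variant(toy_records, "variant_C", n_shots=0)
few_train, few_test = split_for_novel_variant(toy_records, "variant_C", n_shots=1)
```

In the zero-shot split the model never sees the held-out variant during training, so any predictive signal for it has to come from the protein representation and from what was learned on the other variants. The few-shot split adds a small number of held-out samples to training.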
The source code can be obtained from https://github.com/CSUBioGroup/PLM-CRISPR.