Leveraging protein language models for cross-variant CRISPR/Cas9 sgRNA activity prediction.

Authors

Hou Yalin, Li Yiming, Zheng Ruiqing, Zhang Fuhao, Guo Fei, Li Min, Zeng Min

Affiliations

School of Computer Science and Engineering, Central South University, Changsha 410083, China.

College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China.

Publication information

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf385.

DOI:10.1093/bioinformatics/btaf385

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12254127/

Abstract

MOTIVATION

Accurate prediction of single-guide RNA (sgRNA) activity is crucial for optimizing the CRISPR/Cas9 gene-editing system, as it directly influences the efficiency and accuracy of genome modifications. However, existing prediction methods mainly rely on large-scale experimental data of a single Cas9 variant to construct Cas9 protein (variants)-specific sgRNA activity prediction models, which limits their generalization ability and prediction performance across different Cas9 protein (variants), as well as their scalability to the continuously discovered new variants.

RESULTS

In this study, we proposed PLM-CRISPR, a novel deep learning-based model that leverages protein language models to capture Cas9 protein (variants) representations for cross-variant sgRNA activity prediction. PLM-CRISPR uses tailored feature extraction modules for both sgRNA and protein sequences, incorporating a cross-variant training strategy and a dynamic feature fusion mechanism to effectively model their interactions. Extensive experiments demonstrate that PLM-CRISPR outperforms existing methods across datasets spanning seven Cas9 protein (variants) in three real-world scenarios, demonstrating its superior performance in handling data-scarce situations, including cases with few or no samples for novel variants. Comparative analyses with traditional machine learning and deep learning models further confirm the effectiveness of PLM-CRISPR. Additionally, motif analysis reveals that PLM-CRISPR accurately identifies high-activity sgRNA sequence patterns across diverse Cas9 protein (variants). Overall, PLM-CRISPR provides a robust, scalable, and generalizable solution for sgRNA activity prediction across diverse Cas9 protein (variants).

AVAILABILITY AND IMPLEMENTATION

The source code can be obtained from https://github.com/CSUBioGroup/PLM-CRISPR.
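
To make the idea of fusing sgRNA features with a Cas9 variant representation concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the authors' implementation, which is available in the repository above. Everything here is an assumption made for illustration: the function names `one_hot_sgrna` and `gated_fusion`, the example guide sequence, the random projection weights, and the random matrix that stands in for per-residue protein language model embeddings. The sigmoid gate is one simple way an input-dependent ("dynamic") fusion could look; the abstract does not specify the actual mechanism.

```python
import numpy as np

NUCLEOTIDES = "ACGT"


def one_hot_sgrna(seq: str) -> np.ndarray:
    """Encode a DNA-alphabet guide sequence as a (length, 4) one-hot matrix."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)))
    for pos, base in enumerate(seq.upper()):
        encoded[pos, index[base]] = 1.0
    return encoded


def gated_fusion(sg_vec, prot_vec, w_gate, b_gate):
    """Blend two equal-length feature vectors with an input-dependent sigmoid gate.

    Hypothetical stand-in for a dynamic fusion step: the gate is computed from
    both inputs, so the mixing weights change with each sgRNA/protein pair.
    """
    joint = np.concatenate([sg_vec, prot_vec])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ joint + b_gate)))
    return gate * sg_vec + (1.0 - gate) * prot_vec


rng = np.random.default_rng(0)
dim = 8  # illustrative feature size, not taken from the paper

# Hypothetical 23-nt target: 20-nt protospacer followed by an NGG PAM.
guide = "GACGCATAAAGATGAGACGCTGG"
sg_onehot = one_hot_sgrna(guide)  # shape (23, 4)

# Placeholder linear projection of the flattened one-hot matrix.
w_sg = rng.normal(size=(dim, sg_onehot.size))
sg_vec = np.tanh(w_sg @ sg_onehot.ravel())

# Random matrix standing in for per-residue embeddings from a protein
# language model (1368 rows, the length of wild-type SpCas9).
residue_embeddings = rng.normal(size=(1368, dim))
prot_vec = residue_embeddings.mean(axis=0)  # mean-pool to one vector

w_gate = rng.normal(size=(dim, 2 * dim))
b_gate = np.zeros(dim)
fused = gated_fusion(sg_vec, prot_vec, w_gate, b_gate)
print(fused.shape)
```

In a trained model the projection and gate weights would be learned, and the fused vector would feed a regression head that outputs the predicted activity. Because the protein vector is an input rather than a fixed part of the model, the same network could in principle be queried with the embedding of a variant that was not seen during training.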
