Department of Statistics and Data Science, Northwestern University, Evanston, IL, USA.
Department of Molecular BioSciences, Northwestern University, Evanston, IL, USA.
BMC Bioinformatics. 2022 Oct 26;23(1):446. doi: 10.1186/s12859-022-04998-z.
In the CRISPR-Cas9 system, the efficiency of genetic modification has been found to vary with the single guide RNA (sgRNA) used. Several sgRNA properties are predictive of CRISPR cleavage efficiency, including position-specific sequence composition, global sequence properties, and thermodynamic features. Although existing deep learning-based approaches achieve competitive prediction accuracy, a more interpretable model is desirable to help understand how different features may contribute to CRISPR-Cas9 cleavage efficiency.
We propose a gradient boosting approach that uses LightGBM to build an integrated tool, BoostMEC (Boosting Model for Efficient CRISPR), for predicting wild-type CRISPR-Cas9 editing efficiency. We benchmark BoostMEC against 10 popular models on 13 external datasets and show that it performs competitively.
BoostMEC can provide state-of-the-art predictions of CRISPR-Cas9 cleavage efficiency for sgRNA design and selection. Because it relies on direct and derived features of sgRNA sequences and on conventional machine learning, BoostMEC retains an advantage over state-of-the-art CRISPR efficiency prediction models based on deep learning: it can produce more interpretable feature insights and predictions.
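As an illustration of what "direct and derived sequence features" can look like in practice, the following minimal sketch encodes an sgRNA as position-specific one-hot indicators plus one global property (GC content). The function name `encode_sgrna`, the feature names, and the example sequence are hypothetical, chosen for illustration only; this is not the BoostMEC feature set, and thermodynamic features are omitted.

```python
from typing import Dict

BASES = "ACGT"


def encode_sgrna(seq: str) -> Dict[str, float]:
    """Encode an sgRNA sequence as named numeric features (illustrative only)."""
    seq = seq.upper()
    if not seq or any(base not in BASES for base in seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")

    features: Dict[str, float] = {}

    # Direct, position-specific features: one indicator per (position, base).
    for position, base in enumerate(seq, start=1):
        for candidate in BASES:
            features[f"pos{position}_{candidate}"] = 1.0 if base == candidate else 0.0

    # Derived, global feature: fraction of G and C bases in the whole sequence.
    gc_count = sum(1 for base in seq if base in "GC")
    features["gc_content"] = gc_count / len(seq)

    return features


# Hypothetical 20-nt guide used only to exercise the function.
example = encode_sgrna("GACGTTAGCCATGGATCCAA")
```

Because every feature carries a readable name, a tree-based model trained on such a table can report importance per position and base, which is the kind of interpretability the abstract contrasts with deep learning models.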