Menon A Vipin, Sohn Jang-Il, Nam Jin-Wu
Department of Life Science, College of Natural Sciences, Hanyang University, Seoul 04763, Republic of Korea.
Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul 04763, Republic of Korea.
Comput Struct Biotechnol J. 2020 Mar 25;18:814-820. doi: 10.1016/j.csbj.2020.03.020. eCollection 2020.
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-Cas systems, including dead Cas9 (dCas9), Cas9, and Cas12a, have revolutionized genome engineering in mammalian somatic cells. Computational tools that assess the target sites of CRISPR-Cas systems are essential for designing efficient guide RNAs (gRNAs), but existing tools generalize poorly when selecting features and do not comprehensively provide optimal results. Here, we introduce the Comprehensive Guide Designer (CGD) for four different CRISPR systems, which uses a machine learning algorithm, Elastic Net Logistic Regression (ENLOR), to generalize the models automatically. CGD contains system-specific models trained in an unbiased manner on public datasets generated with CRISPRi, CRISPRa, CRISPR-Cas9, and CRISPR-Cas12a (designated CGDi, CGDa, CGD9, and CGD12a, respectively). The trained CGD models were benchmarked against other regression-based machine learning models with inbuilt feature selection, namely Elastic Net Linear Regression (ENLR), Random Forest and Boruta (RFB), and Extreme Gradient Boosting (XGBoost). Evaluation on independent test datasets showed that the CGD models outperformed pre-existing methods in predicting gRNA efficacy. All CGD source code and datasets are available on GitHub (https://github.com/vipinmenon1989/CGD), and the CGD web server can be accessed at http://big.hanyang.ac.kr:2195/CGD.
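To illustrate what elastic net logistic regression does in this setting, the following is a minimal sketch, not the CGD implementation. It assumes position-specific one-hot encoding of 20-nt guide sequences as the only features, uses randomly generated sequences with a made-up labelling rule (a guide counts as "efficient" when its last base is G), and fits the model by proximal gradient descent in plain Python. All function names, the penalty strength `alpha`, the mixing parameter `l1_ratio`, and the learning rate are hypothetical choices for illustration; the feature set and training procedure actually used by CGD are in the linked repository.

```python
import math
import random

BASES = "ACGT"
GUIDE_LEN = 20  # assumed guide length for this toy example


def one_hot(seq):
    """Position-specific one-hot encoding: four binary features per base."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def predict_proba(w, b, x):
    """Predicted probability that a guide is in the 'efficient' class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_enlor(X, y, alpha=0.01, l1_ratio=0.5, lr=0.5, epochs=200):
    """Elastic net logistic regression fitted by proximal gradient descent.

    Penalty: alpha * (l1_ratio * |w|_1 + 0.5 * (1 - l1_ratio) * |w|_2^2).
    The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the log-loss
            grad_b += err
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj
        for j in range(p):
            # Gradient step on log-loss plus L2 term, then L1 shrinkage.
            step = w[j] - lr * (grad_w[j] / n + alpha * (1 - l1_ratio) * w[j])
            w[j] = soft_threshold(step, lr * alpha * l1_ratio)
        b -= lr * grad_b / n
    return w, b


# Synthetic data only: random guides and a made-up rule, not real screen data.
rng = random.Random(0)
guides = ["".join(rng.choice(BASES) for _ in range(GUIDE_LEN)) for _ in range(300)]
X = [one_hot(g) for g in guides]
y = [1 if g[-1] == "G" else 0 for g in guides]  # hypothetical toy label

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

w, b = fit_enlor(X_train, y_train)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X_test]
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
n_zero = sum(1 for wj in w if wj == 0.0)

print(f"held-out accuracy: {accuracy:.2f}")
print(f"features shrunk exactly to zero: {n_zero} of {len(w)}")
```

The L1 part of the penalty drives uninformative weights to exactly zero, which acts as built-in feature selection, while the L2 part keeps correlated features stable. For real data, a maintained library implementation would normally be preferable to this hand-written loop.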