Das Jutan, Kumar Sanjeev, Mishra Dwijesh Chandra, Chaturvedi Krishna Kumar, Paul Ranjit Kumar, Kairi Amit
ICAR-Indian Agricultural Research Institute, New Delhi, India.
ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
Front Genet. 2023 Jan 9;13:1085332. doi: 10.3389/fgene.2022.1085332. eCollection 2022.
The CRISPR-Cas9 system is one of the most widely used recent genome editing techniques. Although it can precisely alter the target genes and genomic regions complementary to the designed guide RNA (sgRNA), off-target effects still occur. Machine learning algorithms addressing this problem already exist for humans, animals, and a few plant species. In this paper, we develop models based on three machine learning techniques, namely artificial neural networks (ANN), support vector machines (SVM), and random forests (RF), to predict the CRISPR-Cas9 cleavage sites that a particular sgRNA will cleave. All of the models were built exclusively from plant data. An on-target and off-target dataset gathered from various plant species was split so that 70% was used to train the models and the remaining 30% was used to evaluate their performance with several metrics: specificity, sensitivity, accuracy, precision, F1 score, F2 score, and AUC. Eleven models were developed in all: six ANN-based models (ANN1-Logistic, ANN1-Tanh, ANN1-ReLU, ANN2-Logistic, ANN2-Tanh, and ANN2-ReLU), four SVM-based models (SVM-Linear, SVM-Polynomial, SVM-Gaussian, and SVM-Sigmoid), and one random forest model. Comparative analysis suggests that the random forest model performs best overall, with an accuracy of 96.27% and an AUC of 99.21%. Among the ANN-based and SVM-based models, ANN1-ReLU and SVM-Linear performed best, respectively.
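As a rough illustration of the label-based evaluation metrics named in the abstract, the following Python sketch computes specificity, sensitivity, accuracy, precision, F1, and F2 from true and predicted binary labels. It is not the authors' code: the function names (`confusion_counts`, `f_beta`, `evaluate`) and the convention that 1 means "cleaved" and 0 means "not cleaved" are assumptions made for this example. AUC is left out because it requires predicted scores rather than hard labels.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels; 1 = cleaved (assumed convention)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta=1 gives F1, beta=2 weights recall more heavily (F2)."""
    b2 = beta * beta
    return safe_div((1 + b2) * precision * recall, b2 * precision + recall)


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute the label-based metrics listed in the abstract."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = safe_div(tp, tp + fn)  # recall / true positive rate
    specificity = safe_div(tn, tn + fp)  # true negative rate
    precision = safe_div(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "f1": f_beta(precision, sensitivity, 1.0),
        "f2": f_beta(precision, sensitivity, 2.0),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    for name, value in evaluate(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Because F2 weights sensitivity more heavily than precision, it penalizes a model more for missing a true cleavage site than for raising a false alarm.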