Kaur Karambir, Gupta Amit Kumar, Rajput Akanksha, Kumar Manoj
Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Sector 39A, Chandigarh-160036, India.
Sci Rep. 2016 Sep 1;6:30870. doi: 10.1038/srep30870.
Genome editing guided by sgRNA, a component of the CRISPR/Cas system, has emerged in recent years as a preferred genome editing technology. However, the activity and stability of an sgRNA in genome targeting are strongly influenced by its sequence features. A few prediction tools have been developed to design effective sgRNAs, but these methods have their own limitations. We therefore developed "ge-CRISPR", which uses high-throughput data to predict and analyze the genome editing efficiency of sgRNAs. Predictive models were built with support vector machines (SVM) for pipeline-1 (classification) and pipeline-2 (regression), using 2090 and 4139 experimentally verified sgRNAs, respectively, from Homo sapiens, Mus musculus, Danio rerio and Xenopus tropicalis. In 10-fold cross-validation, pipeline-1 achieved an accuracy of 87.70% and a Matthews correlation coefficient of 0.75 on the training dataset (T(1840)), and it performed equally well on the independent dataset (V(250)). In pipeline-2, the best models attained Pearson correlation coefficients of 0.68 and 0.69 on the training (T(3169)) and independent (V(520)) datasets, respectively. For a given genomic region, ge-CRISPR (http://bioinfo.imtech.res.in/manojk/gecrispr/) identifies potent sgRNAs, their qualitative and quantitative efficiencies, and potential off-targets. It should be useful to the scientific community engaged in CRISPR research and therapeutics development.
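The three metrics reported above (accuracy and Matthews correlation coefficient for the classification pipeline, Pearson correlation coefficient for the regression pipeline) can be illustrated with a minimal Python sketch. This is not the ge-CRISPR code: the function names and the toy labels and scores are hypothetical, chosen only to show how each metric is computed from predictions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = effective sgRNA, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that match the true label."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def pearson_cc(x, y):
    """Pearson correlation between observed and predicted efficiencies."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up for illustration, not taken from the study).
true_labels = [1, 1, 1, 0, 0, 0]
pred_labels = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(true_labels, pred_labels)
print("accuracy:", accuracy(*counts))
print("MCC:", matthews_cc(*counts))

observed = [0.10, 0.40, 0.55, 0.80]   # hypothetical measured efficiencies
predicted = [0.20, 0.35, 0.60, 0.70]  # hypothetical model outputs
print("Pearson r:", pearson_cc(observed, predicted))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes are unevenly represented.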