Wilson Laurence O W, Reti Daniel, O'Brien Aidan R, Dunne Robert A, Bauer Denis C
1 Health and Biosecurity, CSIRO, Sydney, Australia.
2 Faculty of Engineering, UNSW, Sydney, Australia.
CRISPR J. 2018 Apr;1:182-190. doi: 10.1089/crispr.2017.0021.
The activity of CRISPR-Cas9 target sites can be measured experimentally through phenotypic assays or mutation rates, and these measurements can be used to build computational models that predict the activity of novel target sites. However, currently published models have been reported to perform poorly under conditions other than those they were trained on. In this study, we therefore investigate how different sources of data influence predictive power and identify the data set that yields the most robust predictive model. We use the activity of 28,606 target sites and a machine learning approach to train a predictive model of CRISPR-Cas9 activity, which outperforms other published methods by an average increase in accuracy of 80% for predicting the degree of activity and 13% for classifying sites as active or inactive. We find that data sets that measure CRISPR-Cas9 activity through sequencing yield more accurate predictions of activity. Our model, dubbed TUSCAN, is highly scalable, predicting the activity of 5000 target sites in under 7 s, making it suitable for genome-wide screens. We conclude that sophisticated machine learning methods can classify binary CRISPR-Cas9 activity; however, predicting fine-scale activity scores will require larger data sets that directly measure indel (insertion/deletion) rates.
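As an illustration of how target sites and measured activities might be prepared for the two tasks named in the abstract (predicting a degree of activity and classifying sites as active or inactive), the following is a minimal sketch. It is not the TUSCAN feature set or pipeline: the position-specific one-hot encoding, the GC-content feature, the 23-nt site layout (20-nt protospacer plus PAM), the example sequence, and the 0.5 activity threshold are all assumptions chosen for illustration.

```python
from typing import List

NUCLEOTIDES = "ACGT"


def one_hot_encode(site: str) -> List[int]:
    """Encode each base as four binary features, in A, C, G, T order."""
    features: List[int] = []
    for base in site.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base {base!r}")
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def gc_content(site: str) -> float:
    """Fraction of bases in the site that are G or C."""
    site = site.upper()
    return sum(base in "GC" for base in site) / len(site)


def featurize(site: str) -> List[float]:
    """Hypothetical feature vector: one-hot bases followed by GC content."""
    return one_hot_encode(site) + [gc_content(site)]


def binarize(scores: List[float], threshold: float) -> List[int]:
    """Turn continuous activity scores into active (1) / inactive (0) labels.

    The threshold is an assumption; the abstract does not state one.
    """
    return [1 if s >= threshold else 0 for s in scores]


# Made-up 23-nt site: 20-nt protospacer followed by a TGG PAM.
example_site = "GACGTTACCGGATTACGCTA" + "TGG"
example_features = featurize(example_site)
example_labels = binarize([0.1, 0.5, 0.9], threshold=0.5)
```

The continuous scores would serve as regression targets and the binarized labels as classification targets; either could then be passed, together with the feature vectors, to any standard supervised learner.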