Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.
Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.
Biomolecules. 2022 Aug 16;12(8):1123. doi: 10.3390/biom12081123.
The growing popularity of the reprogrammable CRISPR/Cas9 genome editing tool is hindered by unwanted off-target effects. Efforts have been directed toward designing efficient guide RNAs (gRNAs) and identifying potential off-target threats, yet the factors that determine efficiency and off-target activity remain obscure. Previous machine learning models based on sequence features performed poorly on new datasets, so novel features need to be incorporated. Estimating the binding energy of the gRNA-DNA hybrid and of the Cas9-gRNA-DNA hybrid enabled better-performing machine learning models for predicting Cas9 activity. An analysis of feature contributions to the model output on a limited dataset indicated that energy features, along with sequence features, played a determining role. The binding energy features proved essential for predicting on-target activity and off-target sites. The performance plateau of current machine learning models on unseen datasets could be overcome by incorporating novel features such as binding energy. The models are provided on GitHub (GitHub Inc., San Francisco, CA, USA).
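The idea of combining sequence features with an energy feature can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`one_hot`, `toy_binding_energy`, `feature_vector`) are hypothetical, and the per-base-pair weights are placeholders chosen for illustration, not measured thermodynamic parameters or the energy model used in the paper. The sketch only shows how a single energy value might be appended to a one-hot sequence encoding before it is passed to a model.

```python
# Illustrative only: the weights below are placeholders in arbitrary units,
# not measured thermodynamic parameters and not the paper's energy model.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PAIR_WEIGHT = {"A": -2.0, "T": -2.0, "C": -3.0, "G": -3.0}  # hypothetical


def one_hot(seq):
    """Flatten a DNA-alphabet sequence into a 4-per-position one-hot list."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def toy_binding_energy(guide, target_strand):
    """Sum placeholder weights over positions where the guide pairs with the target.

    `guide` is written 5'->3' in the DNA alphabet; `target_strand` is the
    strand the guide hybridizes to, written 3'->5' so that positions align.
    Mismatched positions contribute 0.
    """
    if len(guide) != len(target_strand):
        raise ValueError("guide and target must have equal length")
    return sum(
        PAIR_WEIGHT[g]
        for g, t in zip(guide, target_strand)
        if COMPLEMENT[g] == t
    )


def feature_vector(guide, target_strand):
    """Concatenate the sequence features with the single toy energy feature."""
    return one_hot(guide) + [toy_binding_energy(guide, target_strand)]
```

Under these assumptions, a perfectly matched target yields a more negative toy energy than a target with mismatches, so the appended feature gives a model a signal that the one-hot encoding of the guide alone does not carry.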