Department of Endocrinology & Metabolism, Shanghai Tenth People's Hospital; Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 20009, China.
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China.
Brief Bioinform. 2020 Jul 15;21(4):1448-1454. doi: 10.1093/bib/bbz069.
For genome-wide prediction of CRISPR off-target cleavage sites (OTS), an important issue is data imbalance: the number of true OTS recognized by whole-genome off-target detection techniques is much smaller than the number of all possible nucleotide mismatch loci, which makes training machine learning models very challenging. Computational models proposed for OTS prediction and scoring should therefore be carefully designed and properly evaluated in order to avoid bias. In our study, we take two tools as examples to further emphasize the data imbalance issue in CRISPR off-target prediction, with the aim of achieving better sensitivity and specificity for optimized CRISPR gene editing. We point out that (1) benchmarks of CRISPR off-target prediction should be evaluated with the data imbalance issue taken into account, so that performance is not overestimated; and (2) incorporating efficient computational techniques (including ensemble learning and data synthesis techniques) can help to address the data imbalance issue and improve the performance of CRISPR off-target prediction. Taken together, we call for more efforts to address the data imbalance issue in CRISPR off-target prediction to facilitate the clinical utility of CRISPR-based gene editing techniques.
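The evaluation concern raised in point (1) can be illustrated with a minimal sketch. All counts below are hypothetical toy values chosen for illustration, not data from the study, and the helper names (`confusion_counts`, `summarize`) are assumptions of this example. The sketch shows how, when true OTS are rare, a predictor that never reports an off-target site can still score higher on plain accuracy than a predictor that actually finds most of them.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = true OTS."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall (sensitivity) and F1."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy set: 10,000 candidate mismatch loci, only 50 true OTS.
N_SITES = 10_000
N_TRUE_OTS = 50
y_true = [1] * N_TRUE_OTS + [0] * (N_SITES - N_TRUE_OTS)

# Trivial predictor: labels every locus as "not an off-target site".
always_negative = [0] * N_SITES

# Hypothetical model: recovers 40 of the 50 true OTS, raises 160 false alarms.
selective = ([1] * 40 + [0] * 10
             + [1] * 160 + [0] * (N_SITES - N_TRUE_OTS - 160))

baseline = summarize(y_true, always_negative)
model = summarize(y_true, selective)

for name, scores in (("always-negative", baseline), ("selective", model)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy setting the always-negative predictor has the higher accuracy even though its recall is zero, which is why imbalance-aware metrics such as precision, recall and F1 are more informative than accuracy alone when benchmarking OTS predictors.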